Index: head/UPDATING
===================================================================
--- head/UPDATING	(revision 222812)
+++ head/UPDATING	(revision 222813)
@@ -1,1431 +1,1435 @@
 Updating Information for FreeBSD current users
 
 This file is maintained and copyrighted by M. Warner Losh <imp@freebsd.org>.
 See end of file for further details.  For commonly done items, please see the
 COMMON ITEMS: section later in the file.  These instructions assume that you
 basically know what you are doing.  If not, then please consult the FreeBSD
 handbook.
 
 Items affecting the ports and packages system can be found in
 /usr/ports/UPDATING.  Please read that file before running portupgrade.
 
 NOTE TO PEOPLE WHO THINK THAT FreeBSD 9.x IS SLOW:
 	FreeBSD 9.x has many debugging features turned on, in both the kernel
 	and userland.  These features attempt to detect incorrect use of
 	system primitives, and encourage loud failure through extra sanity
 	checking and fail stop semantics.  They also substantially impact
 	system performance.  If you want to do performance measurement,
 	benchmarking, and optimization, you'll want to turn them off.  This
 	includes various WITNESS-related kernel options, INVARIANTS, malloc
 	debugging flags in userland, and various verbose features in the
 	kernel.  Many developers choose to disable these features on build
 	machines to maximize performance.  (To disable malloc debugging, run
 	ln -s aj /etc/malloc.conf.)
 
+20110607:
+	cpumask_t type is retired and cpuset_t is used in order to describe
+	a mask of CPUs.
+
 20110513:
 	Support for the sun4v architecture has been officially dropped.
 
 20110430:
 	Users of the Atheros AR71xx SoC code now need to add 'device ar71xx_pci'
 	into their kernel configurations along with 'device pci'.
 
 20110427:
 	The default NFS client is now the new NFS client, so fstype "newnfs"
 	is now "nfs" and the regular/old NFS client is now fstype "oldnfs".
 	Although mounts via fstype "nfs" will usually work without userland
 	changes, it is recommended that the mount(8) and mount_nfs(8)
 	commands be rebuilt from sources and that a link to mount_nfs called
 	mount_oldnfs be created. The new client is compiled into the
 	kernel with "options NFSCL" and this is needed for diskless root
 	file systems. The GENERIC kernel configs have been changed to use
 	NFSCL and NFSD (the new server) instead of NFSCLIENT and NFSSERVER.
 	To use the regular/old client, you can "mount -t oldnfs ...". For
 	a diskless root file system, you must also include a line like:
 	
 	vfs.root.mountfrom="oldnfs:"
 
 	in the boot/loader.conf on the root fs on the NFS server to make
 	a diskless root fs use the old client.
 
 20110424:
 	The GENERIC kernels for all architectures now default to the new
 	CAM-based ATA stack. It means that all legacy ATA drivers were
 	removed and replaced by respective CAM drivers. If you are using
 	ATA device names in /etc/fstab or other places, make sure to update
 	them respectively (adX -> adaY, acdX -> cdY, afdX -> daY, astX -> saY,
 	where 'Y's are the sequential numbers starting from zero for each type
 	in order of detection, unless configured otherwise with tunables,
 	see cam(4)). There will be symbolic links created in /dev/ to map
 	old adX devices to the respective adaY. They should provide basic
 	compatibility for file systems mounting in most cases, but they do
 	not support old user-level APIs and do not have respective providers
 	in GEOM. Consider using updated management tools with new device names.
 
 	It is possible to load devices ahci, ata, siis and mvs as modules,
 	but option ATA_CAM should remain in kernel configuration to make ata
 	module work as CAM driver supporting legacy ATA controllers. Device ata
 	still can be used in modular fashion (atacore + ...). Modules atadisk
 	and atapi* are not used and won't affect operation in ATA_CAM mode.
 	Note that to use CAM-based ATA, the kernel should include CAM devices
 	scbus, pass, da (or explicitly ada), cd and optionally others. All of
 	them are parts of the cam module.
 
 	ataraid(4) functionality is now supported by the RAID GEOM class.
 	To use it you can load geom_raid kernel module and use graid(8) tool
 	for management. Instead of /dev/arX device names, use /dev/raid/rX.
 
 	No kernel config options or code have been removed, so if a problem
 	arises, please report it and optionally revert to the old ATA stack.
 	In order to do it you can remove from the kernel config:
 	    options        ATA_CAM
 	    device         ahci
 	    device         mvs
 	    device         siis
 	, and instead add back:
 	    device         atadisk         # ATA disk drives
 	    device         ataraid         # ATA RAID drives
 	    device         atapicd         # ATAPI CDROM drives
 	    device         atapifd         # ATAPI floppy drives
 	    device         atapist         # ATAPI tape drives
 
 20110423:
 	The default NFS server has been changed to the new server, which
 	was referred to as the experimental server. If you need to switch
 	back to the old NFS server, you must now put the "-o" option on
 	both the mountd and nfsd commands. This can be done using the
 	mountd_flags and nfs_server_flags rc.conf variables until an
 	update to the rc scripts is committed, which is coming soon.
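 
 	A minimal sketch of the rc.conf settings this implies, assuming
 	/etc/rc.conf is read after /etc/defaults/rc.conf so that any flags
 	already in effect are kept.  The variable expansion is an illustration
 	only and is not taken from the rc scripts:
 
 ```shell
 # /etc/rc.conf fragment (sh syntax).
 # Append -o (old NFS server) to the flags that are already set, if any.
 mountd_flags="${mountd_flags:-} -o"
 nfs_server_flags="${nfs_server_flags:-} -o"
 ```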
 
 20110418:
 	The GNU Objective-C runtime library (libobjc), and other Objective-C
 	related components have been removed from the base system.  If you
 	require an Objective-C library, please use one of the available ports.
 
 20110331:
 	ath(4) has been split into bus and device modules. if_ath contains
 	the HAL, the TX rate control and the network device code. if_ath_pci
 	contains the PCI bus glue. For Atheros MIPS embedded systems, if_ath_ahb
 	contains the AHB glue. Users need to load both if_ath_pci and if_ath
 	in order to use ath on everything else.
 
 	TO REPEAT: if_ath_ahb is not needed for normal users. Normal users only
 	need to load if_ath and if_ath_pci for ath(4) operation.
 
 20110314:
 	As part of the replacement of sysinstall, the process of building
 	release media has changed significantly. For details, please re-read
 	release(7), which has been updated to reflect the new build process.
 
 20110218:
 	GNU binutils 2.17.50 (as of 2007-07-03) has been merged to -HEAD.  This
 	is the last available version under GPLv2.  It brings a number of new
 	features, such as support for newer x86 CPUs (with SSE-3, SSSE-3, SSE
 	4.1 and SSE 4.2), better support for powerpc64, a number of new
 	directives, and lots of other small improvements.  See the ChangeLog
 	file in contrib/binutils for the full details.
 
 20110218:
 	IPsec's HMAC_SHA256-512 support has been fixed to be RFC4868
 	compliant, and will now use half of hash for authentication.
 	This will break interoperability with all stacks (including all
 	existing FreeBSD versions) that implement
 	draft-ietf-ipsec-ciph-sha-256-00 (they use 96 bits of hash for
 	authentication).
 	The only workaround with such peers is to use another HMAC
 	algorithm for IPsec ("phase 2") authentication.
 
 20110207:
 	Remove the uio_yield prototype and symbol.  This function has
 	been misnamed since it was introduced and should not be
 	globally exposed with this name.  The equivalent functionality
 	is now available using kern_yield(curthread->td_user_pri).
 	The function remains undocumented.
 
 20110112:
 	A SYSCTL_[ADD_]UQUAD was added for uint64_t pointers,
 	symmetric with the existing SYSCTL_[ADD_]QUAD.  Type checking
 	for scalar sysctls is defined but disabled.  Code that needs
 	UQUAD to pass the type checking, but must also compile on older
 	systems where the define is not present, can check against
 	__FreeBSD_version >= 900030.
 
 	The system dialog(1) has been replaced with a new version previously
 	in ports as devel/cdialog. dialog(1) is mostly command-line compatible
 	with the previous version, but the libdialog associated with it has
 	a largely incompatible API. As such, the original version of libdialog
 	will be kept temporarily as libodialog, until its base system consumers
 	are replaced or updated. Bump __FreeBSD_version to 900030.
 
 20110103:
 	If you are trying to run make universe on a -stable system, and you get
 	the following warning:
 	"Makefile", line 356: "Target architecture for i386/conf/GENERIC 
 	unknown.  config(8) likely too old."
 	or something similar to it, then you must upgrade your -stable system
 	to 8.2-Release or newer (really, any time after r210146 7/15/2010 in
 	stable/8) or build the config from the latest stable/8 branch and
 	install it on your system.
 
 	Prior to this date, building a current universe on 8-stable system from
 	between 7/15/2010 and 1/2/2011 would result in a weird shell parsing
 	error in the first kernel build phase.  A new config on those old 
 	systems will fix that problem for older versions of -current.
 
 20101228:
 	The TCP stack has been modified to allow Khelp modules to interact with
 	it via helper hook points and store per-connection data in the TCP
 	control block. Bump __FreeBSD_version to 900029. User space tools that
 	rely on the size of struct tcpcb in tcp_var.h (e.g. sockstat) need to
 	be recompiled.
 
 20101114:
 	Generic IEEE 802.3 annex 31B full duplex flow control support has been
 	added to mii(4) and bge(4), bce(4), msk(4), nfe(4) and stge(4) along
 	with brgphy(4), e1000phy(4) as well as ip1000phy(4) have been converted
 	to take advantage of it instead of using custom implementations.  This
 	means that these drivers now no longer unconditionally advertise
 	support for flow control but only do so if flow control is a selected
 	media option.  This was implemented in the generic support that way in
 	order to allow flow control to be switched on and off via ifconfig(8)
 	with the PHY specific default to typically off in order to protect
 	from unwanted effects.  Consequently, if you used flow control with
 	one of the above mentioned drivers you now need to explicitly enable
 	it, for example via:
 		ifconfig bge0 media auto mediaopt flowcontrol
 
 	Along with the above mentioned changes generic support for setting
 	1000baseT master mode also has been added and brgphy(4), ciphy(4),
 	e1000phy(4) as well as ip1000phy(4) have been converted to take
 	advantage of it.  This means that these drivers now no longer take the
 	link0 parameter for selecting master mode but the master media option
 	has to be used instead, for example like in the following:
 		ifconfig bge0 media 1000baseT mediaopt full-duplex,master
 
 	Selection of master mode now is also available with all other PHY
 	drivers supporting 1000baseT.
 
 20101111:
 	The TCP stack has received a significant update to add support for
 	modularised congestion control and generally improve the clarity of
 	congestion control decisions. Bump __FreeBSD_version to 900025. User
 	space tools that rely on the size of struct tcpcb in tcp_var.h (e.g.
 	sockstat) need to be recompiled.
 
 20101002:
 	The man(1) utility has been replaced by a new version that no longer
 	uses /etc/manpath.config. Please consult man.conf(5) for how to
 	migrate local entries to the new format.
 
 20100928:
 	The copyright strings printed by login(1) and sshd(8) at the time of a
 	new connection have been removed to follow other operating systems and
 	upstream sshd.
 
 20100915:
 	A workaround for a fixed ld bug has been removed in kernel code,
 	so make sure that your system ld is built from sources after
 	revision 210245 from 2010-07-19 (r211583 if building head kernel
 	on stable/8, r211584 for stable/7; both from 2010-08-21).
 	A symptom of incorrect ld version is different addresses for
 	set_pcpu section and __start_set_pcpu symbol in kernel and/or modules.
 
 20100913:
 	The $ipv6_prefer variable in rc.conf(5) has been split into
 	$ip6addrctl_policy and $ipv6_activate_all_interfaces.
 
 	The $ip6addrctl_policy is a variable to choose a pre-defined
 	address selection policy set by ip6addrctl(8).  A value
 	"ipv4_prefer", "ipv6_prefer" or "AUTO" can be specified.  The
 	default is "AUTO".
 
 	The $ipv6_activate_all_interfaces specifies whether IFDISABLED
 	flag (see an entry of 20090926) is set on an interface with no
 	corresponding $ifconfig_IF_ipv6 line.  The default is "NO" for
 	security reasons.  If you want an IPv6 link-local address on all
 	interfaces by default, set this to "YES".
 
 	The old ipv6_prefer="YES" is equivalent to
 	ipv6_activate_all_interfaces="YES" and
 	ip6addrctl_policy="ipv6_prefer".
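 
 	As an illustrative sketch of the migration described above (not taken
 	from the rc scripts themselves), an /etc/rc.conf that previously
 	contained ipv6_prefer="YES" could carry the two replacement variables
 	instead:
 
 ```shell
 # /etc/rc.conf fragment (sh syntax); replaces the old ipv6_prefer="YES".
 # Do not set IFDISABLED on interfaces that lack an ifconfig_IF_ipv6 line.
 ipv6_activate_all_interfaces="YES"
 # ip6addrctl(8) policy; "ipv4_prefer" and "AUTO" are the other values.
 ip6addrctl_policy="ipv6_prefer"
 ```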
 
 20100913:
 	DTrace has grown support for userland tracing. Due to this, DTrace is
 	now i386 and amd64 only.
 	dtruss(1) is now installed by default on those systems and a new
 	kernel module is needed for userland tracing: fasttrap.
 	No changes to your kernel config file are necessary to enable
 	userland tracing, but you might consider adding 'STRIP=' and
 	'CFLAGS+=-fno-omit-frame-pointer' to your make.conf if you want
 	to have informative userland stack traces in DTrace (ustack).
 
 20100725:
 	The acpi_aiboost(4) driver has been removed in favor of the new
 	aibs(4) driver. You should update your kernel configuration file.
 
 20100722:
 	BSD grep has been imported to the base system and it is built by
 	default.  It is completely BSD licensed, highly GNU-compatible, uses
 	less memory than its GNU counterpart and has a small codebase.
 	However, it is slower than its GNU counterpart, which is mostly
 	noticeable for larger searches; for smaller ones the difference is
 	measurable but not significant.  The reason is complex; the most
 	important factor is that we lack a modern and efficient regex library,
 	and GNU overcomes this by optimizing the searches internally.  Future
 	work on improving the regex performance is planned; in the meantime,
 	users who need better performance can build GNU grep instead by
 	setting the WITH_GNU_GREP knob.
 
 20100713:
 	Due to the import of powerpc64 support, all existing powerpc kernel
 	configuration files must be updated with a machine directive like this:
 	    machine powerpc powerpc
 
 	In addition, an updated config(8) is required to build powerpc kernels
 	after this change.
 
 20100713:
 	A new version of ZFS (version 15) has been merged to -HEAD.
 	This version uses a python library for the following subcommands:
 	zfs allow, zfs unallow, zfs groupspace, zfs userspace.
 	For full functionality of these commands the following port must
 	be installed: sysutils/py-zfs
 
 20100429:
 	'vm_page's are now hashed by physical address to an array of mutexes.
 	Currently this is only used to serialize access to hold_count. Over 
 	time the page queue mutex will be peeled away. This changes the size
 	of pmap on every architecture and requires all callers of vm_page_hold
 	and vm_page_unhold to be updated.
  
 20100402:
 	WITH_CTF can now be specified in src.conf (not recommended, there
 	are some problems with static executables), make.conf (would also
 	affect ports which do not use GNU make and do not override the
 	compile targets) or in the kernel config (via "makeoptions
 	WITH_CTF=yes").
 	Before this change, WITH_CTF specified there was silently ignored,
 	so make sure that WITH_CTF is not used in places which could lead
 	to unwanted behavior.
 
 20100311:
 	The kernel option COMPAT_IA32 has been replaced with COMPAT_FREEBSD32
 	to allow 32-bit compatibility on non-x86 platforms. All kernel
 	configurations on amd64 and ia64 platforms using these options must
 	be modified accordingly.
 
 20100113:
 	The utmp user accounting database has been replaced with utmpx,
 	the user accounting interface standardized by POSIX.
 	Unfortunately the semantics of utmp and utmpx don't match,
 	making it practically impossible to support both interfaces.
 	The user accounting database is used by tools like finger(1),
 	last(1), talk(1), w(1) and ac(8).
 
 	All applications in the base system use utmpx.  This means only
 	local binaries (e.g. from the ports tree) may still use these
 	utmp database files.  These applications must be rebuilt to make
 	use of utmpx.
 
 	After the system has been upgraded, it is safe to remove the old
 	log files (/var/run/utmp, /var/log/lastlog and /var/log/wtmp*),
 	assuming their contents are of no importance anymore.  Old wtmp
 	databases can only be used by last(1) and ac(8) after they have
 	been converted to the new format using wtmpcvt(1).
 
 20100108:
 	Introduce the kernel thread "deadlock resolver" (which can be enabled
 	via the DEADLKRES option, see NOTES for more details) and the
 	sleepq_type() function for sleepqueues.
 
 20091202:
 	The rc.firewall and rc.firewall6 were unified, and
 	rc.firewall6 and rc.d/ip6fw were removed.
 	As a result of the removal of rc.d/ip6fw, the ipv6_firewall_* rc
 	variables are obsolete.  Instead, the following new rc
 	variables are added to rc.d/ipfw:
 
 		firewall_client_net_ipv6, firewall_simple_iif_ipv6,
 		firewall_simple_inet_ipv6, firewall_simple_oif_ipv6,
 		firewall_simple_onet_ipv6, firewall_trusted_ipv6
 
 	The meanings correspond to the relevant IPv4 variables.
 
 20091125:
 	8.0-RELEASE.
 
 20091113:
 	The default terminal emulation for syscons(4) has been changed
 	from cons25 to xterm on all platforms except pc98.  This means
 	that the /etc/ttys file needs to be updated to ensure correct
 	operation of applications on the console.
 
 	The terminal emulation style can be toggled per window by using
 	vidcontrol(1)'s -T flag.  The TEKEN_CONS25 kernel configuration
 	options can be used to change the compile-time default back to
 	cons25.
 
 	To prevent graphical artifacts, make sure the TERM environment
 	variable is set to match the terminal emulation that is being
 	performed by syscons(4).
 
 20091109:
 	The layout of the structure ieee80211req_scan_result has changed.
 	Applications that require wireless scan results (e.g. ifconfig(8))
 	from net80211 need to be recompiled.
 
 	Applications such as wpa_supplicant(8) may require a full world
 	build without using NO_CLEAN in order to get synchronized with the
 	new structure.
 
 20091025:
 	The iwn(4) driver has been updated to support the 5000 and 5150 series.
 	There's one kernel module for each firmware. Adding "device iwnfw"
 	to the kernel configuration file means including all three firmware
 	images inside the kernel. If you want to include just the one for
 	your wireless card, use the devices iwn4965fw, iwn5000fw or
 	iwn5150fw.
 
 20090926:
 	The rc.d/network_ipv6, IPv6 configuration script has been integrated
 	into rc.d/netif.  The changes are the following:
 
 	1. To use IPv6, simply define $ifconfig_IF_ipv6 like $ifconfig_IF
 	   for IPv4.  For aliases, $ifconfig_IF_aliasN should be used.
 	   Note that both variables need the "inet6" keyword at the head.
 
 	   Do not set $ipv6_network_interfaces manually if you do not
 	   understand what you are doing.  It is not needed in most cases. 
 
 	   $ipv6_ifconfig_IF and $ipv6_ifconfig_IF_aliasN still work, but
 	   they are obsolete.
 
 	2. $ipv6_enable is obsolete.  Use $ipv6_prefer and
 	   "inet6 accept_rtadv" keyword in ifconfig(8) instead.
 
 	   If you define $ipv6_enable=YES, it means $ipv6_prefer=YES and
 	   all configured interfaces have "inet6 accept_rtadv" in the
 	   $ifconfig_IF_ipv6.  These are for backward compatibility.
 
 	3. A new variable $ipv6_prefer has been added.  If NO, IPv6
 	   functionality of interfaces with no corresponding
 	   $ifconfig_IF_ipv6 is disabled by using "inet6 ifdisabled" flag,
 	   and the default address selection policy of ip6addrctl(8) 
 	   is the IPv4-preferred one (see rc.d/ip6addrctl for more details).
 	   Note that if you want to configure IPv6 functionality on the
 	   disabled interfaces after boot, first you need to clear the flag by
 	   using ifconfig(8) like:
 
 		ifconfig em0 inet6 -ifdisabled
 
 	   If YES, the default address selection policy is set as
 	   IPv6-preferred.
 
 	   The default value of $ipv6_prefer is NO.
 
 	4. If your system needs to receive Router Advertisement messages,
 	   define "inet6 accept_rtadv" in $ifconfig_IF_ipv6.  The rc(8)
 	   scripts automatically invoke rtsol(8) when the interface becomes
 	   UP.  The Router Advertisement messages are used for SLAAC
 	   (State-Less Address AutoConfiguration).
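 
 	The points above can be sketched as an /etc/rc.conf fragment.  This is
 	a hedged example only: em0 is borrowed from the ifconfig(8) command
 	shown in item 3, and the 2001:db8::2 address comes from the IPv6
 	documentation prefix and is a placeholder, not a recommended value.
 
 ```shell
 # /etc/rc.conf fragment (sh syntax); interface name and address are placeholders.
 # Accept Router Advertisements (SLAAC) on em0; note the leading "inet6" keyword.
 ifconfig_em0_ipv6="inet6 accept_rtadv"
 # Additional static IPv6 address configured as an alias, again led by "inet6".
 ifconfig_em0_alias0="inet6 2001:db8::2 prefixlen 64"
 ```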
 
 20090922:
 	802.11s D3.03 support was committed. This is incompatible with the
 	previous code, which was based on D3.0.
 
 20090912:
 	A sysctl variable net.inet6.ip6.accept_rtadv now sets the default value
 	of a per-interface flag ND6_IFF_ACCEPT_RTADV; it is no longer a global
 	knob controlling whether Router Advertisement messages are accepted.
 	Also, a per-interface flag ND6_IFF_AUTO_LINKLOCAL has been added and
 	a sysctl variable net.inet6.ip6.auto_linklocal is its default value.
 	The ifconfig(8) utility now supports these flags.
 
 20090910:
 	ZFS snapshots are now mounted with MNT_IGNORE flag. Use -v option for
 	mount(8) and -a option for df(1) to see them.
 
 20090825:
 	The old tunable hw.bus.devctl_disable has been superseded by
 	hw.bus.devctl_queue.  hw.bus.devctl_disable=1 in loader.conf should be
 	replaced by hw.bus.devctl_queue=0.  The default for this new tunable
 	is 1000.
 
 20090813:
 	Remove the option STOP_NMI.  The default action is now to use NMI only
 	for KDB via the newly introduced function stop_cpus_hard() and
 	maintain stop_cpus() to just use a normal IPI_STOP on ia32 and amd64.
 
 20090803:
 	The stable/8 branch was created in subversion.  This corresponds to the
 	RELENG_8 branch in CVS.
 
 20090719:
 	Bump the shared library version numbers for all libraries that do not
 	use symbol versioning as part of the 8.0-RELEASE cycle.  Bump
 	__FreeBSD_version to 800105.
 
 20090714:
 	Due to changes in the implementation of virtual network stack support,
 	all network-related kernel modules must be recompiled.  As this change
 	breaks the ABI, bump __FreeBSD_version to 800104.
 
 20090713:
 	The TOE interface to the TCP syncache has been modified to remove
 	struct tcpopt (<netinet/tcp_var.h>) from the ABI of the network stack.
 	The cxgb driver is the only TOE consumer affected by this change, and
 	needs to be recompiled along with the kernel. As this change breaks
 	the ABI, bump __FreeBSD_version to 800103.
 
 20090712: 
 	Padding has been added to struct tcpcb, sackhint and tcpstat in
 	<netinet/tcp_var.h> to facilitate future MFCs and bug fixes whilst
 	maintaining the ABI. However, this change breaks the ABI, so bump
 	__FreeBSD_version to 800102. User space tools that rely on the size of
 	any of these structs (e.g. sockstat) need to be recompiled.
 
 20090630:
 	The NFS_LEGACYRPC option has been removed along with the old kernel
 	RPC implementation that this option selected. Kernel configurations
 	may need to be adjusted.
 
 20090629:
 	The network interface device nodes at /dev/net/<interface> have been
 	removed.  All ioctl operations can be performed the normal way using
 	routing sockets.  The kqueue functionality can generally be replaced
 	with routing sockets.
 
 20090628:
 	The documentation from the FreeBSD Documentation Project (Handbook,
 	FAQ, etc.) is now installed via packages by sysinstall(8) and under
 	the /usr/local/share/doc/freebsd directory instead of /usr/share/doc.
 
 20090624:
 	The ABI of various structures related to the SYSV IPC API have been
 	changed.  As a result, the COMPAT_FREEBSD[456] and COMPAT_43 kernel
 	options now all require COMPAT_FREEBSD7.  Bump __FreeBSD_version to
 	800100.
 
 20090622:
 	Layout of struct vnet has changed as routing related variables were
 	moved to their own Vimage module. Modules need to be recompiled.  Bump
 	__FreeBSD_version to 800099.
 
 20090619:
 	NGROUPS_MAX and NGROUPS have been increased from 16 to 1023 and 1024
 	respectively.  As long as no more than 16 groups per process are used,
 	no changes should be visible.  When more than 16 groups are used, old
 	binaries may fail if they call getgroups() or getgrouplist() with
 	statically sized storage.  Recompiling will work around this, but
 	applications should be modified to use dynamically allocated storage
 	for group arrays as POSIX.1-2008 does not cap an implementation's
 	number of supported groups at NGROUPS_MAX+1 as previous versions did.
 
 	NFS and portalfs mounts may also be affected as the list of groups is
 	truncated to 16.  Users of NFS who use more than 16 groups, should
 	take care that negative group permissions are not used on the exported
 	file systems as they will not be reliable unless a GSSAPI based
 	authentication method is used.
 
 20090616: 
 	The compiling option ADAPTIVE_LOCKMGRS has been introduced.  This
 	option compiles in the support for adaptive spinning for lockmgrs
 	which want to enable it.  The lockinit() function now accepts the flag
 	LK_ADAPTIVE in order to make the lock object subject to adaptive
 	spinning when both held in write and read mode.
 
 20090613:
 	The layout of the structure returned by IEEE80211_IOC_STA_INFO has
 	changed.  User applications that use this ioctl need to be rebuilt.
 
 20090611:
 	The layout of struct thread has changed.  Kernel and modules need to
 	be rebuilt.
 
 20090608:
 	The layout of structs ifnet, domain, protosw and vnet_net has changed.
 	Kernel modules need to be rebuilt.  Bump __FreeBSD_version to 800097.
 
 20090602:
 	window(1) has been removed from the base system. It can now be
 	installed from ports. The port is called misc/window.
 
 20090601:
 	The way we are storing and accessing `routing table' entries has
 	changed. Programs reading the FIB, like netstat, need to be
 	re-compiled.
 
 20090601:
 	A new netisr implementation has been added for FreeBSD 8.  Network
 	modules, such as igmp, ipdivert, and others, should be
 	rebuilt.
 	Bump __FreeBSD_version to 800096.
 
 20090530:
 	Remove the tunable/sysctl debug.mpsafevfs as its initial purpose is no
 	longer valid.
 
 20090530:
 	Add VOP_ACCESSX(9).  File system modules need to be rebuilt.
 	Bump __FreeBSD_version to 800094.
 
 20090529:
 	Add mnt_xflag field to 'struct mount'.  File system modules need to be
 	rebuilt.
 	Bump __FreeBSD_version to 800093.
 
 20090528:
 	The compiling option ADAPTIVE_SX has been retired and the option
 	NO_ADAPTIVE_SX, which handles the reversed logic, has been introduced.
 	The flags accepted by the sx_init_flags() KPI change accordingly: the
 	SX_ADAPTIVESPIN flag has been retired and the SX_NOADAPTIVE flag has
 	been introduced in order to handle the reversed logic.
 	Bump __FreeBSD_version to 800092.
 
 20090527:
 	Add support for hierarchical jails.  Remove global securelevel.
 	Bump __FreeBSD_version to 800091.
 
 20090523:
 	The layout of struct vnet_net has changed, therefore modules
 	need to be rebuilt.
 	Bump __FreeBSD_version to 800090.
 
 20090523:
 	The newly imported zic(8) produces a new format in the output. Please
 	run tzsetup(8) to install the newly created data to /etc/localtime.
 
 20090520:
 	The sysctl tree for the usb stack has been renamed from hw.usb2.* to
 	hw.usb.* and is now consistent again with previous releases.
 
 20090520:
 	802.11 monitor mode support was revised and driver APIs were changed.
 	Drivers dependent on net80211 now support DLT_IEEE802_11_RADIO instead
 	of DLT_IEEE802_11.  No user-visible data structures were changed but
 	applications that use DLT_IEEE802_11 may require changes.
 	Bump __FreeBSD_version to 800088.
 
 20090430:
 	The layout of the following structs has changed: sysctl_oid,
 	socket, ifnet, inpcbinfo, tcpcb, syncache_head, vnet_inet,
 	vnet_inet6 and vnet_ipfw.  Most modules need to be rebuilt or
 	panics may be experienced.  World rebuild is required for
 	correctly checking networking state from userland.
 	Bump __FreeBSD_version to 800085.
 
 20090429:
 	MLDv2 and Source-Specific Multicast (SSM) have been merged
 	to the IPv6 stack. VIMAGE hooks are in but not yet used.
 	The implementation of SSM within FreeBSD's IPv6 stack closely
 	follows the IPv4 implementation.
 
 	For kernel developers:
 
 	* The most important changes are that the ip6_output() and
 	  ip6_input() paths no longer take the IN6_MULTI_LOCK,
 	  and this lock has been downgraded to a non-recursive mutex.
 
 	* As with the changes to the IPv4 stack to support SSM, filtering
 	  of inbound multicast traffic must now be performed by transport
 	  protocols within the IPv6 stack. This does not apply to TCP and
 	  SCTP, however, it does apply to UDP in IPv6 and raw IPv6.
 
 	* The KPIs used by IPv6 multicast are similar to those used by
 	  the IPv4 stack, with the following differences:
 	   * im6o_mc_filter() is analogous to imo_multicast_filter().
 	   * The legacy KAME entry points in6_joingroup and in6_leavegroup()
 	     are shimmed to in6_mc_join() and in6_mc_leave() respectively.
 	   * IN6_LOOKUP_MULTI() has been deprecated and removed.
 	   * IPv6 relies on MLD for the DAD mechanism. KAME's internal KPIs
 	     for MLDv1 have an additional 'timer' argument which is used to
 	     jitter the initial membership report for the solicited-node
 	     multicast membership on-link.
 	   * This is not strictly needed for MLDv2, which already jitters
 	     its report transmissions.  However, the 'timer' argument is
 	     preserved in case MLDv1 is active on the interface.
 
 	* The KAME linked-list based IPv6 membership implementation has
 	  been refactored to use a vector similar to that used by the IPv4
 	  stack.
 	  Code which maintains a list of its own multicast memberships
 	  internally, e.g. carp, has been updated to reflect the new
 	  semantics.
 
 	* There is a known Lock Order Reversal (LOR) due to in6_setscope()
 	  acquiring the IF_AFDATA_LOCK and being called within ip6_output().
 	  Whilst MLDv2 tries to avoid this otherwise benign LOR, it is an
 	  implementation constraint which needs to be addressed in HEAD.
 
 	For application developers:
 
 	* The changes are broadly similar to those made for the IPv4
 	  stack.
 
 	* The use of IPv4 and IPv6 multicast socket options on the same
 	  socket, using mapped addresses, HAS NOT been tested or supported.
 
 	* There are a number of issues with the implementation of various
 	  IPv6 multicast APIs which need to be resolved in the API surface
 	  before the implementation is fully compatible with KAME userland
 	  use, and these are mostly to do with interface index treatment.
 
 	* The literature available discusses the use of either the delta / ASM
 	  API with setsockopt(2)/getsockopt(2), or the full-state / ASM API
 	  using setsourcefilter(3)/getsourcefilter(3). For more information
 	  please refer to RFC 3678, 'Socket Interface Extensions for
 	  Multicast Source Filters'.
 
 	* Applications which use the published RFC 3678 APIs should be fine.
 
 	For systems administrators:
 
 	* The mtest(8) utility has been refactored to support IPv6, in
 	  addition to IPv4. Interface addresses are no longer accepted
 	  as arguments, their names must be used instead. The utility
 	  will map the interface name to its first IPv4 address as
 	  returned by getifaddrs(3).
 
 	* The ifmcstat(8) utility has also been updated to print the MLDv2
 	  endpoint state and source filter lists via sysctl(3).
 
 	* The net.inet6.ip6.mcast.loop sysctl may be tuned to 0 to disable
 	  loopback of IPv6 multicast datagrams by default; it defaults to 1
 	  to preserve the existing behaviour. Disabling multicast loopback is
 	  recommended for optimal system performance.
 
 	* The IPv6 MROUTING code has been changed to examine this sysctl
 	  instead of attempting to perform a group lookup before looping
 	  back forwarded datagrams.
 
 	Bump __FreeBSD_version to 800084.
 
 20090422:
 	Implement low-level Bluetooth HCI API.
 	Bump __FreeBSD_version to 800083.
 
 20090419:
 	The layout of struct malloc_type, used by modules to register new
 	memory allocation types, has changed.  Most modules will need to
 	be rebuilt or panics may be experienced.
 	Bump __FreeBSD_version to 800081.
 
 20090415:
 	Anticipate overflowing inp_flags - add inp_flags2.
 	This changes most offsets in inpcb, so checking v4 connection
 	state will require a world rebuild.
 	Bump __FreeBSD_version to 800080.
 
 20090415:
 	Add an llentry to struct route and struct route_in6. Modules
 	embedding a struct route will need to be recompiled.
 	Bump __FreeBSD_version to 800079.
 
 20090414:
 	The size of rt_metrics_lite and by extension rtentry has changed.
 	Networking administration apps will need to be recompiled.
 	The route command now supports show as an alias for get, weighting
 	of routes, sticky and nostick flags to alter the behavior of stateful
 	load balancing.
 	Bump __FreeBSD_version to 800078.
 
 20090408:
 	Do not use Giant for kbdmux(4) locking. This is wrong and
 	apparently causing more problems than it solves. This will
 	re-open the issue where interrupt handlers may race with
 	kbdmux(4) in polling mode. Typical symptoms include (but are
 	not limited to) duplicated and/or missing characters when
 	low level console functions (such as gets) are used while
 	interrupts are enabled (for example geli password prompt,
 	mountroot prompt etc.). Disabling kbdmux(4) may help.
 
 20090407:
 	The size of structs vnet_net, vnet_inet and vnet_ipfw has changed;
 	kernel modules referencing any of the above need to be recompiled.
 	Bump __FreeBSD_version to 800075.
 
 20090320:
 	GEOM_PART has become the default partition slicer for storage devices,
 	replacing GEOM_MBR, GEOM_BSD, GEOM_PC98 and GEOM_GPT slicers. It
 	introduces some changes:
 
 	MSDOS/EBR: the devices created from MSDOS extended partition entries
 	(EBR) can be named differently than with GEOM_MBR and are now symlinks
 	to devices with offset-based names. fstabs may need to be modified.
 
 	BSD: the "geometry does not match label" warning is harmless in most
 	cases but it points to problems in file system misalignment with
 	disk geometry. The "c" partition is now implicit, covers the whole
 	top-level drive and cannot be (mis)used by users.
 
 	General: Kernel dumps are now not allowed to be written to devices
 	whose partition types indicate they are meant to be used for file
 	systems (or, in case of MSDOS partitions, as something else than
 	the "386BSD" type).
 
 	Most of these changes date approximately from 200812.
 
 20090319:
 	The uscanner(4) driver has been removed from the kernel. This follows
 	Linux removing theirs in 2.6 and making libusb the default interface
 	(supported by sane).
 
 20090319:
 	The multicast forwarding code has been cleaned up. netstat(1)
 	only relies on KVM now for printing bandwidth upcall meters.
 	The IPv4 and IPv6 modules are split into ip_mroute_mod and
 	ip6_mroute_mod respectively. The config(5) options for statically
 	compiling this code remain the same, i.e. 'options MROUTING'.
 
 20090315:
 	Support for the IFF_NEEDSGIANT network interface flag has been
 	removed, which means that non-MPSAFE network device drivers are no
 	longer supported.  In particular, if_ar, if_sr, and network device
 	drivers from the old (legacy) USB stack can no longer be built or
 	used.
 
 20090313:
 	POSIX.1 Native Language Support (NLS) has been enabled in libc and
 	a bunch of new language catalog files have also been added.
 	This means that some common libc messages are now localized and
 	they depend on the LC_MESSAGES environment variable.
 
 20090313:
 	The k8temp(4) driver has been renamed to amdtemp(4) since
 	support for Family 10 and Family 11 CPU families was added.
 
 20090309:
 	IGMPv3 and Source-Specific Multicast (SSM) have been merged
 	to the IPv4 stack. VIMAGE hooks are in but not yet used.
 
 	For kernel developers, the most important changes are that the
 	ip_output() and ip_input() paths no longer take the IN_MULTI_LOCK(),
 	and this lock has been downgraded to a non-recursive mutex.
 
 	Transport protocols (UDP, Raw IP) are now responsible for filtering
 	inbound multicast traffic according to group membership and source
 	filters. The imo_multicast_filter() KPI exists for this purpose.
 	Transports which do not use multicast (SCTP, TCP) already reject
 	multicast by default. Forwarding and receive performance may improve
 	as a mutex acquisition is no longer needed in the ip_input()
 	low-level input path.  in_addmulti() and in_delmulti() are shimmed
 	to new KPIs which exist to support SSM in-kernel.
 
 	For application developers, it is recommended that loopback of
 	multicast datagrams be disabled for best performance, as this
 	will still cause the lock to be taken for each looped-back
 	datagram transmission. The net.inet.ip.mcast.loop sysctl may
 	be tuned to 0 to disable loopback by default; it defaults to 1
 	to preserve the existing behaviour.
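 
 	A per-socket illustration, offered as a minimal sketch rather than
 	as part of this change: an application can turn loopback off for a
 	single IPv4 socket with the IP_MULTICAST_LOOP socket option instead
 	of changing the system-wide default.  The helper name
 	disable_mcast_loop() is hypothetical.

 	```c
 	#include <sys/types.h>
 	#include <sys/socket.h>
 	#include <netinet/in.h>

 	/*
 	 * Disable loopback of outgoing IPv4 multicast datagrams on one socket.
 	 * Returns 0 on success, -1 on failure (errno is set by setsockopt).
 	 * disable_mcast_loop() is a hypothetical helper name.
 	 */
 	int
 	disable_mcast_loop(int fd)
 	{
 		unsigned char loop = 0;	/* 0 = do not loop datagrams back */

 		return (setsockopt(fd, IPPROTO_IP, IP_MULTICAST_LOOP,
 		    &loop, sizeof(loop)));
 	}
 	```

 	A receiver on the same host that relies on seeing the sender's own
 	datagrams would stop receiving them from that socket.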
 
 	For systems administrators, to obtain best performance with
 	multicast reception and multiple groups, it is always recommended
 	that a card with a suitably precise hash filter is used. Hash
 	collisions will still result in the lock being taken within the
 	transport protocol input path to check group membership.
 
 	If deploying FreeBSD in an environment with IGMP snooping switches,
 	it is recommended that the net.inet.igmp.sendlocal sysctl remain
 	enabled; this forces 224.0.0.0/24 group membership to be announced
 	via IGMP.
 
 	The size of 'struct igmpstat' has changed; netstat needs to be
 	recompiled to reflect this.
 	Bump __FreeBSD_version to 800070.
 
 20090309:
 	libusb20.so.1 is now installed as libusb.so.1 and the ports system
 	updated to use it. This requires a buildworld/installworld in order to
 	update the library and dependencies (usbconfig, etc). It is advisable to
 	rebuild all ports which use libusb. More specific directions are given
 	in the ports collection UPDATING file. Any /etc/libmap.conf entries for
 	libusb are no longer required and can be removed.
 
 20090302:
 	A workaround has been committed to allow the creation of System V
 	shared memory segments larger than 2 GB on 64-bit architectures.
 	Due to a limitation of the existing ABI, the shm_segsz member of
 	struct shmid_ds, as returned by the shmctl(IPC_STAT) call, is
 	wrong for large segments. Note that limits must be explicitly
 	raised to allow such segments to be created.
 
 20090301:
 	The layout of struct ifnet has changed, requiring a rebuild of all
 	network device driver modules.
 
 20090227:
 	The /dev handling for the new USB stack has changed; a
 	buildworld/installworld is required for libusb20.
 
 20090223:
 	The new USB2 stack has now been permanently moved in and all kernel and
 	module names reverted to their previous values (eg, usb, ehci, ohci,
 	ums, ...).  The old usb stack can be compiled in by prefixing the name
 	with the letter 'o'; the old usb modules have been removed.
 	Updating entry 20090216 for xorg and 20090215 for libmap may still
 	apply.
 
 20090217:
 	The rc.conf(5) option if_up_delay has been renamed to
 	defaultroute_delay to better reflect its purpose. If you have
 	customized this setting in /etc/rc.conf you need to update it to
 	use the new name.
 
 20090216:
 	xorg 7.4 wants to configure its input devices via hald which does not
 	yet work with USB2. If the keyboard/mouse does not work in xorg then
 	add
 		Option "AllowEmptyInput" "off"
 	to your ServerLayout section.  This will cause X to use the configured
 	kbd and mouse sections from your xorg.conf.
 
 20090215:
 	The GENERIC kernels for all architectures now default to the new USB2
 	stack. No kernel config options or code have been removed so if a
 	problem arises please report it and optionally revert to the old USB
 	stack. If you are loading USB kernel modules or have a custom kernel
 	that includes GENERIC then ensure that usb names are also changed over,
 	eg uftdi -> usb2_serial_ftdi.
 
 	Older programs linked against the ports libusb 0.1 need to be
 	redirected to the new stack's libusb20.  /etc/libmap.conf can
 	be used for this:
 		# Map old usb library to new one for usb2 stack
 		libusb-0.1.so.8	libusb20.so.1
 
 20090209:
 	All USB ethernet devices now attach as interfaces under the name ueN
 	(eg. ue0). This is to provide a predictable name as vendors often
 	change usb chipsets in a product without notice.
 
 20090203:
 	The ichsmb(4) driver has been changed to require SMBus slave
 	addresses be left-justified (xxxxxxx0b) rather than right-justified.
 	All of the other SMBus controller drivers require left-justified
 	slave addresses, so this change makes all the drivers provide the
 	same interface.
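 
 	As a sketch of what this means for callers (the helper name
 	smbus_left_justify() is hypothetical and not part of any driver):
 	a 7-bit slave address that used to be passed right-justified is
 	shifted left by one bit, leaving bit 0 clear.

 	```c
 	#include <stdint.h>

 	/*
 	 * Convert a right-justified 7-bit SMBus slave address (0xxxxxxxb)
 	 * to the left-justified form (xxxxxxx0b).
 	 * smbus_left_justify() is a hypothetical helper name.
 	 */
 	uint8_t
 	smbus_left_justify(uint8_t addr7)
 	{
 		/* Keep only the 7 address bits, then move them to bits 7..1. */
 		return ((uint8_t)((addr7 & 0x7f) << 1));
 	}
 	```

 	For example, a device at 7-bit address 0x50 would be addressed
 	as 0xa0.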
 
 20090201:
 	INET6 statistics (struct ip6stat) were updated.
 	netstat(1) needs to be recompiled.
 
 20090119:
 	NTFS has been removed from GENERIC kernel on amd64 to match
 	GENERIC on i386. Should not cause any issues since mount_ntfs(8)
 	will load ntfs.ko module automatically when NTFS support is
 	actually needed, unless ntfs.ko is not installed or security
 	level prohibits loading kernel modules. If either is the case,
 	"options NTFS" has to be added into kernel config.
 
 20090115:
 	TCP Appropriate Byte Counting (RFC 3465) support added to kernel.
 	New field in struct tcpcb breaks ABI, so bump __FreeBSD_version to
 	800061. User space tools that rely on the size of struct tcpcb in
 	tcp_var.h (e.g. sockstat) need to be recompiled.
 
 20081225:
 	ng_tty(4) module updated to match the new TTY subsystem.
 	Due to API change, user-level applications must be updated.
 	New API support added to mpd5 CVS and expected to be present
 	in next mpd5.3 release.
 
 20081219:
 	With __FreeBSD_version 800060 the makefs tool is part of
 	the base system (it was a port).
 
 20081216:
 	The afdata and ifnet locks have been changed from mutexes to
 	rwlocks; network modules will need to be recompiled.
 
 20081214:
 	__FreeBSD_version 800059 incorporates the new arp-v2 rewrite.
 	RTF_CLONING, RTF_LLINFO and RTF_WASCLONED flags are eliminated.
 	The new code reduced struct rtentry{} by 16 bytes on 32-bit
 	architecture and 40 bytes on 64-bit architecture. The userland
 	applications "arp" and "ndp" have been updated accordingly.
 	The output from "netstat -r" shows only routing entries and
 	none of the L2 information.
 
 20081130:
 	__FreeBSD_version 800057 marks the switchover from the
 	binary ath hal to source code. Users must add the line:
 
 	options	AH_SUPPORT_AR5416
 
 	to their kernel config files when specifying:
 
 	device	ath_hal
 
 	The ath_hal module no longer exists; the code is now compiled
 	together with the driver in the ath module.  It is now
 	possible to tailor chip support (i.e. reduce the set of chips
 	and thereby the code size); consult ath_hal(4) for details.
 
 20081121:
 	__FreeBSD_version 800054 adds memory barriers to
 	<machine/atomic.h>, new interfaces to ifnet to facilitate
 	multiple hardware transmit queues for cards that support
 	them, and a lock-less ring-buffer implementation to
 	enable drivers to more efficiently manage queueing of
 	packets.
 
 20081117:
 	A new version of ZFS (version 13) has been merged to -HEAD.
 	This version has zpool attribute "listsnapshots" off by
 	default, which means "zfs list" does not show snapshots,
 	and is the same as Solaris behavior.
 
 20081028:
 	dummynet(4) ABI has changed. ipfw(8) needs to be recompiled.
 
 20081009:
 	The uhci, ohci, ehci and slhci USB Host controller drivers have
 	been put into separate modules. If you load the usb module
 	separately through loader.conf you will need to load the
 	appropriate *hci module as well. E.g. for a UHCI-based USB 2.0
 	controller add the following to loader.conf:
 
 		uhci_load="YES"
 		ehci_load="YES"
 
 20081009:
 	The ABI used by the PMC toolset has changed.  Please keep
 	userland (libpmc(3)) and the kernel module (hwpmc(4)) in
 	sync.
 
 20081009:
 	atapci kernel module now includes only generic PCI ATA
 	driver. AHCI driver moved to ataahci kernel module.
 	All vendor-specific code moved into separate kernel modules:
 	ataacard, ataacerlabs, ataadaptec, ataamd, ataati, atacenatek,
 	atacypress, atacyrix, atahighpoint, ataintel, ataite, atajmicron,
 	atamarvell, atamicron, atanational, atanetcell, atanvidia,
 	atapromise, ataserverworks, atasiliconimage, atasis, atavia
 
 20080820:
 	The TTY subsystem of the kernel has been replaced by a new
 	implementation, which provides better scalability and an
 	improved driver model. Most common drivers have been migrated to
 	the new TTY subsystem, while others have not. The following
 	drivers have not yet been ported to the new TTY layer:
 
 	PCI/ISA:
 		cy, digi, rc, rp, sio
 
 	USB:
 		ubser, ucycom
 
 	Line disciplines:
 		ng_h4, ng_tty, ppp, sl, snp
 
 	Adding these drivers to your kernel configuration file will
 	cause compilation to fail.
 
 20080818:
 	ntpd has been upgraded to 4.2.4p5.
 
 20080801:
 	OpenSSH has been upgraded to 5.1p1.
 
 	For many years, FreeBSD's version of OpenSSH preferred DSA
 	over RSA for host and user authentication keys.  With this
 	upgrade, we've switched to the vendor's default of RSA over
 	DSA.  This may cause upgraded clients to warn about unknown
 	host keys even for previously known hosts.  Users should
 	follow the usual procedure for verifying host keys before
 	accepting the RSA key.
 
 	This can be circumvented by setting the "HostKeyAlgorithms"
 	option to "ssh-dss,ssh-rsa" in ~/.ssh/config or on the ssh
 	command line.
 
 	Please note that the sequence of keys offered for
 	authentication has been changed as well.  You may want to
 	specify IdentityFile in a different order to revert this
 	behavior.
 
 20080713:
 	The sio(4) driver has been removed from the i386 and amd64
 	kernel configuration files. This means uart(4) is now the
 	default serial port driver on those platforms as well.
 
 	To prevent collisions with the sio(4) driver, the uart(4) driver
 	uses different names for its device nodes. This means the
 	onboard serial port will now most likely be called "ttyu0"
 	instead of "ttyd0". You may need to reconfigure applications to
 	use the new device names.
 
 	When using the serial port as a boot console, be sure to update
 	/boot/device.hints and /etc/ttys before booting the new kernel.
 	If you forget to do so, you can still manually specify the hints
 	at the loader prompt:
 
 		set hint.uart.0.at="isa"
 		set hint.uart.0.port="0x3F8"
 		set hint.uart.0.flags="0x10"
 		set hint.uart.0.irq="4"
 		boot -s
 
 20080609:
 	The gpt(8) utility has been removed. Use gpart(8) to partition
 	disks instead.
 
 20080603:
 	The version that Linuxulator emulates was changed from 2.4.2
 	to 2.6.16. If you experience any problems with Linux binaries,
 	please try setting the compat.linux.osrelease sysctl to 2.4.2 and,
 	if that fixes the problem, contact the emulation mailing list.
 
 20080525:
 	ISDN4BSD (I4B) was removed from the src tree. You may need to
 	update your kernel configuration and remove relevant entries.
 
 20080509:
 	I have checked in code to support multiple routing tables.
 	See the man pages setfib(1) and setfib(2).
 	This is a hopefully backwards compatible version,
 	but to make use of it you need to compile your kernel
 	with options ROUTETABLES=2 (or more up to 16).
 
 20080420:
 	The 802.11 wireless support was redone to enable multi-bss
 	operation on devices that are capable.  The underlying device
 	is no longer used directly but instead wlanX devices are
 	cloned with ifconfig.  This requires changes to rc.conf files.
 	For example, change:
 		ifconfig_ath0="WPA DHCP"
 	to
 		wlans_ath0=wlan0
 		ifconfig_wlan0="WPA DHCP"
 	see rc.conf(5) for more details.  In addition, mergemaster of
 	/etc/rc.d is highly recommended.  Simultaneous update of userland
 	and kernel wouldn't hurt either.
 
 	As part of the multi-bss changes the wlan_scan_ap and wlan_scan_sta
 	modules were merged into the base wlan module.  All references
 	to these modules (e.g. in kernel config files) must be removed.
 
 20080408:
 	psm(4) has gained write(2) support in native operation level.
 	Arbitrary commands can be written to /dev/psm%d and status can
 	be read back from it.  Therefore, an application is responsible
 	for status validation and error recovery.  It is a no-op in
 	other operation levels.
 
 20080312:
 	Support for KSE threading has been removed from the kernel.  To
 	run legacy applications linked against KSE libmap.conf may
 	be used.  The following libmap.conf may be used to ensure
 	compatibility with any prior release:
 
 	libpthread.so.1 libthr.so.1
 	libpthread.so.2 libthr.so.2
 	libkse.so.3 libthr.so.3
 
 20080301:
 	The layout of struct vmspace has changed. This affects libkvm
 	and any executables that link against libkvm and use the
 	kvm_getprocs() function. In particular, but not exclusively,
 	it affects ps(1), fstat(1), pkill(1), systat(1), top(1) and w(1).
 	The effects are minimal, but it's advisable to upgrade world
 	nonetheless.
 
 20080229:
 	The latest em driver no longer supports the 82575 adapter;
 	support for it has moved to the igb driver. The
 	split was done to make new features that are incompatible
 	with older hardware easier to do.
 
 20080220:
 	The new geom_lvm(4) geom class has been renamed to geom_linux_lvm(4),
 	likewise the kernel option is now GEOM_LINUX_LVM.
 
 20080211:
 	The default NFS mount mode has changed from UDP to TCP for
 	increased reliability.  If you rely on (insecurely) NFS
 	mounting across a firewall you may need to update your
 	firewall rules.
 
 20080208:
 	Belatedly note the addition of m_collapse for compacting
 	mbuf chains.
 
 20080126:
 	The fts(3) structures have been changed to use adequate
 	integer types for their members and so to be able to cope
 	with huge file trees.  The old fts(3) ABI is preserved
 	through symbol versioning in libc, so third-party binaries
 	using fts(3) should still work, although they will not take
 	advantage of the extended types.  At the same time, some
 	third-party software might fail to build after this change
 	due to unportable assumptions made in its source code about
 	fts(3) structure members.  Such software should be fixed
 	by its vendor or, in the worst case, in the ports tree.
 	__FreeBSD_version 800015 marks this change for the unlikely
 	case that a portable fix is impossible.
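 
 	Where such a version check really is unavoidable, it might look
 	like the following sketch.  HAVE_WIDE_FTS and have_wide_fts() are
 	hypothetical names chosen for illustration; __FreeBSD_version comes
 	from <sys/param.h> on FreeBSD and is simply undefined elsewhere.

 	```c
 	#include <sys/param.h>	/* defines __FreeBSD_version on FreeBSD */

 	/*
 	 * HAVE_WIDE_FTS is a hypothetical feature macro: 1 when building
 	 * against the fts(3) structures with the extended integer types,
 	 * 0 on older FreeBSD versions and on other systems.
 	 */
 	#if defined(__FreeBSD_version) && __FreeBSD_version >= 800015
 	#define	HAVE_WIDE_FTS	1
 	#else
 	#define	HAVE_WIDE_FTS	0
 	#endif

 	/* Report the compile-time decision at run time. */
 	int
 	have_wide_fts(void)
 	{
 		return (HAVE_WIDE_FTS);
 	}
 	```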
 
 20080123:
 	To upgrade to -current after this date, you must be running
 	FreeBSD not older than 6.0-RELEASE.  Upgrading to -current
 	from 5.x now requires a stop over at RELENG_6 or RELENG_7 systems.
 
 20071128:
 	The ADAPTIVE_GIANT kernel option has been retired because its
 	functionality is the default now.
 
 20071118:
 	The AT keyboard emulation of sunkbd(4) has been turned on
 	by default. In order to make the special symbols of the Sun
 	keyboards driven by sunkbd(4) work under X these now have
 	to be configured the same way as Sun USB keyboards driven
 	by ukbd(4) (which also does AT keyboard emulation), e.g.:
 
 	Option	"XkbLayout" "us"
 	Option	"XkbRules" "xorg"
 	Option	"XkbSymbols" "pc(pc105)+sun_vndr/usb(sun_usb)+us"
 
 20071024:
 	It has been decided that it is desirable to provide ABI
 	backwards compatibility to the FreeBSD 4/5/6 versions of the
 	PCIOCGETCONF, PCIOCREAD and PCIOCWRITE IOCTLs, which was
 	broken with the introduction of PCI domain support (see the
 	20070930 entry). Unfortunately, this required the ABI of
 	PCIOCGETCONF to be broken again in order to be able to
 	provide backwards compatibility to the old version of that
 	IOCTL. Thus consumers of PCIOCGETCONF have to be recompiled
 	again. As for prominent ports this affects neither pciutils
 	nor xorg-server this time, the hal port needs to be rebuilt
 	however.
 
 20071020:
 	The misnamed kthread_create() and friends have been renamed
 	to kproc_create() etc. Many of the callers already
 	used kproc_start().
 	I will return kthread_create() and friends in a while
 	with implementations that actually create threads, not procs.
 	Renaming corresponds with version 800002.
 
 20071010:
 	RELENG_7 branched.
 
 COMMON ITEMS:
 
 	General Notes
 	-------------
 	Avoid using make -j when upgrading.  While generally safe, there are
 	sometimes problems using -j to upgrade.  If your upgrade fails with
 	-j, please try again without -j.  From time to time in the past there
 	have been problems using -j with buildworld and/or installworld.  This
 	is especially true when upgrading between "distant" versions (e.g. ones
 	that cross a major release boundary or several minor releases, or when
 	several months have passed on the -current branch).
 
 	Sometimes, obscure build problems are the result of environment
 	poisoning.  This can happen because the make utility reads its
 	environment when searching for values for global variables.  To run
 	your build attempts in an "environmental clean room", prefix all make
 	commands with 'env -i '.  See the env(1) manual page for more details.
 
 	When upgrading from one major version to another it is generally best
 	to upgrade to the latest code in the currently installed branch first,
 	then do an upgrade to the new branch. This is the best-tested upgrade
 	path, and has the highest probability of being successful.  Please try
 	this approach before reporting problems with a major version upgrade.
 
 	ZFS notes
 	---------
 	When upgrading the boot ZFS pool to a new version, always follow
 	these two steps:
 
 	1.) recompile and reinstall the ZFS boot loader and boot block
 	(this is part of "make buildworld" and "make installworld")
 
 	2.) update the ZFS boot block on your boot drive
 
 	The following example updates the ZFS boot block on the first
 	partition (freebsd-boot) of a GPT partitioned drive ad0:
 	"gpart bootcode -p /boot/gptzfsboot -i 1 ad0"
 
 	Non-boot pools do not need these updates.
 
 	To build a kernel
 	-----------------
 	If you are updating from a prior version of FreeBSD (even one just
 	a few days old), you should follow this procedure.  It is the most
 	failsafe as it uses a /usr/obj tree with a fresh mini-buildworld:
 
 	make kernel-toolchain
 	make -DALWAYS_CHECK_MAKE buildkernel KERNCONF=YOUR_KERNEL_HERE
 	make -DALWAYS_CHECK_MAKE installkernel KERNCONF=YOUR_KERNEL_HERE
 
 	To test a kernel once
 	---------------------
 	If you just want to boot a kernel once (because you are not sure
 	if it works, or if you want to boot a known bad kernel to provide
 	debugging information) run
 	make installkernel KERNCONF=YOUR_KERNEL_HERE KODIR=/boot/testkernel
 	nextboot -k testkernel
 
 	To just build a kernel when you know that it won't mess you up
 	--------------------------------------------------------------
 	This assumes you are already running a CURRENT system.  Replace
 	${arch} with the architecture of your machine (e.g. "i386",
 	"arm", "amd64", "ia64", "pc98", "sparc64", "powerpc", "mips", etc).
 
 	cd src/sys/${arch}/conf
 	config KERNEL_NAME_HERE
 	cd ../compile/KERNEL_NAME_HERE
 	make depend
 	make
 	make install
 
 	If this fails, go to the "To build a kernel" section.
 
 	To rebuild everything and install it on the current system.
 	-----------------------------------------------------------
 	# Note: sometimes if you are running current you gotta do more than
 	# is listed here if you are upgrading from a really old current.
 
 	<make sure you have good level 0 dumps>
 	make buildworld
 	make kernel KERNCONF=YOUR_KERNEL_HERE
 							[1]
 	<reboot in single user>				[3]
 	mergemaster -p					[5]
 	make installworld
 	mergemaster -i					[4]
 	make delete-old					[6]
 	<reboot>
 
 
 	To cross-install current onto a separate partition
 	--------------------------------------------------
 	# In this approach we use a separate partition to hold
 	# current's root, 'usr', and 'var' directories.   A partition
 	# holding "/", "/usr" and "/var" should be about 2GB in
 	# size.
 
 	<make sure you have good level 0 dumps>
 	<boot into -stable>
 	make buildworld
 	make buildkernel KERNCONF=YOUR_KERNEL_HERE
 	<maybe newfs current's root partition>
 	<mount current's root partition on directory ${CURRENT_ROOT}>
 	make installworld DESTDIR=${CURRENT_ROOT}
 	make distribution DESTDIR=${CURRENT_ROOT} # if newfs'd
 	make installkernel KERNCONF=YOUR_KERNEL_HERE DESTDIR=${CURRENT_ROOT}
 	cp /etc/fstab ${CURRENT_ROOT}/etc/fstab 		   # if newfs'd
 	<edit ${CURRENT_ROOT}/etc/fstab to mount "/" from the correct partition>
 	<reboot into current>
 	<do a "native" rebuild/install as described in the previous section>
 	<maybe install compatibility libraries from ports/misc/compat*>
 	<reboot>
 
 
 	To upgrade in-place from 8.x-stable to current
 	----------------------------------------------
 	<make sure you have good level 0 dumps>
 	make buildworld					[9]
 	make kernel KERNCONF=YOUR_KERNEL_HERE		[8]
 							[1]
 	<reboot in single user>				[3]
 	mergemaster -p					[5]
 	make installworld
 	mergemaster -i					[4]
 	make delete-old					[6]
 	<reboot>
 
 	Make sure that you've read the UPDATING file to understand the
 	tweaks to various things you need.  At this point in the life
 	cycle of current, things change often and you are on your own
 	to cope.  The defaults can also change, so please read ALL of
 	the UPDATING entries.
 
 	Also, if you are tracking -current, you must be subscribed to
 	freebsd-current@freebsd.org.  Make sure that before you update
 	your sources that you have read and understood all the recent
 	messages there.  If in doubt, please track -stable which has
 	far fewer pitfalls.
 
 	[1] If you have third party modules, such as vmware, you
 	should disable them at this point so they don't crash your
 	system on reboot.
 
 	[3] From the bootblocks, boot -s, and then do
 		fsck -p
 		mount -u /
 		mount -a
 		cd src
 		adjkerntz -i		# if CMOS is wall time
 	Also, when doing a major release upgrade, it is required that
 	you boot into single user mode to do the installworld.
 
 	[4] Note: This step is non-optional.  Failure to do this step
 	can result in a significant reduction in the functionality of the
 	system.  Attempting to do it by hand is not recommended and those
 	that pursue this avenue should read this file carefully, as well
 	as the archives of freebsd-current and freebsd-hackers mailing lists
 	for potential gotchas.  The -U option is also useful to consider.
 	See mergemaster(8) for more information.
 
 	[5] Usually this step is a noop.  However, from time to time
 	you may need to do this if you get "unknown user" errors in the
 	following step.  It never hurts to do it all the time.  You may need to
 	install a new mergemaster (cd src/usr.sbin/mergemaster && make
 	install) after the buildworld before this step if you last updated
 	from current before 20020224 or from -stable before 20020408.
 
 	[6] This only deletes old files and directories. Old libraries
 	can be deleted by "make delete-old-libs", but you have to make
 	sure that no program is using those libraries anymore.
 
 	[8] In order to have a kernel that can run the 4.x binaries needed to
 	do an installworld, you must include the COMPAT_FREEBSD4 option in
 	your kernel.  Failure to do so may leave you with a system that is
 	hard to boot to recover. A similar kernel option COMPAT_FREEBSD5 is
 	required to run the 5.x binaries on more recent kernels.  And so on
 	for COMPAT_FREEBSD6 and COMPAT_FREEBSD7.
 
 	Make sure that you merge any new devices from GENERIC since the
 	last time you updated your kernel config file.
 
 	[9] When checking out sources, you must include the -P flag to have
 	cvs prune empty directories.
 
 	If CPUTYPE is defined in your /etc/make.conf, make sure to use the
 	"?=" instead of the "=" assignment operator, so that buildworld can
 	override the CPUTYPE if it needs to.
 
 	MAKEOBJDIRPREFIX must be defined in an environment variable, and
 	not on the command line, or in /etc/make.conf.  buildworld will
 	warn if it is improperly defined.
 FORMAT:
 
 This file contains a list, in reverse chronological order, of major
 breakages in tracking -current.  Not all things will be listed here,
 and it only starts on October 16, 2004.  Updating files can be found in
 previous releases if your system is older than this.
 
 Copyright information:
 
 Copyright 1998-2009 M. Warner Losh.  All Rights Reserved.
 
 Redistribution, publication, translation and use, with or without
 modification, in full or in part, in any form or format of this
 document are permitted without further permission from the author.
 
 THIS DOCUMENT IS PROVIDED BY WARNER LOSH ``AS IS'' AND ANY EXPRESS OR
 IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
 WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
 DISCLAIMED.  IN NO EVENT SHALL WARNER LOSH BE LIABLE FOR ANY DIRECT,
 INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
 (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
 SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
 HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
 STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING
 IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
 POSSIBILITY OF SUCH DAMAGE.
 
 Contact Warner Losh if you have any questions about your use of
 this document.
 
 $FreeBSD$
Index: head/cddl/contrib/opensolaris
===================================================================
--- head/cddl/contrib/opensolaris	(revision 222812)
+++ head/cddl/contrib/opensolaris	(revision 222813)

Property changes on: head/cddl/contrib/opensolaris
___________________________________________________________________
Modified: svn:mergeinfo
## -0,0 +0,1 ##
   Merged /projects/largeSMP/cddl/contrib/opensolaris:r221273-222812
Index: head/contrib/bind9
===================================================================
--- head/contrib/bind9	(revision 222812)
+++ head/contrib/bind9	(revision 222813)

Property changes on: head/contrib/bind9
___________________________________________________________________
Modified: svn:mergeinfo
## -0,0 +0,1 ##
   Merged /projects/largeSMP/contrib/bind9:r221273-222812
Index: head/contrib/binutils
===================================================================
--- head/contrib/binutils	(revision 222812)
+++ head/contrib/binutils	(revision 222813)

Property changes on: head/contrib/binutils
___________________________________________________________________
Modified: svn:mergeinfo
## -0,0 +0,1 ##
   Merged /projects/largeSMP/contrib/binutils:r221273-222812
Index: head/contrib/bzip2
===================================================================
--- head/contrib/bzip2	(revision 222812)
+++ head/contrib/bzip2	(revision 222813)

Property changes on: head/contrib/bzip2
___________________________________________________________________
Modified: svn:mergeinfo
## -0,0 +0,1 ##
   Merged /projects/largeSMP/contrib/bzip2:r221273-222812
Index: head/contrib/compiler-rt
===================================================================
--- head/contrib/compiler-rt	(revision 222812)
+++ head/contrib/compiler-rt	(revision 222813)

Property changes on: head/contrib/compiler-rt
___________________________________________________________________
Modified: svn:mergeinfo
## -0,0 +0,1 ##
   Merged /projects/largeSMP/contrib/compiler-rt:r221273-222812
Index: head/contrib/dialog
===================================================================
--- head/contrib/dialog	(revision 222812)
+++ head/contrib/dialog	(revision 222813)

Property changes on: head/contrib/dialog
___________________________________________________________________
Modified: svn:mergeinfo
## -0,0 +0,1 ##
   Merged /projects/largeSMP/contrib/dialog:r221273-222812
Index: head/contrib/ee
===================================================================
--- head/contrib/ee	(revision 222812)
+++ head/contrib/ee	(revision 222813)

Property changes on: head/contrib/ee
___________________________________________________________________
Modified: svn:mergeinfo
## -0,0 +0,1 ##
   Merged /projects/largeSMP/contrib/ee:r221273-222812
Index: head/contrib/expat
===================================================================
--- head/contrib/expat	(revision 222812)
+++ head/contrib/expat	(revision 222813)

Property changes on: head/contrib/expat
___________________________________________________________________
Modified: svn:mergeinfo
## -0,0 +0,1 ##
   Merged /projects/largeSMP/contrib/expat:r221273-222812
Index: head/contrib/file
===================================================================
--- head/contrib/file	(revision 222812)
+++ head/contrib/file	(revision 222813)

Property changes on: head/contrib/file
___________________________________________________________________
Modified: svn:mergeinfo
## -0,0 +0,1 ##
   Merged /projects/largeSMP/contrib/file:r221273-222812
Index: head/contrib/gcc
===================================================================
--- head/contrib/gcc	(revision 222812)
+++ head/contrib/gcc	(revision 222813)

Property changes on: head/contrib/gcc
___________________________________________________________________
Modified: svn:mergeinfo
## -0,0 +0,1 ##
   Merged /projects/largeSMP/contrib/gcc:r221273-222812
Index: head/contrib/gdb
===================================================================
--- head/contrib/gdb	(revision 222812)
+++ head/contrib/gdb	(revision 222813)

Property changes on: head/contrib/gdb
___________________________________________________________________
Modified: svn:mergeinfo
## -0,0 +0,1 ##
   Merged /projects/largeSMP/contrib/gdb:r221273-222812
Index: head/contrib/gdtoa
===================================================================
--- head/contrib/gdtoa	(revision 222812)
+++ head/contrib/gdtoa	(revision 222813)

Property changes on: head/contrib/gdtoa
___________________________________________________________________
Modified: svn:mergeinfo
## -0,0 +0,1 ##
   Merged /projects/largeSMP/contrib/gdtoa:r221273-222812
Index: head/contrib/gnu-sort
===================================================================
--- head/contrib/gnu-sort	(revision 222812)
+++ head/contrib/gnu-sort	(revision 222813)

Property changes on: head/contrib/gnu-sort
___________________________________________________________________
Modified: svn:mergeinfo
## -0,0 +0,1 ##
   Merged /projects/largeSMP/contrib/gnu-sort:r221273-222812
Index: head/contrib/groff
===================================================================
--- head/contrib/groff	(revision 222812)
+++ head/contrib/groff	(revision 222813)

Property changes on: head/contrib/groff
___________________________________________________________________
Modified: svn:mergeinfo
## -0,0 +0,1 ##
   Merged /projects/largeSMP/contrib/groff:r221273-222812
Index: head/contrib/less
===================================================================
--- head/contrib/less	(revision 222812)
+++ head/contrib/less	(revision 222813)

Property changes on: head/contrib/less
___________________________________________________________________
Modified: svn:mergeinfo
## -0,0 +0,1 ##
   Merged /projects/largeSMP/contrib/less:r221273-222812
Index: head/contrib/libpcap
===================================================================
--- head/contrib/libpcap	(revision 222812)
+++ head/contrib/libpcap	(revision 222813)

Property changes on: head/contrib/libpcap
___________________________________________________________________
Modified: svn:mergeinfo
## -0,0 +0,1 ##
   Merged /projects/largeSMP/contrib/libpcap:r221273-222812
Index: head/contrib/libstdc++
===================================================================
--- head/contrib/libstdc++	(revision 222812)
+++ head/contrib/libstdc++	(revision 222813)

Property changes on: head/contrib/libstdc++
___________________________________________________________________
Modified: svn:mergeinfo
## -0,0 +0,1 ##
   Merged /projects/largeSMP/contrib/libstdc++:r221273-222812
Index: head/contrib/llvm/tools/clang
===================================================================
--- head/contrib/llvm/tools/clang	(revision 222812)
+++ head/contrib/llvm/tools/clang	(revision 222813)

Property changes on: head/contrib/llvm/tools/clang
___________________________________________________________________
Modified: svn:mergeinfo
## -0,0 +0,1 ##
   Merged /projects/largeSMP/contrib/llvm/tools/clang:r221273-222812
Index: head/contrib/llvm
===================================================================
--- head/contrib/llvm	(revision 222812)
+++ head/contrib/llvm	(revision 222813)

Property changes on: head/contrib/llvm
___________________________________________________________________
Modified: svn:mergeinfo
## -0,0 +0,1 ##
   Merged /projects/largeSMP/contrib/llvm:r221273-222812
Index: head/contrib/ncurses
===================================================================
--- head/contrib/ncurses	(revision 222812)
+++ head/contrib/ncurses	(revision 222813)

Property changes on: head/contrib/ncurses
___________________________________________________________________
Modified: svn:mergeinfo
## -0,0 +0,1 ##
   Merged /projects/largeSMP/contrib/ncurses:r221273-222812
Index: head/contrib/netcat
===================================================================
--- head/contrib/netcat	(revision 222812)
+++ head/contrib/netcat	(revision 222813)

Property changes on: head/contrib/netcat
___________________________________________________________________
Modified: svn:mergeinfo
## -0,0 +0,1 ##
   Merged /projects/largeSMP/contrib/netcat:r221273-222812
Index: head/contrib/ntp
===================================================================
--- head/contrib/ntp	(revision 222812)
+++ head/contrib/ntp	(revision 222813)

Property changes on: head/contrib/ntp
___________________________________________________________________
Modified: svn:mergeinfo
## -0,0 +0,1 ##
   Merged /projects/largeSMP/contrib/ntp:r221273-222812
Index: head/contrib/one-true-awk
===================================================================
--- head/contrib/one-true-awk	(revision 222812)
+++ head/contrib/one-true-awk	(revision 222813)

Property changes on: head/contrib/one-true-awk
___________________________________________________________________
Modified: svn:mergeinfo
## -0,0 +0,1 ##
   Merged /projects/largeSMP/contrib/one-true-awk:r221273-222812
Index: head/contrib/openbsm
===================================================================
--- head/contrib/openbsm	(revision 222812)
+++ head/contrib/openbsm	(revision 222813)

Property changes on: head/contrib/openbsm
___________________________________________________________________
Modified: svn:mergeinfo
## -0,0 +0,1 ##
   Merged /projects/largeSMP/contrib/openbsm:r221273-222812
Index: head/contrib/openpam
===================================================================
--- head/contrib/openpam	(revision 222812)
+++ head/contrib/openpam	(revision 222813)

Property changes on: head/contrib/openpam
___________________________________________________________________
Modified: svn:mergeinfo
## -0,0 +0,1 ##
   Merged /projects/largeSMP/contrib/openpam:r221273-222812
Index: head/contrib/pf
===================================================================
--- head/contrib/pf	(revision 222812)
+++ head/contrib/pf	(revision 222813)

Property changes on: head/contrib/pf
___________________________________________________________________
Modified: svn:mergeinfo
## -0,0 +0,1 ##
   Merged /projects/largeSMP/contrib/pf:r221273-222812
Index: head/contrib/sendmail
===================================================================
--- head/contrib/sendmail	(revision 222812)
+++ head/contrib/sendmail	(revision 222813)

Property changes on: head/contrib/sendmail
___________________________________________________________________
Modified: svn:mergeinfo
## -0,0 +0,1 ##
   Merged /projects/largeSMP/contrib/sendmail:r221273-222812
Index: head/contrib/tcpdump
===================================================================
--- head/contrib/tcpdump	(revision 222812)
+++ head/contrib/tcpdump	(revision 222813)

Property changes on: head/contrib/tcpdump
___________________________________________________________________
Modified: svn:mergeinfo
## -0,0 +0,1 ##
   Merged /projects/largeSMP/contrib/tcpdump:r221273-222812
Index: head/contrib/tcsh
===================================================================
--- head/contrib/tcsh	(revision 222812)
+++ head/contrib/tcsh	(revision 222813)

Property changes on: head/contrib/tcsh
___________________________________________________________________
Modified: svn:mergeinfo
## -0,0 +0,1 ##
   Merged /projects/largeSMP/contrib/tcsh:r221273-222812
Index: head/contrib/top/install-sh
===================================================================
--- head/contrib/top/install-sh	(revision 222812)
+++ head/contrib/top/install-sh	(revision 222813)

Property changes on: head/contrib/top/install-sh
___________________________________________________________________
Modified: svn:mergeinfo
## -0,0 +0,1 ##
   Merged /projects/largeSMP/contrib/top/install-sh:r221273-222812
Index: head/contrib/top
===================================================================
--- head/contrib/top	(revision 222812)
+++ head/contrib/top	(revision 222813)

Property changes on: head/contrib/top
___________________________________________________________________
Modified: svn:mergeinfo
## -0,0 +0,1 ##
   Merged /projects/largeSMP/contrib/top:r221273-222812
Index: head/contrib/tzcode/stdtime
===================================================================
--- head/contrib/tzcode/stdtime	(revision 222812)
+++ head/contrib/tzcode/stdtime	(revision 222813)

Property changes on: head/contrib/tzcode/stdtime
___________________________________________________________________
Modified: svn:mergeinfo
## -0,0 +0,1 ##
   Merged /projects/largeSMP/contrib/tzcode/stdtime:r221273-222812
Index: head/contrib/tzcode/zic
===================================================================
--- head/contrib/tzcode/zic	(revision 222812)
+++ head/contrib/tzcode/zic	(revision 222813)

Property changes on: head/contrib/tzcode/zic
___________________________________________________________________
Modified: svn:mergeinfo
## -0,0 +0,1 ##
   Merged /projects/largeSMP/contrib/tzcode/zic:r221273-222812
Index: head/contrib/tzdata
===================================================================
--- head/contrib/tzdata	(revision 222812)
+++ head/contrib/tzdata	(revision 222813)

Property changes on: head/contrib/tzdata
___________________________________________________________________
Modified: svn:mergeinfo
## -0,0 +0,1 ##
   Merged /projects/largeSMP/contrib/tzdata:r221273-222812
Index: head/contrib/wpa
===================================================================
--- head/contrib/wpa	(revision 222812)
+++ head/contrib/wpa	(revision 222813)

Property changes on: head/contrib/wpa
___________________________________________________________________
Modified: svn:mergeinfo
## -0,1 +0,1 ##
   Reverse-merged /vendor/wpa/dist:r189252-189254
   Merged /projects/largeSMP/contrib/wpa:r221273-222812
Index: head/contrib/xz
===================================================================
--- head/contrib/xz	(revision 222812)
+++ head/contrib/xz	(revision 222813)

Property changes on: head/contrib/xz
___________________________________________________________________
Modified: svn:mergeinfo
## -0,0 +0,1 ##
   Merged /projects/largeSMP/contrib/xz:r221273-222812
Index: head/crypto/openssh
===================================================================
--- head/crypto/openssh	(revision 222812)
+++ head/crypto/openssh	(revision 222813)

Property changes on: head/crypto/openssh
___________________________________________________________________
Modified: svn:mergeinfo
## -0,0 +0,1 ##
   Merged /projects/largeSMP/crypto/openssh:r221273-222812
Index: head/crypto/openssl
===================================================================
--- head/crypto/openssl	(revision 222812)
+++ head/crypto/openssl	(revision 222813)

Property changes on: head/crypto/openssl
___________________________________________________________________
Modified: svn:mergeinfo
## -0,0 +0,1 ##
   Merged /projects/largeSMP/crypto/openssl:r221273-222812
Index: head/gnu/lib
===================================================================
--- head/gnu/lib	(revision 222812)
+++ head/gnu/lib	(revision 222813)

Property changes on: head/gnu/lib
___________________________________________________________________
Modified: svn:mergeinfo
## -0,0 +0,1 ##
   Merged /projects/largeSMP/gnu/lib:r221273-222812
Index: head/gnu/usr.bin/binutils
===================================================================
--- head/gnu/usr.bin/binutils	(revision 222812)
+++ head/gnu/usr.bin/binutils	(revision 222813)

Property changes on: head/gnu/usr.bin/binutils
___________________________________________________________________
Modified: svn:mergeinfo
## -0,0 +0,1 ##
   Merged /projects/largeSMP/gnu/usr.bin/binutils:r221273-222812
Index: head/gnu/usr.bin/cc/cc_tools
===================================================================
--- head/gnu/usr.bin/cc/cc_tools	(revision 222812)
+++ head/gnu/usr.bin/cc/cc_tools	(revision 222813)

Property changes on: head/gnu/usr.bin/cc/cc_tools
___________________________________________________________________
Modified: svn:mergeinfo
## -0,0 +0,1 ##
   Merged /projects/largeSMP/gnu/usr.bin/cc/cc_tools:r221273-222812
Index: head/gnu/usr.bin/gdb/kgdb/kthr.c
===================================================================
--- head/gnu/usr.bin/gdb/kgdb/kthr.c	(revision 222812)
+++ head/gnu/usr.bin/gdb/kgdb/kthr.c	(revision 222813)
@@ -1,238 +1,242 @@
 /*
  * Copyright (c) 2004 Marcel Moolenaar
  * All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
  *
  * 1. Redistributions of source code must retain the above copyright
  *    notice, this list of conditions and the following disclaimer.
  * 2. Redistributions in binary form must reproduce the above copyright
  *    notice, this list of conditions and the following disclaimer in the
  *    documentation and/or other materials provided with the distribution.
  *
  * THIS SOFTWARE IS PROVIDED BY THE AUTHORS ``AS IS'' AND ANY EXPRESS OR
  * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
  * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
  * IN NO EVENT SHALL THE AUTHORS BE LIABLE FOR ANY DIRECT, INDIRECT,
  * INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT
  * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
  * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
  * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
  * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF
  * THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
  */
 
 #include <sys/cdefs.h>
 __FBSDID("$FreeBSD$");
 
 #include <sys/param.h>
+#include <sys/cpuset.h>
 #include <sys/proc.h>
 #include <sys/types.h>
 #include <sys/signal.h>
 #include <err.h>
 #include <inttypes.h>
 #include <kvm.h>
 #include <stdio.h>
 #include <stdlib.h>
 #include <string.h>
+#include <unistd.h>
 
 #include <defs.h>
 #include <frame-unwind.h>
 
 #include "kgdb.h"
 #include <machine/pcb.h>
 
 static CORE_ADDR dumppcb;
 static int dumptid;
 
 static CORE_ADDR stoppcbs;
-static __cpumask_t stopped_cpus;
+static cpuset_t stopped_cpus;
 
 static struct kthr *first;
 struct kthr *curkthr;
 
 CORE_ADDR
 kgdb_lookup(const char *sym)
 {
 	CORE_ADDR addr;
 	char *name;
 
 	asprintf(&name, "&%s", sym);
 	addr = kgdb_parse(name);
 	free(name);
 	return (addr);
 }
 
 struct kthr *
 kgdb_thr_first(void)
 {
 	return (first);
 }
 
 struct kthr *
 kgdb_thr_init(void)
 {
 	struct proc p;
 	struct thread td;
+	long cpusetsize;
 	struct kthr *kt;
 	CORE_ADDR addr;
 	uintptr_t paddr;
 	
 	while (first != NULL) {
 		kt = first;
 		first = kt->next;
 		free(kt);
 	}
 
 	addr = kgdb_lookup("allproc");
 	if (addr == 0)
 		return (NULL);
 	kvm_read(kvm, addr, &paddr, sizeof(paddr));
 
 	dumppcb = kgdb_lookup("dumppcb");
 	if (dumppcb == 0)
 		return (NULL);
 
 	addr = kgdb_lookup("dumptid");
 	if (addr != 0)
 		kvm_read(kvm, addr, &dumptid, sizeof(dumptid));
 	else
 		dumptid = -1;
 
 	addr = kgdb_lookup("stopped_cpus");
-	if (addr != 0)
-		kvm_read(kvm, addr, &stopped_cpus, sizeof(stopped_cpus));
-	else
-		stopped_cpus = 0;
+	CPU_ZERO(&stopped_cpus);
+	cpusetsize = sysconf(_SC_CPUSET_SIZE);
+	if (cpusetsize != -1 && (u_long)cpusetsize <= sizeof(cpuset_t) &&
+	    addr != 0)
+		kvm_read(kvm, addr, &stopped_cpus, cpusetsize);
 
 	stoppcbs = kgdb_lookup("stoppcbs");
 
 	while (paddr != 0) {
 		if (kvm_read(kvm, paddr, &p, sizeof(p)) != sizeof(p)) {
 			warnx("kvm_read: %s", kvm_geterr(kvm));
 			break;
 		}
 		addr = (uintptr_t)TAILQ_FIRST(&p.p_threads);
 		while (addr != 0) {
 			if (kvm_read(kvm, addr, &td, sizeof(td)) !=
 			    sizeof(td)) {
 				warnx("kvm_read: %s", kvm_geterr(kvm));
 				break;
 			}
 			kt = malloc(sizeof(*kt));
 			kt->next = first;
 			kt->kaddr = addr;
 			if (td.td_tid == dumptid)
 				kt->pcb = dumppcb;
-			else if (td.td_state == TDS_RUNNING && ((1 << td.td_oncpu) & stopped_cpus)
-				&& stoppcbs != 0)
+			else if (td.td_state == TDS_RUNNING && stoppcbs != 0 &&
+			    CPU_ISSET(td.td_oncpu, &stopped_cpus))
 				kt->pcb = (uintptr_t) stoppcbs + sizeof(struct pcb) * td.td_oncpu;
 			else
 				kt->pcb = (uintptr_t)td.td_pcb;
 			kt->kstack = td.td_kstack;
 			kt->tid = td.td_tid;
 			kt->pid = p.p_pid;
 			kt->paddr = paddr;
 			kt->cpu = td.td_oncpu;
 			first = kt;
 			addr = (uintptr_t)TAILQ_NEXT(&td, td_plist);
 		}
 		paddr = (uintptr_t)LIST_NEXT(&p, p_list);
 	}
 	curkthr = kgdb_thr_lookup_tid(dumptid);
 	if (curkthr == NULL)
 		curkthr = first;
 	return (first);
 }
 
 struct kthr *
 kgdb_thr_lookup_tid(int tid)
 {
 	struct kthr *kt;
 
 	kt = first;
 	while (kt != NULL && kt->tid != tid)
 		kt = kt->next;
 	return (kt);
 }
 
 struct kthr *
 kgdb_thr_lookup_taddr(uintptr_t taddr)
 {
 	struct kthr *kt;
 
 	kt = first;
 	while (kt != NULL && kt->kaddr != taddr)
 		kt = kt->next;
 	return (kt);
 }
 
 struct kthr *
 kgdb_thr_lookup_pid(int pid)
 {
 	struct kthr *kt;
 
 	kt = first;
 	while (kt != NULL && kt->pid != pid)
 		kt = kt->next;
 	return (kt);
 }
 
 struct kthr *
 kgdb_thr_lookup_paddr(uintptr_t paddr)
 {
 	struct kthr *kt;
 
 	kt = first;
 	while (kt != NULL && kt->paddr != paddr)
 		kt = kt->next;
 	return (kt);
 }
 
 struct kthr *
 kgdb_thr_next(struct kthr *kt)
 {
 	return (kt->next);
 }
 
 struct kthr *
 kgdb_thr_select(struct kthr *kt)
 {
 	struct kthr *pcur;
 
 	pcur = curkthr;
 	curkthr = kt;
 	return (pcur);
 }
 
 char *
 kgdb_thr_extra_thread_info(int tid)
 {
 	char comm[MAXCOMLEN + 1];
 	char td_name[MAXCOMLEN + 1];
 	struct kthr *kt;
 	struct proc *p;
 	struct thread *t;
 	static char buf[64];
 
 	kt = kgdb_thr_lookup_tid(tid);
 	if (kt == NULL)
 		return (NULL);	
 	snprintf(buf, sizeof(buf), "PID=%d", kt->pid);
 	p = (struct proc *)kt->paddr;
 	if (kvm_read(kvm, (uintptr_t)&p->p_comm[0], &comm, sizeof(comm)) !=
 	    sizeof(comm))
 		return (buf);
 	strlcat(buf, ": ", sizeof(buf));
 	strlcat(buf, comm, sizeof(buf));
 	t = (struct thread *)kt->kaddr;
 	if (kvm_read(kvm, (uintptr_t)&t->td_name[0], &td_name,
 	    sizeof(td_name)) == sizeof(td_name) &&
 	    strcmp(comm, td_name) != 0) {
 		strlcat(buf, "/", sizeof(buf));
 		strlcat(buf, td_name, sizeof(buf));
 	}
 	return (buf);
 }
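Note on the kthr.c change above: the old test, (1 << td.td_oncpu) & stopped_cpus, can only describe as many CPUs as one integer has bits, which is the limit that retiring cpumask_t removes. The sketch below shows the general multi-word bit-set idea behind CPU_ZERO() and CPU_ISSET(). It is an illustration only: the demo_* names, DEMO_MAXCPU, and the layout are assumptions made for this note, not the contents of sys/cpuset.h.

```c
#include <assert.h>
#include <limits.h>
#include <string.h>

/* Hypothetical names throughout; this is not the sys/cpuset.h code. */
#define	DEMO_MAXCPU	256
#define	DEMO_WORDBITS	(sizeof(unsigned long) * CHAR_BIT)
#define	DEMO_NWORDS	((DEMO_MAXCPU + DEMO_WORDBITS - 1) / DEMO_WORDBITS)

/* A set of CPUs stored as an array of words instead of one scalar. */
typedef struct {
	unsigned long bits[DEMO_NWORDS];
} demo_cpuset_t;

/* Clear every CPU in the set. */
static void
demo_cpu_zero(demo_cpuset_t *set)
{
	memset(set, 0, sizeof(*set));
}

/* Add one CPU: pick the word, then the bit inside that word. */
static void
demo_cpu_set(unsigned int cpu, demo_cpuset_t *set)
{
	assert(cpu < DEMO_MAXCPU);
	set->bits[cpu / DEMO_WORDBITS] |= 1UL << (cpu % DEMO_WORDBITS);
}

/* Membership test; out-of-range CPU ids are reported as absent. */
static int
demo_cpu_isset(unsigned int cpu, const demo_cpuset_t *set)
{
	if (cpu >= DEMO_MAXCPU)
		return (0);
	return (((set->bits[cpu / DEMO_WORDBITS] >>
	    (cpu % DEMO_WORDBITS)) & 1UL) != 0);
}
```

With this layout a CPU id such as 100 is stored in a later word rather than being shifted past the width of a single integer, which is why the membership test in the hunk goes through a macro instead of a shift and mask.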
Index: head/gnu/usr.bin/gdb
===================================================================
--- head/gnu/usr.bin/gdb	(revision 222812)
+++ head/gnu/usr.bin/gdb	(revision 222813)

Property changes on: head/gnu/usr.bin/gdb
___________________________________________________________________
Modified: svn:mergeinfo
## -0,0 +0,1 ##
   Merged /projects/largeSMP/gnu/usr.bin/gdb:r221273-222812
Index: head/lib/libc/stdtime
===================================================================
--- head/lib/libc/stdtime	(revision 222812)
+++ head/lib/libc/stdtime	(revision 222813)

Property changes on: head/lib/libc/stdtime
___________________________________________________________________
Modified: svn:mergeinfo
## -0,0 +0,1 ##
   Merged /projects/largeSMP/lib/libc/stdtime:r221273-222812
Index: head/lib/libc
===================================================================
--- head/lib/libc	(revision 222812)
+++ head/lib/libc	(revision 222813)

Property changes on: head/lib/libc
___________________________________________________________________
Modified: svn:mergeinfo
## -0,0 +0,1 ##
   Merged /projects/largeSMP/lib/libc:r221273-222812
Index: head/lib/libkvm/kvm_pcpu.c
===================================================================
--- head/lib/libkvm/kvm_pcpu.c	(revision 222812)
+++ head/lib/libkvm/kvm_pcpu.c	(revision 222813)
@@ -1,291 +1,318 @@
 /*-
  * Copyright (c) 2010 Juniper Networks, Inc.
  * Copyright (c) 2009 Robert N. M. Watson
  * Copyright (c) 2009 Bjoern A. Zeeb <bz@FreeBSD.org>
  * Copyright (c) 2008 Yahoo!, Inc.
  * All rights reserved.
  *
  * Written by: John Baldwin <jhb@FreeBSD.org>
  *
  * This software was developed by Robert N. M. Watson under contract
  * to Juniper Networks, Inc.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
  * 1. Redistributions of source code must retain the above copyright
  *    notice, this list of conditions and the following disclaimer.
  * 2. Redistributions in binary form must reproduce the above copyright
  *    notice, this list of conditions and the following disclaimer in the
  *    documentation and/or other materials provided with the distribution.
  * 3. Neither the name of the author nor the names of any co-contributors
  *    may be used to endorse or promote products derived from this software
  *    without specific prior written permission.
  *
  * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
  * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
  * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
  * ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
  * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
  * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
  * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
  * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
  * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
  * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
  * SUCH DAMAGE.
  */
 
 #include <sys/cdefs.h>
 __FBSDID("$FreeBSD$");
 
 #include <sys/param.h>
+#include <sys/cpuset.h>
 #include <sys/pcpu.h>
 #include <sys/sysctl.h>
 #include <kvm.h>
 #include <limits.h>
 #include <stdlib.h>
+#include <unistd.h>
 
 #include "kvm_private.h"
 
 static struct nlist kvm_pcpu_nl[] = {
 	{ .n_name = "_cpuid_to_pcpu" },
 	{ .n_name = "_mp_maxcpus" },
 	{ .n_name = NULL },
 };
 
 /*
  * Kernel per-CPU data state.  We cache this stuff on the first
  * access.	
  *
  * XXXRW: Possibly, this (and kvmpcpu_nl) should be per-kvm_t, in case the
  * consumer has multiple handles in flight to differently configured
  * kernels/crashdumps.
  */
 static void **pcpu_data;
 static int maxcpu;
 
 #define	NL_CPUID_TO_PCPU	0
 #define	NL_MP_MAXCPUS		1
 
 static int
 _kvm_pcpu_init(kvm_t *kd)
 {
 	size_t len;
 	int max;
 	void *data;
 
 	if (kvm_nlist(kd, kvm_pcpu_nl) < 0)
 		return (-1);
 	if (kvm_pcpu_nl[NL_CPUID_TO_PCPU].n_value == 0) {
 		_kvm_err(kd, kd->program, "unable to find cpuid_to_pcpu");
 		return (-1);
 	}
 	if (kvm_pcpu_nl[NL_MP_MAXCPUS].n_value == 0) {
 		_kvm_err(kd, kd->program, "unable to find mp_maxcpus");
 		return (-1);
 	}
 	if (kvm_read(kd, kvm_pcpu_nl[NL_MP_MAXCPUS].n_value, &max,
 	    sizeof(max)) != sizeof(max)) {
 		_kvm_err(kd, kd->program, "cannot read mp_maxcpus");
 		return (-1);
 	}
 	len = max * sizeof(void *);
 	data = malloc(len);
 	if (data == NULL) {
 		_kvm_err(kd, kd->program, "out of memory");
 		return (-1);
 	}
 	if (kvm_read(kd, kvm_pcpu_nl[NL_CPUID_TO_PCPU].n_value, data, len) !=
 	   (ssize_t)len) {
 		_kvm_err(kd, kd->program, "cannot read cpuid_to_pcpu array");
 		free(data);
 		return (-1);
 	}
 	pcpu_data = data;
 	maxcpu = max;
 	return (0);
 }
 
 static void
 _kvm_pcpu_clear(void)
 {
 
 	maxcpu = 0;
 	free(pcpu_data);
 	pcpu_data = NULL;
 }
 
 void *
 kvm_getpcpu(kvm_t *kd, int cpu)
 {
+	long kcpusetsize;
+	ssize_t nbytes;
+	uintptr_t readptr;
 	char *buf;
 
 	if (kd == NULL) {
 		_kvm_pcpu_clear();
 		return (NULL);
 	}
 
+	kcpusetsize = sysconf(_SC_CPUSET_SIZE);
+	if (kcpusetsize == -1 || (u_long)kcpusetsize > sizeof(cpuset_t))
+		return ((void *)-1);
+
 	if (maxcpu == 0)
 		if (_kvm_pcpu_init(kd) < 0)
 			return ((void *)-1);
 
 	if (cpu >= maxcpu || pcpu_data[cpu] == NULL)
 		return (NULL);
 
 	buf = malloc(sizeof(struct pcpu));
 	if (buf == NULL) {
 		_kvm_err(kd, kd->program, "out of memory");
 		return ((void *)-1);
 	}
-	if (kvm_read(kd, (uintptr_t)pcpu_data[cpu], buf, sizeof(struct pcpu)) !=
-	    sizeof(struct pcpu)) {
+	nbytes = sizeof(struct pcpu) - 2 * kcpusetsize;
+	readptr = (uintptr_t)pcpu_data[cpu];
+	if (kvm_read(kd, readptr, buf, nbytes) != nbytes) {
+		_kvm_err(kd, kd->program, "unable to read per-CPU data");
+		free(buf);
+		return ((void *)-1);
+	}
+
+	/* Fetch the valid cpuset_t objects. */
+	CPU_ZERO((cpuset_t *)(buf + nbytes));
+	CPU_ZERO((cpuset_t *)(buf + nbytes + sizeof(cpuset_t)));
+	readptr += nbytes;
+	if (kvm_read(kd, readptr, buf + nbytes, kcpusetsize) != kcpusetsize) {
+		_kvm_err(kd, kd->program, "unable to read per-CPU data");
+		free(buf);
+		return ((void *)-1);
+	}
+	readptr += kcpusetsize;
+	if (kvm_read(kd, readptr, buf + nbytes + sizeof(cpuset_t),
+	    kcpusetsize) != kcpusetsize) {
 		_kvm_err(kd, kd->program, "unable to read per-CPU data");
 		free(buf);
 		return ((void *)-1);
 	}
 	return (buf);
 }
 
 int
 kvm_getmaxcpu(kvm_t *kd)
 {
 
 	if (kd == NULL) {
 		_kvm_pcpu_clear();
 		return (0);
 	}
 
 	if (maxcpu == 0)
 		if (_kvm_pcpu_init(kd) < 0)
 			return (-1);
 	return (maxcpu);
 }
 
 static int
 _kvm_dpcpu_setcpu(kvm_t *kd, u_int cpu, int report_error)
 {
 
 	if (!kd->dpcpu_initialized) {
 		if (report_error)
 			_kvm_err(kd, kd->program, "%s: not initialized",
 			    __func__);
 		return (-1);
 	}
 	if (cpu >= kd->dpcpu_maxcpus) {
 		if (report_error)
 			_kvm_err(kd, kd->program, "%s: CPU %u too big",
 			    __func__, cpu);
 		return (-1);
 	}
 	if (kd->dpcpu_off[cpu] == 0) {
 		if (report_error)
 			_kvm_err(kd, kd->program, "%s: CPU %u not found",
 			    __func__, cpu);
 		return (-1);
 	}
 	kd->dpcpu_curcpu = cpu;
 	kd->dpcpu_curoff = kd->dpcpu_off[cpu];
 	return (0);
 }
 
 /*
  * Set up libkvm to handle dynamic per-CPU memory.
  */
 static int
 _kvm_dpcpu_init(kvm_t *kd)
 {
 	struct nlist nl[] = {
 #define	NLIST_START_SET_PCPU	0
 		{ .n_name = "___start_" DPCPU_SETNAME },
 #define	NLIST_STOP_SET_PCPU	1
 		{ .n_name = "___stop_" DPCPU_SETNAME },
 #define	NLIST_DPCPU_OFF		2
 		{ .n_name = "_dpcpu_off" },
 #define	NLIST_MP_MAXCPUS	3
 		{ .n_name = "_mp_maxcpus" },
 		{ .n_name = NULL },
 	};
 	uintptr_t *dpcpu_off_buf;
 	size_t len;
 	u_int dpcpu_maxcpus;
 
 	/*
 	 * Locate and cache locations of important symbols using the internal
 	 * version of _kvm_nlist, turning off initialization to avoid
 	 * recursion in case of unresolveable symbols.
 	 */
 	if (_kvm_nlist(kd, nl, 0) != 0)
 		return (-1);
 	if (kvm_read(kd, nl[NLIST_MP_MAXCPUS].n_value, &dpcpu_maxcpus,
 	    sizeof(dpcpu_maxcpus)) != sizeof(dpcpu_maxcpus))
 		return (-1);
 	len = dpcpu_maxcpus * sizeof(*dpcpu_off_buf);
 	dpcpu_off_buf = malloc(len);
 	if (dpcpu_off_buf == NULL)
 		return (-1);
 	if (kvm_read(kd, nl[NLIST_DPCPU_OFF].n_value, dpcpu_off_buf, len) !=
 	    (ssize_t)len) {
 		free(dpcpu_off_buf);
 		return (-1);
 	}
 	kd->dpcpu_start = nl[NLIST_START_SET_PCPU].n_value;
 	kd->dpcpu_stop = nl[NLIST_STOP_SET_PCPU].n_value;
 	kd->dpcpu_maxcpus = dpcpu_maxcpus;
 	kd->dpcpu_off = dpcpu_off_buf;
 	kd->dpcpu_initialized = 1;
 	(void)_kvm_dpcpu_setcpu(kd, 0, 0);
 	return (0);
 }
 
 /*
  * Check whether the dpcpu module has been initialized sucessfully or not,
  * initialize it if permitted.
  */
 int
 _kvm_dpcpu_initialized(kvm_t *kd, int intialize)
 {
 
 	if (kd->dpcpu_initialized || !intialize)
 		return (kd->dpcpu_initialized);
 
 	(void)_kvm_dpcpu_init(kd);
 
 	return (kd->dpcpu_initialized);
 }
 
 /*
  * Check whether the value is within the dpcpu symbol range and only if so
  * adjust the offset relative to the current offset.
  */
 uintptr_t
 _kvm_dpcpu_validaddr(kvm_t *kd, uintptr_t value)
 {
 
 	if (value == 0)
 		return (value);
 
 	if (!kd->dpcpu_initialized)
 		return (value);
 
 	if (value < kd->dpcpu_start || value >= kd->dpcpu_stop)
 		return (value);
 
 	return (kd->dpcpu_curoff + value);
 }
 
 int
 kvm_dpcpu_setcpu(kvm_t *kd, u_int cpu)
 {
 	int ret;
 
 	if (!kd->dpcpu_initialized) {
 		ret = _kvm_dpcpu_init(kd);
 		if (ret != 0) {
 			_kvm_err(kd, kd->program, "%s: init failed",
 			    __func__);
 			return (ret);
 		}
 	}
 
 	return (_kvm_dpcpu_setcpu(kd, cpu, 1));
 }
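Note on the kvm_pcpu.c change above: both kgdb and libkvm now ask sysconf(_SC_CPUSET_SIZE) for the kernel's set size, refuse to continue if it is larger than the userland cpuset_t, zero the destination, and then read only the kernel-sized portion. A minimal sketch of that check, zero, copy pattern follows. demo_fetch_set() is a hypothetical helper written for this note, and memcpy() stands in for kvm_read() so the example does not depend on libkvm.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, not part of libkvm.  Copy a kernel-side set of
 * ksize bytes into a userland buffer of usize bytes.  Returns 0 on
 * success, or -1 when the size is invalid or the kernel set would not
 * fit in the destination.
 */
static int
demo_fetch_set(void *dst, size_t usize, const void *src, long ksize)
{
	/* Reject a failed size query (-1) and oversized kernel sets. */
	if (ksize < 0 || (size_t)ksize > usize)
		return (-1);
	/* Zero first so bytes beyond ksize never hold stale data. */
	memset(dst, 0, usize);
	/* In the diff this step is a kvm_read() from the kernel image. */
	memcpy(dst, src, (size_t)ksize);
	return (0);
}
```

Zeroing before the copy matters when the kernel's set is smaller than the userland type: the trailing bytes then read as "no CPUs" rather than as uninitialized memory.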
Index: head/lib/libmemstat/memstat_uma.c
===================================================================
--- head/lib/libmemstat/memstat_uma.c	(revision 222812)
+++ head/lib/libmemstat/memstat_uma.c	(revision 222813)
@@ -1,467 +1,476 @@
 /*-
  * Copyright (c) 2005-2006 Robert N. M. Watson
  * All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
  * 1. Redistributions of source code must retain the above copyright
  *    notice, this list of conditions and the following disclaimer.
  * 2. Redistributions in binary form must reproduce the above copyright
  *    notice, this list of conditions and the following disclaimer in the
  *    documentation and/or other materials provided with the distribution.
  *
  * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
  * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
  * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
  * ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
  * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
  * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
  * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
  * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
  * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
  * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
  * SUCH DAMAGE.
  *
  * $FreeBSD$
  */
 
 #include <sys/param.h>
+#include <sys/cpuset.h>
 #include <sys/sysctl.h>
 
 #define	LIBMEMSTAT	/* Cause vm_page.h not to include opt_vmpage.h */
 #include <vm/vm.h>
 #include <vm/vm_page.h>
 
 #include <vm/uma.h>
 #include <vm/uma_int.h>
 
 #include <err.h>
 #include <errno.h>
 #include <kvm.h>
 #include <nlist.h>
 #include <stddef.h>
 #include <stdio.h>
 #include <stdlib.h>
 #include <string.h>
+#include <unistd.h>
 
 #include "memstat.h"
 #include "memstat_internal.h"
 
 static struct nlist namelist[] = {
 #define	X_UMA_KEGS	0
 	{ .n_name = "_uma_kegs" },
 #define	X_MP_MAXID	1
 	{ .n_name = "_mp_maxid" },
 #define	X_ALL_CPUS	2
 	{ .n_name = "_all_cpus" },
 	{ .n_name = "" },
 };
 
 /*
  * Extract uma(9) statistics from the running kernel, and store all memory
  * type information in the passed list.  For each type, check the list for an
  * existing entry with the right name/allocator -- if present, update that
  * entry.  Otherwise, add a new entry.  On error, the entire list will be
  * cleared, as entries will be in an inconsistent state.
  *
  * To reduce the level of work for a list that starts empty, we keep around a
  * hint as to whether it was empty when we began, so we can avoid searching
  * the list for entries to update.  Updates are O(n^2) due to searching for
  * each entry before adding it.
  */
 int
 memstat_sysctl_uma(struct memory_type_list *list, int flags)
 {
 	struct uma_stream_header *ushp;
 	struct uma_type_header *uthp;
 	struct uma_percpu_stat *upsp;
 	struct memory_type *mtp;
 	int count, hint_dontsearch, i, j, maxcpus;
 	char *buffer, *p;
 	size_t size;
 
 	hint_dontsearch = LIST_EMPTY(&list->mtl_list);
 
 	/*
 	 * Query the number of CPUs, number of malloc types so that we can
 	 * guess an initial buffer size.  We loop until we succeed or really
 	 * fail.  Note that the value of maxcpus we query using sysctl is not
 	 * the version we use when processing the real data -- that is read
 	 * from the header.
 	 */
 retry:
 	size = sizeof(maxcpus);
 	if (sysctlbyname("kern.smp.maxcpus", &maxcpus, &size, NULL, 0) < 0) {
 		if (errno == EACCES || errno == EPERM)
 			list->mtl_error = MEMSTAT_ERROR_PERMISSION;
 		else
 			list->mtl_error = MEMSTAT_ERROR_DATAERROR;
 		return (-1);
 	}
 	if (size != sizeof(maxcpus)) {
 		list->mtl_error = MEMSTAT_ERROR_DATAERROR;
 		return (-1);
 	}
 
 	if (maxcpus > MEMSTAT_MAXCPU) {
 		list->mtl_error = MEMSTAT_ERROR_TOOMANYCPUS;
 		return (-1);
 	}
 
 	size = sizeof(count);
 	if (sysctlbyname("vm.zone_count", &count, &size, NULL, 0) < 0) {
 		if (errno == EACCES || errno == EPERM)
 			list->mtl_error = MEMSTAT_ERROR_PERMISSION;
 		else
 			list->mtl_error = MEMSTAT_ERROR_VERSION;
 		return (-1);
 	}
 	if (size != sizeof(count)) {
 		list->mtl_error = MEMSTAT_ERROR_DATAERROR;
 		return (-1);
 	}
 
 	size = sizeof(*uthp) + count * (sizeof(*uthp) + sizeof(*upsp) *
 	    maxcpus);
 
 	buffer = malloc(size);
 	if (buffer == NULL) {
 		list->mtl_error = MEMSTAT_ERROR_NOMEMORY;
 		return (-1);
 	}
 
 	if (sysctlbyname("vm.zone_stats", buffer, &size, NULL, 0) < 0) {
 		/*
 		 * XXXRW: ENOMEM is an ambiguous return, we should bound the
 		 * number of loops, perhaps.
 		 */
 		if (errno == ENOMEM) {
 			free(buffer);
 			goto retry;
 		}
 		if (errno == EACCES || errno == EPERM)
 			list->mtl_error = MEMSTAT_ERROR_PERMISSION;
 		else
 			list->mtl_error = MEMSTAT_ERROR_VERSION;
 		free(buffer);
 		return (-1);
 	}
 
 	if (size == 0) {
 		free(buffer);
 		return (0);
 	}
 
 	if (size < sizeof(*ushp)) {
 		list->mtl_error = MEMSTAT_ERROR_VERSION;
 		free(buffer);
 		return (-1);
 	}
 	p = buffer;
 	ushp = (struct uma_stream_header *)p;
 	p += sizeof(*ushp);
 
 	if (ushp->ush_version != UMA_STREAM_VERSION) {
 		list->mtl_error = MEMSTAT_ERROR_VERSION;
 		free(buffer);
 		return (-1);
 	}
 
 	if (ushp->ush_maxcpus > MEMSTAT_MAXCPU) {
 		list->mtl_error = MEMSTAT_ERROR_TOOMANYCPUS;
 		free(buffer);
 		return (-1);
 	}
 
 	/*
 	 * For the remainder of this function, we are quite trusting about
 	 * the layout of structures and sizes, since we've determined we have
 	 * a matching version and acceptable CPU count.
 	 */
 	maxcpus = ushp->ush_maxcpus;
 	count = ushp->ush_count;
 	for (i = 0; i < count; i++) {
 		uthp = (struct uma_type_header *)p;
 		p += sizeof(*uthp);
 
 		if (hint_dontsearch == 0) {
 			mtp = memstat_mtl_find(list, ALLOCATOR_UMA,
 			    uthp->uth_name);
 		} else
 			mtp = NULL;
 		if (mtp == NULL)
 			mtp = _memstat_mt_allocate(list, ALLOCATOR_UMA,
 			    uthp->uth_name);
 		if (mtp == NULL) {
 			_memstat_mtl_empty(list);
 			free(buffer);
 			list->mtl_error = MEMSTAT_ERROR_NOMEMORY;
 			return (-1);
 		}
 
 		/*
 		 * Reset the statistics on a current node.
 		 */
 		_memstat_mt_reset_stats(mtp);
 
 		mtp->mt_numallocs = uthp->uth_allocs;
 		mtp->mt_numfrees = uthp->uth_frees;
 		mtp->mt_failures = uthp->uth_fails;
 		mtp->mt_sleeps = uthp->uth_sleeps;
 
 		for (j = 0; j < maxcpus; j++) {
 			upsp = (struct uma_percpu_stat *)p;
 			p += sizeof(*upsp);
 
 			mtp->mt_percpu_cache[j].mtp_free =
 			    upsp->ups_cache_free;
 			mtp->mt_free += upsp->ups_cache_free;
 			mtp->mt_numallocs += upsp->ups_allocs;
 			mtp->mt_numfrees += upsp->ups_frees;
 		}
 
 		mtp->mt_size = uthp->uth_size;
 		mtp->mt_memalloced = mtp->mt_numallocs * uthp->uth_size;
 		mtp->mt_memfreed = mtp->mt_numfrees * uthp->uth_size;
 		mtp->mt_bytes = mtp->mt_memalloced - mtp->mt_memfreed;
 		mtp->mt_countlimit = uthp->uth_limit;
 		mtp->mt_byteslimit = uthp->uth_limit * uthp->uth_size;
 
 		mtp->mt_count = mtp->mt_numallocs - mtp->mt_numfrees;
 		mtp->mt_zonefree = uthp->uth_zone_free;
 
 		/*
 		 * UMA secondary zones share a keg with the primary zone.  To
 		 * avoid double-reporting of free items, report keg free
 		 * items only in the primary zone.
 		 */
 		if (!(uthp->uth_zone_flags & UTH_ZONE_SECONDARY)) {
 			mtp->mt_kegfree = uthp->uth_keg_free;
 			mtp->mt_free += mtp->mt_kegfree;
 		}
 		mtp->mt_free += mtp->mt_zonefree;
 	}
 
 	free(buffer);
 
 	return (0);
 }
 
 static int
 kread(kvm_t *kvm, void *kvm_pointer, void *address, size_t size,
     size_t offset)
 {
 	ssize_t ret;
 
 	ret = kvm_read(kvm, (unsigned long)kvm_pointer + offset, address,
 	    size);
 	if (ret < 0)
 		return (MEMSTAT_ERROR_KVM);
 	if ((size_t)ret != size)
 		return (MEMSTAT_ERROR_KVM_SHORTREAD);
 	return (0);
 }
 
 static int
 kread_string(kvm_t *kvm, void *kvm_pointer, char *buffer, int buflen)
 {
 	ssize_t ret;
 	int i;
 
 	for (i = 0; i < buflen; i++) {
 		ret = kvm_read(kvm, (unsigned long)kvm_pointer + i,
 		    &(buffer[i]), sizeof(char));
 		if (ret < 0)
 			return (MEMSTAT_ERROR_KVM);
 		if ((size_t)ret != sizeof(char))
 			return (MEMSTAT_ERROR_KVM_SHORTREAD);
 		if (buffer[i] == '\0')
 			return (0);
 	}
 	/* Truncate. */
 	buffer[i-1] = '\0';
 	return (0);
 }
 
 static int
 kread_symbol(kvm_t *kvm, int index, void *address, size_t size,
     size_t offset)
 {
 	ssize_t ret;
 
 	ret = kvm_read(kvm, namelist[index].n_value + offset, address, size);
 	if (ret < 0)
 		return (MEMSTAT_ERROR_KVM);
 	if ((size_t)ret != size)
 		return (MEMSTAT_ERROR_KVM_SHORTREAD);
 	return (0);
 }
 
 /*
  * memstat_kvm_uma() is similar to memstat_sysctl_uma(), only it extracts
  * UMA(9) statistics from a kernel core/memory file.
  */
 int
 memstat_kvm_uma(struct memory_type_list *list, void *kvm_handle)
 {
 	LIST_HEAD(, uma_keg) uma_kegs;
 	struct memory_type *mtp;
 	struct uma_bucket *ubp, ub;
 	struct uma_cache *ucp, *ucp_array;
 	struct uma_zone *uzp, uz;
 	struct uma_keg *kzp, kz;
 	int hint_dontsearch, i, mp_maxid, ret;
 	char name[MEMTYPE_MAXNAME];
-	__cpumask_t all_cpus;
+	cpuset_t all_cpus;
+	long cpusetsize;
 	kvm_t *kvm;
 
 	kvm = (kvm_t *)kvm_handle;
 	hint_dontsearch = LIST_EMPTY(&list->mtl_list);
 	if (kvm_nlist(kvm, namelist) != 0) {
 		list->mtl_error = MEMSTAT_ERROR_KVM;
 		return (-1);
 	}
 	if (namelist[X_UMA_KEGS].n_type == 0 ||
 	    namelist[X_UMA_KEGS].n_value == 0) {
 		list->mtl_error = MEMSTAT_ERROR_KVM_NOSYMBOL;
 		return (-1);
 	}
 	ret = kread_symbol(kvm, X_MP_MAXID, &mp_maxid, sizeof(mp_maxid), 0);
 	if (ret != 0) {
 		list->mtl_error = ret;
 		return (-1);
 	}
 	ret = kread_symbol(kvm, X_UMA_KEGS, &uma_kegs, sizeof(uma_kegs), 0);
 	if (ret != 0) {
 		list->mtl_error = ret;
 		return (-1);
 	}
-	ret = kread_symbol(kvm, X_ALL_CPUS, &all_cpus, sizeof(all_cpus), 0);
+	cpusetsize = sysconf(_SC_CPUSET_SIZE);
+	if (cpusetsize == -1 || (u_long)cpusetsize > sizeof(cpuset_t)) {
+		list->mtl_error = MEMSTAT_ERROR_KVM_NOSYMBOL;
+		return (-1);
+	}
+	CPU_ZERO(&all_cpus);
+	ret = kread_symbol(kvm, X_ALL_CPUS, &all_cpus, cpusetsize, 0);
 	if (ret != 0) {
 		list->mtl_error = ret;
 		return (-1);
 	}
 	ucp_array = malloc(sizeof(struct uma_cache) * (mp_maxid + 1));
 	if (ucp_array == NULL) {
 		list->mtl_error = MEMSTAT_ERROR_NOMEMORY;
 		return (-1);
 	}
 	for (kzp = LIST_FIRST(&uma_kegs); kzp != NULL; kzp =
 	    LIST_NEXT(&kz, uk_link)) {
 		ret = kread(kvm, kzp, &kz, sizeof(kz), 0);
 		if (ret != 0) {
 			free(ucp_array);
 			_memstat_mtl_empty(list);
 			list->mtl_error = ret;
 			return (-1);
 		}
 		for (uzp = LIST_FIRST(&kz.uk_zones); uzp != NULL; uzp =
 		    LIST_NEXT(&uz, uz_link)) {
 			ret = kread(kvm, uzp, &uz, sizeof(uz), 0);
 			if (ret != 0) {
 				free(ucp_array);
 				_memstat_mtl_empty(list);
 				list->mtl_error = ret;
 				return (-1);
 			}
 			ret = kread(kvm, uzp, ucp_array,
 			    sizeof(struct uma_cache) * (mp_maxid + 1),
 			    offsetof(struct uma_zone, uz_cpu[0]));
 			if (ret != 0) {
 				free(ucp_array);
 				_memstat_mtl_empty(list);
 				list->mtl_error = ret;
 				return (-1);
 			}
 			ret = kread_string(kvm, uz.uz_name, name,
 			    MEMTYPE_MAXNAME);
 			if (ret != 0) {
 				free(ucp_array);
 				_memstat_mtl_empty(list);
 				list->mtl_error = ret;
 				return (-1);
 			}
 			if (hint_dontsearch == 0) {
 				mtp = memstat_mtl_find(list, ALLOCATOR_UMA,
 				    name);
 			} else
 				mtp = NULL;
 			if (mtp == NULL)
 				mtp = _memstat_mt_allocate(list, ALLOCATOR_UMA,
 				    name);
 			if (mtp == NULL) {
 				free(ucp_array);
 				_memstat_mtl_empty(list);
 				list->mtl_error = MEMSTAT_ERROR_NOMEMORY;
 				return (-1);
 			}
 			/*
 			 * Reset the statistics on a current node.
 			 */
 			_memstat_mt_reset_stats(mtp);
 			mtp->mt_numallocs = uz.uz_allocs;
 			mtp->mt_numfrees = uz.uz_frees;
 			mtp->mt_failures = uz.uz_fails;
 			mtp->mt_sleeps = uz.uz_sleeps;
 			if (kz.uk_flags & UMA_ZFLAG_INTERNAL)
 				goto skip_percpu;
 			for (i = 0; i < mp_maxid + 1; i++) {
-				if ((all_cpus & (1 << i)) == 0)
+				if (!CPU_ISSET(i, &all_cpus))
 					continue;
 				ucp = &ucp_array[i];
 				mtp->mt_numallocs += ucp->uc_allocs;
 				mtp->mt_numfrees += ucp->uc_frees;
 
 				if (ucp->uc_allocbucket != NULL) {
 					ret = kread(kvm, ucp->uc_allocbucket,
 					    &ub, sizeof(ub), 0);
 					if (ret != 0) {
 						free(ucp_array);
 						_memstat_mtl_empty(list);
 						list->mtl_error = ret;
 						return (-1);
 					}
 					mtp->mt_free += ub.ub_cnt;
 				}
 				if (ucp->uc_freebucket != NULL) {
 					ret = kread(kvm, ucp->uc_freebucket,
 					    &ub, sizeof(ub), 0);
 					if (ret != 0) {
 						free(ucp_array);
 						_memstat_mtl_empty(list);
 						list->mtl_error = ret;
 						return (-1);
 					}
 					mtp->mt_free += ub.ub_cnt;
 				}
 			}
 skip_percpu:
 			mtp->mt_size = kz.uk_size;
 			mtp->mt_memalloced = mtp->mt_numallocs * mtp->mt_size;
 			mtp->mt_memfreed = mtp->mt_numfrees * mtp->mt_size;
 			mtp->mt_bytes = mtp->mt_memalloced - mtp->mt_memfreed;
 			if (kz.uk_ppera > 1)
 				mtp->mt_countlimit = kz.uk_maxpages /
 				    kz.uk_ipers;
 			else
 				mtp->mt_countlimit = kz.uk_maxpages *
 				    kz.uk_ipers;
 			mtp->mt_byteslimit = mtp->mt_countlimit * mtp->mt_size;
 			mtp->mt_count = mtp->mt_numallocs - mtp->mt_numfrees;
 			for (ubp = LIST_FIRST(&uz.uz_full_bucket); ubp !=
 			    NULL; ubp = LIST_NEXT(&ub, ub_link)) {
 				ret = kread(kvm, ubp, &ub, sizeof(ub), 0);
 				mtp->mt_zonefree += ub.ub_cnt;
 			}
 			if (!((kz.uk_flags & UMA_ZONE_SECONDARY) &&
 			    LIST_FIRST(&kz.uk_zones) != uzp)) {
 				mtp->mt_kegfree = kz.uk_free;
 				mtp->mt_free += mtp->mt_kegfree;
 			}
 			mtp->mt_free += mtp->mt_zonefree;
 		}
 	}
 	free(ucp_array);
 	return (0);
 }
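Note on the memstat_kvm_uma() hunk above: the old test
"(all_cpus & (1 << i))" only works while every CPU ID fits in one machine
word, which is the limit the cpuset_t conversion removes.  The sketch below
is a portable illustration of the multi-word bit test that a macro such as
CPU_ISSET() performs.  It is not the FreeBSD implementation; every demo_*
name and the DEMO_MAXCPU value are hypothetical and exist only here.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical limits, chosen only for this illustration. */
#define	DEMO_MAXCPU	256
#define	DEMO_WORD_BITS	(sizeof(unsigned long) * CHAR_BIT)
#define	DEMO_NWORDS	((DEMO_MAXCPU + DEMO_WORD_BITS - 1) / DEMO_WORD_BITS)

/* A CPU set stored as an array of words instead of one integer. */
struct demo_cpuset {
	unsigned long	bits[DEMO_NWORDS];
};

/* Clear every bit, as the patch does before reading the kernel's set. */
static void
demo_zero(struct demo_cpuset *s)
{
	memset(s, 0, sizeof(*s));
}

/* Mark one CPU as present. */
static void
demo_set(struct demo_cpuset *s, size_t cpu)
{
	assert(cpu < DEMO_MAXCPU);
	s->bits[cpu / DEMO_WORD_BITS] |= 1UL << (cpu % DEMO_WORD_BITS);
}

/* Pick the word first, then the bit inside it. */
static int
demo_isset(const struct demo_cpuset *s, size_t cpu)
{
	assert(cpu < DEMO_MAXCPU);
	return ((s->bits[cpu / DEMO_WORD_BITS] &
	    (1UL << (cpu % DEMO_WORD_BITS))) != 0);
}

/* A set is empty only if every word is zero. */
static int
demo_empty(const struct demo_cpuset *s)
{
	size_t i;

	for (i = 0; i < DEMO_NWORDS; i++)
		if (s->bits[i] != 0)
			return (0);
	return (1);
}

/* Count present CPUs in 0..maxid, skipping absent IDs like the loop above. */
static size_t
demo_count_present(const struct demo_cpuset *s, size_t maxid)
{
	size_t cpu, n;

	assert(maxid < DEMO_MAXCPU);
	n = 0;
	for (cpu = 0; cpu <= maxid; cpu++) {
		if (!demo_isset(s, cpu))
			continue;
		n++;
	}
	return (n);
}
```

With a single-word mask, CPU 70 cannot be represented at all; here it
lands in a later word of the array and is tested the same way as CPU 0.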
Index: head/lib/libutil
===================================================================
--- head/lib/libutil	(revision 222812)
+++ head/lib/libutil	(revision 222813)

Property changes on: head/lib/libutil
___________________________________________________________________
Modified: svn:mergeinfo
## -0,0 +0,1 ##
   Merged /projects/largeSMP/lib/libutil:r221273-222812
Index: head/lib/libz
===================================================================
--- head/lib/libz	(revision 222812)
+++ head/lib/libz	(revision 222813)

Property changes on: head/lib/libz
___________________________________________________________________
Modified: svn:mergeinfo
## -0,0 +0,1 ##
   Merged /projects/largeSMP/lib/libz:r221273-222812
Index: head/sbin/ipfw
===================================================================
--- head/sbin/ipfw	(revision 222812)
+++ head/sbin/ipfw	(revision 222813)

Property changes on: head/sbin/ipfw
___________________________________________________________________
Modified: svn:mergeinfo
## -0,0 +0,1 ##
   Merged /projects/largeSMP/sbin/ipfw:r221273-222812
Index: head/sbin
===================================================================
--- head/sbin	(revision 222812)
+++ head/sbin	(revision 222813)

Property changes on: head/sbin
___________________________________________________________________
Modified: svn:mergeinfo
## -0,0 +0,1 ##
   Merged /projects/largeSMP/sbin:r221273-222812
Index: head/share/man/man4/geom_map.4
===================================================================
--- head/share/man/man4/geom_map.4	(revision 222812)
+++ head/share/man/man4/geom_map.4	(revision 222813)

Property changes on: head/share/man/man4/geom_map.4
___________________________________________________________________
Added: svn:eol-style
## -0,0 +1 ##
+native
\ No newline at end of property
Added: svn:mime-type
## -0,0 +1 ##
+text/plain
\ No newline at end of property
Index: head/share/mk/bsd.arch.inc.mk
===================================================================
--- head/share/mk/bsd.arch.inc.mk	(revision 222812)
+++ head/share/mk/bsd.arch.inc.mk	(revision 222813)

Property changes on: head/share/mk/bsd.arch.inc.mk
___________________________________________________________________
Modified: svn:mergeinfo
## -0,0 +0,1 ##
   Merged /projects/largeSMP/share/mk/bsd.arch.inc.mk:r221273-222812
Index: head/share/zoneinfo
===================================================================
--- head/share/zoneinfo	(revision 222812)
+++ head/share/zoneinfo	(revision 222813)

Property changes on: head/share/zoneinfo
___________________________________________________________________
Modified: svn:mergeinfo
## -0,0 +0,1 ##
   Merged /projects/largeSMP/share/zoneinfo:r221273-222812
Index: head/sys/amd64/acpica/acpi_wakeup.c
===================================================================
--- head/sys/amd64/acpica/acpi_wakeup.c	(revision 222812)
+++ head/sys/amd64/acpica/acpi_wakeup.c	(revision 222813)
@@ -1,410 +1,409 @@
 /*-
  * Copyright (c) 2001 Takanori Watanabe <takawata@jp.freebsd.org>
  * Copyright (c) 2001 Mitsuru IWASAKI <iwasaki@jp.freebsd.org>
  * Copyright (c) 2003 Peter Wemm
  * Copyright (c) 2008-2010 Jung-uk Kim <jkim@FreeBSD.org>
  * All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
  * 1. Redistributions of source code must retain the above copyright
  *    notice, this list of conditions and the following disclaimer.
  * 2. Redistributions in binary form must reproduce the above copyright
  *    notice, this list of conditions and the following disclaimer in the
  *    documentation and/or other materials provided with the distribution.
  *
  * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
  * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
  * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
  * ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
  * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
  * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
  * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
  * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
  * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
  * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
  * SUCH DAMAGE.
  */
 
 #include <sys/cdefs.h>
 __FBSDID("$FreeBSD$");
 
 #include <sys/param.h>
 #include <sys/bus.h>
 #include <sys/kernel.h>
 #include <sys/malloc.h>
 #include <sys/memrange.h>
 #include <sys/smp.h>
 
 #include <vm/vm.h>
 #include <vm/pmap.h>
 
 #include <machine/intr_machdep.h>
 #include <x86/mca.h>
 #include <machine/pcb.h>
 #include <machine/pmap.h>
 #include <machine/specialreg.h>
 
 #ifdef SMP
 #include <x86/apicreg.h>
 #include <machine/smp.h>
 #include <machine/vmparam.h>
 #endif
 
 #include <contrib/dev/acpica/include/acpi.h>
 
 #include <dev/acpica/acpivar.h>
 
 #include "acpi_wakecode.h"
 #include "acpi_wakedata.h"
 
 /* Make sure the code is less than a page and leave room for the stack. */
 CTASSERT(sizeof(wakecode) < PAGE_SIZE - 1024);
 
 extern int		acpi_resume_beep;
 extern int		acpi_reset_video;
 
 #ifdef SMP
 extern struct pcb	**susppcbs;
 #else
 static struct pcb	**susppcbs;
 #endif
 
 int			acpi_restorecpu(vm_offset_t, struct pcb *);
 
 static void		*acpi_alloc_wakeup_handler(void);
 static void		acpi_stop_beep(void *);
 
 #ifdef SMP
 static int		acpi_wakeup_ap(struct acpi_softc *, int);
-static void		acpi_wakeup_cpus(struct acpi_softc *, cpumask_t);
+static void		acpi_wakeup_cpus(struct acpi_softc *, const cpuset_t *);
 #endif
 
 #define	WAKECODE_VADDR(sc)	((sc)->acpi_wakeaddr + (3 * PAGE_SIZE))
 #define	WAKECODE_PADDR(sc)	((sc)->acpi_wakephys + (3 * PAGE_SIZE))
 #define	WAKECODE_FIXUP(offset, type, val) do	{	\
 	type	*addr;					\
 	addr = (type *)(WAKECODE_VADDR(sc) + offset);	\
 	*addr = val;					\
 } while (0)
 
 /* Turn off bits 1&2 of the PIT, stopping the beep. */
 static void
 acpi_stop_beep(void *arg)
 {
 	outb(0x61, inb(0x61) & ~0x3);
 }
 
 #ifdef SMP
 static int
 acpi_wakeup_ap(struct acpi_softc *sc, int cpu)
 {
 	int		vector = (WAKECODE_PADDR(sc) >> 12) & 0xff;
 	int		apic_id = cpu_apic_ids[cpu];
 	int		ms;
 
 	WAKECODE_FIXUP(wakeup_pcb, struct pcb *, susppcbs[cpu]);
 	WAKECODE_FIXUP(wakeup_gdt, uint16_t, susppcbs[cpu]->pcb_gdt.rd_limit);
 	WAKECODE_FIXUP(wakeup_gdt + 2, uint64_t,
 	    susppcbs[cpu]->pcb_gdt.rd_base);
 	WAKECODE_FIXUP(wakeup_cpu, int, cpu);
 
 	/* do an INIT IPI: assert RESET */
 	lapic_ipi_raw(APIC_DEST_DESTFLD | APIC_TRIGMOD_EDGE |
 	    APIC_LEVEL_ASSERT | APIC_DESTMODE_PHY | APIC_DELMODE_INIT, apic_id);
 
 	/* wait for pending status end */
 	lapic_ipi_wait(-1);
 
 	/* do an INIT IPI: deassert RESET */
 	lapic_ipi_raw(APIC_DEST_ALLESELF | APIC_TRIGMOD_LEVEL |
 	    APIC_LEVEL_DEASSERT | APIC_DESTMODE_PHY | APIC_DELMODE_INIT, 0);
 
 	/* wait for pending status end */
 	DELAY(10000);		/* wait ~10mS */
 	lapic_ipi_wait(-1);
 
 	/*
 	 * next we do a STARTUP IPI: the previous INIT IPI might still be
 	 * latched, (P5 bug) this 1st STARTUP would then terminate
 	 * immediately, and the previously started INIT IPI would continue. OR
 	 * the previous INIT IPI has already run. and this STARTUP IPI will
 	 * run. OR the previous INIT IPI was ignored. and this STARTUP IPI
 	 * will run.
 	 */
 
 	/* do a STARTUP IPI */
 	lapic_ipi_raw(APIC_DEST_DESTFLD | APIC_TRIGMOD_EDGE |
 	    APIC_LEVEL_DEASSERT | APIC_DESTMODE_PHY | APIC_DELMODE_STARTUP |
 	    vector, apic_id);
 	lapic_ipi_wait(-1);
 	DELAY(200);		/* wait ~200uS */
 
 	/*
 	 * finally we do a 2nd STARTUP IPI: this 2nd STARTUP IPI should run IF
 	 * the previous STARTUP IPI was cancelled by a latched INIT IPI. OR
 	 * this STARTUP IPI will be ignored, as only ONE STARTUP IPI is
 	 * recognized after hardware RESET or INIT IPI.
 	 */
 
 	lapic_ipi_raw(APIC_DEST_DESTFLD | APIC_TRIGMOD_EDGE |
 	    APIC_LEVEL_DEASSERT | APIC_DESTMODE_PHY | APIC_DELMODE_STARTUP |
 	    vector, apic_id);
 	lapic_ipi_wait(-1);
 	DELAY(200);		/* wait ~200uS */
 
 	/* Wait up to 5 seconds for it to start. */
 	for (ms = 0; ms < 5000; ms++) {
 		if (*(int *)(WAKECODE_VADDR(sc) + wakeup_cpu) == 0)
 			return (1);	/* return SUCCESS */
 		DELAY(1000);
 	}
 	return (0);		/* return FAILURE */
 }
 
 #define	WARMBOOT_TARGET		0
 #define	WARMBOOT_OFF		(KERNBASE + 0x0467)
 #define	WARMBOOT_SEG		(KERNBASE + 0x0469)
 
 #define	CMOS_REG		(0x70)
 #define	CMOS_DATA		(0x71)
 #define	BIOS_RESET		(0x0f)
 #define	BIOS_WARM		(0x0a)
 
 static void
-acpi_wakeup_cpus(struct acpi_softc *sc, cpumask_t wakeup_cpus)
+acpi_wakeup_cpus(struct acpi_softc *sc, const cpuset_t *wakeup_cpus)
 {
 	uint32_t	mpbioswarmvec;
 	int		cpu;
 	u_char		mpbiosreason;
 
 	/* save the current value of the warm-start vector */
 	mpbioswarmvec = *((uint32_t *)WARMBOOT_OFF);
 	outb(CMOS_REG, BIOS_RESET);
 	mpbiosreason = inb(CMOS_DATA);
 
 	/* setup a vector to our boot code */
 	*((volatile u_short *)WARMBOOT_OFF) = WARMBOOT_TARGET;
 	*((volatile u_short *)WARMBOOT_SEG) = WAKECODE_PADDR(sc) >> 4;
 	outb(CMOS_REG, BIOS_RESET);
 	outb(CMOS_DATA, BIOS_WARM);	/* 'warm-start' */
 
 	/* Wake up each AP. */
 	for (cpu = 1; cpu < mp_ncpus; cpu++) {
-		if ((wakeup_cpus & (1 << cpu)) == 0)
+		if (!CPU_ISSET(cpu, wakeup_cpus))
 			continue;
 		if (acpi_wakeup_ap(sc, cpu) == 0) {
 			/* restore the warmstart vector */
 			*(uint32_t *)WARMBOOT_OFF = mpbioswarmvec;
 			panic("acpi_wakeup: failed to resume AP #%d (PHY #%d)",
 			    cpu, cpu_apic_ids[cpu]);
 		}
 	}
 
 	/* restore the warmstart vector */
 	*(uint32_t *)WARMBOOT_OFF = mpbioswarmvec;
 
 	outb(CMOS_REG, BIOS_RESET);
 	outb(CMOS_DATA, mpbiosreason);
 }
 #endif
 
 int
 acpi_sleep_machdep(struct acpi_softc *sc, int state)
 {
 #ifdef SMP
-	cpumask_t	wakeup_cpus;
+	cpuset_t	wakeup_cpus;
 #endif
 	register_t	cr3, rf;
 	ACPI_STATUS	status;
 	int		ret;
 
 	ret = -1;
 
 	if (sc->acpi_wakeaddr == 0ul)
 		return (ret);
 
 #ifdef SMP
 	wakeup_cpus = PCPU_GET(other_cpus);
 #endif
 
 	AcpiSetFirmwareWakingVector(WAKECODE_PADDR(sc));
 
 	rf = intr_disable();
 	intr_suspend();
 
 	/*
 	 * Temporarily switch to the kernel pmap because it provides
 	 * an identity mapping (setup at boot) for the low physical
 	 * memory region containing the wakeup code.
 	 */
 	cr3 = rcr3();
 	load_cr3(KPML4phys);
 
 	if (savectx(susppcbs[0])) {
 #ifdef SMP
-		if (wakeup_cpus != 0 && suspend_cpus(wakeup_cpus) == 0) {
-			device_printf(sc->acpi_dev,
-			    "Failed to suspend APs: CPU mask = 0x%jx\n",
-			    (uintmax_t)(wakeup_cpus & ~stopped_cpus));
+		if (!CPU_EMPTY(&wakeup_cpus) &&
+		    suspend_cpus(wakeup_cpus) == 0) {
+			device_printf(sc->acpi_dev, "Failed to suspend APs\n");
 			goto out;
 		}
 #endif
 
 		WAKECODE_FIXUP(resume_beep, uint8_t, (acpi_resume_beep != 0));
 		WAKECODE_FIXUP(reset_video, uint8_t, (acpi_reset_video != 0));
 
 		WAKECODE_FIXUP(wakeup_pcb, struct pcb *, susppcbs[0]);
 		WAKECODE_FIXUP(wakeup_gdt, uint16_t,
 		    susppcbs[0]->pcb_gdt.rd_limit);
 		WAKECODE_FIXUP(wakeup_gdt + 2, uint64_t,
 		    susppcbs[0]->pcb_gdt.rd_base);
 		WAKECODE_FIXUP(wakeup_cpu, int, 0);
 
 		/* Call ACPICA to enter the desired sleep state */
 		if (state == ACPI_STATE_S4 && sc->acpi_s4bios)
 			status = AcpiEnterSleepStateS4bios();
 		else
 			status = AcpiEnterSleepState(state);
 
 		if (status != AE_OK) {
 			device_printf(sc->acpi_dev,
 			    "AcpiEnterSleepState failed - %s\n",
 			    AcpiFormatException(status));
 			goto out;
 		}
 
 		for (;;)
 			ia32_pause();
 	} else {
 		pmap_init_pat();
 		PCPU_SET(switchtime, 0);
 		PCPU_SET(switchticks, ticks);
 #ifdef SMP
-		if (wakeup_cpus != 0)
-			acpi_wakeup_cpus(sc, wakeup_cpus);
+		if (!CPU_EMPTY(&wakeup_cpus))
+			acpi_wakeup_cpus(sc, &wakeup_cpus);
 #endif
 		acpi_resync_clock(sc);
 		ret = 0;
 	}
 
 out:
 #ifdef SMP
-	if (wakeup_cpus != 0)
+	if (!CPU_EMPTY(&wakeup_cpus))
 		restart_cpus(wakeup_cpus);
 #endif
 
 	load_cr3(cr3);
 	mca_resume();
 	intr_resume();
 	intr_restore(rf);
 
 	AcpiSetFirmwareWakingVector(0);
 
 	if (ret == 0 && mem_range_softc.mr_op != NULL &&
 	    mem_range_softc.mr_op->reinit != NULL)
 		mem_range_softc.mr_op->reinit(&mem_range_softc);
 
 	/* If we beeped, turn it off after a delay. */
 	if (acpi_resume_beep)
 		timeout(acpi_stop_beep, NULL, 3 * hz);
 
 	return (ret);
 }
 
 static void *
 acpi_alloc_wakeup_handler(void)
 {
 	void		*wakeaddr;
 	int		i;
 
 	/*
 	 * Specify the region for our wakeup code.  We want it in the low 1 MB
 	 * region, excluding real mode IVT (0-0x3ff), BDA (0x400-0x4ff), EBDA
 	 * (less than 128KB, below 0xa0000, must be excluded by SMAP and DSDT),
 	 * and ROM area (0xa0000 and above).  The temporary page tables must be
 	 * page-aligned.
 	 */
 	wakeaddr = contigmalloc(4 * PAGE_SIZE, M_DEVBUF, M_NOWAIT, 0x500,
 	    0xa0000, PAGE_SIZE, 0ul);
 	if (wakeaddr == NULL) {
 		printf("%s: can't alloc wake memory\n", __func__);
 		return (NULL);
 	}
 	susppcbs = malloc(mp_ncpus * sizeof(*susppcbs), M_DEVBUF, M_WAITOK);
 	for (i = 0; i < mp_ncpus; i++)
 		susppcbs[i] = malloc(sizeof(**susppcbs), M_DEVBUF, M_WAITOK);
 
 	return (wakeaddr);
 }
 
 void
 acpi_install_wakeup_handler(struct acpi_softc *sc)
 {
 	static void	*wakeaddr = NULL;
 	uint64_t	*pt4, *pt3, *pt2;
 	int		i;
 
 	if (wakeaddr != NULL)
 		return;
 
 	wakeaddr = acpi_alloc_wakeup_handler();
 	if (wakeaddr == NULL)
 		return;
 
 	sc->acpi_wakeaddr = (vm_offset_t)wakeaddr;
 	sc->acpi_wakephys = vtophys(wakeaddr);
 
 	bcopy(wakecode, (void *)WAKECODE_VADDR(sc), sizeof(wakecode));
 
 	/* Patch GDT base address, ljmp targets and page table base address. */
 	WAKECODE_FIXUP((bootgdtdesc + 2), uint32_t,
 	    WAKECODE_PADDR(sc) + bootgdt);
 	WAKECODE_FIXUP((wakeup_sw32 + 2), uint32_t,
 	    WAKECODE_PADDR(sc) + wakeup_32);
 	WAKECODE_FIXUP((wakeup_sw64 + 1), uint32_t,
 	    WAKECODE_PADDR(sc) + wakeup_64);
 	WAKECODE_FIXUP(wakeup_pagetables, uint32_t, sc->acpi_wakephys);
 
 	/* Save pointers to some global data. */
 	WAKECODE_FIXUP(wakeup_retaddr, void *, acpi_restorecpu);
 	WAKECODE_FIXUP(wakeup_kpml4, uint64_t, KPML4phys);
 	WAKECODE_FIXUP(wakeup_ctx, vm_offset_t,
 	    WAKECODE_VADDR(sc) + wakeup_ctx);
 	WAKECODE_FIXUP(wakeup_efer, uint64_t, rdmsr(MSR_EFER));
 	WAKECODE_FIXUP(wakeup_star, uint64_t, rdmsr(MSR_STAR));
 	WAKECODE_FIXUP(wakeup_lstar, uint64_t, rdmsr(MSR_LSTAR));
 	WAKECODE_FIXUP(wakeup_cstar, uint64_t, rdmsr(MSR_CSTAR));
 	WAKECODE_FIXUP(wakeup_sfmask, uint64_t, rdmsr(MSR_SF_MASK));
 
 	/* Build temporary page tables below realmode code. */
 	pt4 = wakeaddr;
 	pt3 = pt4 + (PAGE_SIZE) / sizeof(uint64_t);
 	pt2 = pt3 + (PAGE_SIZE) / sizeof(uint64_t);
 
 	/* Create the initial 1GB replicated page tables */
 	for (i = 0; i < 512; i++) {
 		/*
 		 * Each slot of the level 4 pages points
 		 * to the same level 3 page
 		 */
 		pt4[i] = (uint64_t)(sc->acpi_wakephys + PAGE_SIZE);
 		pt4[i] |= PG_V | PG_RW | PG_U;
 
 		/*
 		 * Each slot of the level 3 pages points
 		 * to the same level 2 page
 		 */
 		pt3[i] = (uint64_t)(sc->acpi_wakephys + (2 * PAGE_SIZE));
 		pt3[i] |= PG_V | PG_RW | PG_U;
 
 		/* The level 2 page slots are mapped with 2MB pages for 1GB. */
 		pt2[i] = i * (2 * 1024 * 1024);
 		pt2[i] |= PG_V | PG_RW | PG_PS | PG_U;
 	}
 
 	if (bootverbose)
 		device_printf(sc->acpi_dev, "wakeup code va %p pa %p\n",
 		    (void *)sc->acpi_wakeaddr, (void *)sc->acpi_wakephys);
 }
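Note on the acpi_sleep_machdep() hunk above: the "CPU mask = 0x%jx"
diagnostic is dropped, presumably because a multi-word CPU set can no
longer be cast to a single uintmax_t for printing.  The sketch below shows
one possible way to render such a set as a list of CPU IDs instead.  It is
an illustration under stated assumptions, not code from this change; the
demo_* names and DEMO_MAXCPU are hypothetical.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical limits, chosen only for this illustration. */
#define	DEMO_MAXCPU	128
#define	DEMO_WORD_BITS	(sizeof(unsigned long) * CHAR_BIT)
#define	DEMO_NWORDS	((DEMO_MAXCPU + DEMO_WORD_BITS - 1) / DEMO_WORD_BITS)

struct demo_cpuset {
	unsigned long	bits[DEMO_NWORDS];
};

/* Test one CPU: select the word, then the bit inside it. */
static int
demo_isset(const struct demo_cpuset *s, size_t cpu)
{
	assert(cpu < DEMO_MAXCPU);
	return ((int)((s->bits[cpu / DEMO_WORD_BITS] >>
	    (cpu % DEMO_WORD_BITS)) & 1UL));
}

/*
 * Write the set as comma-separated CPU IDs (for example "1,3,70").
 * The set is passed by const pointer, as acpi_wakeup_cpus() now takes it.
 * Returns 0 on success, or -1 if the buffer is too small.
 */
static int
demo_format(const struct demo_cpuset *s, char *buf, size_t len)
{
	size_t cpu, off;
	int n;

	assert(s != NULL && buf != NULL);
	if (len == 0)
		return (-1);
	buf[0] = '\0';
	off = 0;
	for (cpu = 0; cpu < DEMO_MAXCPU; cpu++) {
		if (!demo_isset(s, cpu))
			continue;
		n = snprintf(buf + off, len - off, "%s%zu",
		    off == 0 ? "" : ",", cpu);
		/* snprintf reports the length it wanted; detect truncation. */
		if (n < 0 || (size_t)n >= len - off)
			return (-1);
		off += (size_t)n;
	}
	return (0);
}
```

A caller would format into a small stack buffer and print the result with
an ordinary "%s", which keeps the diagnostic independent of how many words
the set occupies.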
Index: head/sys/amd64/amd64/intr_machdep.c
===================================================================
--- head/sys/amd64/amd64/intr_machdep.c	(revision 222812)
+++ head/sys/amd64/amd64/intr_machdep.c	(revision 222813)
@@ -1,553 +1,555 @@
 /*-
  * Copyright (c) 2003 John Baldwin <jhb@FreeBSD.org>
  * All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
  * 1. Redistributions of source code must retain the above copyright
  *    notice, this list of conditions and the following disclaimer.
  * 2. Redistributions in binary form must reproduce the above copyright
  *    notice, this list of conditions and the following disclaimer in the
  *    documentation and/or other materials provided with the distribution.
  * 3. Neither the name of the author nor the names of any co-contributors
  *    may be used to endorse or promote products derived from this software
  *    without specific prior written permission.
  *
  * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
  * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
  * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
  * ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
  * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
  * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
  * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
  * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
  * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
  * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
  * SUCH DAMAGE.
  *
  * $FreeBSD$
  */
 
 /*
  * Machine dependent interrupt code for amd64.  For amd64, we have to
  * deal with different PICs.  Thus, we use the passed in vector to lookup
  * an interrupt source associated with that vector.  The interrupt source
  * describes which PIC the source belongs to and includes methods to handle
  * that source.
  */
 
 #include "opt_atpic.h"
 #include "opt_ddb.h"
 
 #include <sys/param.h>
 #include <sys/bus.h>
 #include <sys/interrupt.h>
 #include <sys/ktr.h>
 #include <sys/kernel.h>
 #include <sys/lock.h>
 #include <sys/mutex.h>
 #include <sys/proc.h>
 #include <sys/smp.h>
 #include <sys/syslog.h>
 #include <sys/systm.h>
 #include <machine/clock.h>
 #include <machine/intr_machdep.h>
 #include <machine/smp.h>
 #ifdef DDB
 #include <ddb/ddb.h>
 #endif
 
 #ifndef DEV_ATPIC
 #include <machine/segments.h>
 #include <machine/frame.h>
 #include <dev/ic/i8259.h>
 #include <x86/isa/icu.h>
 #include <x86/isa/isa.h>
 #endif
 
 #define	MAX_STRAY_LOG	5
 
 typedef void (*mask_fn)(void *);
 
 static int intrcnt_index;
 static struct intsrc *interrupt_sources[NUM_IO_INTS];
 static struct mtx intr_table_lock;
 static struct mtx intrcnt_lock;
 static STAILQ_HEAD(, pic) pics;
 
 #ifdef SMP
 static int assign_cpu;
 #endif
 
 static int	intr_assign_cpu(void *arg, u_char cpu);
 static void	intr_disable_src(void *arg);
 static void	intr_init(void *__dummy);
 static int	intr_pic_registered(struct pic *pic);
 static void	intrcnt_setname(const char *name, int index);
 static void	intrcnt_updatename(struct intsrc *is);
 static void	intrcnt_register(struct intsrc *is);
 
 static int
 intr_pic_registered(struct pic *pic)
 {
 	struct pic *p;
 
 	STAILQ_FOREACH(p, &pics, pics) {
 		if (p == pic)
 			return (1);
 	}
 	return (0);
 }
 
 /*
  * Register a new interrupt controller (PIC).  This is to support suspend
  * and resume where we suspend/resume controllers rather than individual
  * sources.  This also allows controllers with no active sources (such as
  * 8259As in a system using the APICs) to participate in suspend and resume.
  */
 int
 intr_register_pic(struct pic *pic)
 {
 	int error;
 
 	mtx_lock(&intr_table_lock);
 	if (intr_pic_registered(pic))
 		error = EBUSY;
 	else {
 		STAILQ_INSERT_TAIL(&pics, pic, pics);
 		error = 0;
 	}
 	mtx_unlock(&intr_table_lock);
 	return (error);
 }
 
 /*
  * Register a new interrupt source with the global interrupt system.
  * The global interrupts need to be disabled when this function is
  * called.
  */
 int
 intr_register_source(struct intsrc *isrc)
 {
 	int error, vector;
 
 	KASSERT(intr_pic_registered(isrc->is_pic), ("unregistered PIC"));
 	vector = isrc->is_pic->pic_vector(isrc);
 	if (interrupt_sources[vector] != NULL)
 		return (EEXIST);
 	error = intr_event_create(&isrc->is_event, isrc, 0, vector,
 	    intr_disable_src, (mask_fn)isrc->is_pic->pic_enable_source,
 	    (mask_fn)isrc->is_pic->pic_eoi_source, intr_assign_cpu, "irq%d:",
 	    vector);
 	if (error)
 		return (error);
 	mtx_lock(&intr_table_lock);
 	if (interrupt_sources[vector] != NULL) {
 		mtx_unlock(&intr_table_lock);
 		intr_event_destroy(isrc->is_event);
 		return (EEXIST);
 	}
 	intrcnt_register(isrc);
 	interrupt_sources[vector] = isrc;
 	isrc->is_handlers = 0;
 	mtx_unlock(&intr_table_lock);
 	return (0);
 }
 
 struct intsrc *
 intr_lookup_source(int vector)
 {
 
 	return (interrupt_sources[vector]);
 }
 
 int
 intr_add_handler(const char *name, int vector, driver_filter_t filter,
     driver_intr_t handler, void *arg, enum intr_type flags, void **cookiep)
 {
 	struct intsrc *isrc;
 	int error;
 
 	isrc = intr_lookup_source(vector);
 	if (isrc == NULL)
 		return (EINVAL);
 	error = intr_event_add_handler(isrc->is_event, name, filter, handler,
 	    arg, intr_priority(flags), flags, cookiep);
 	if (error == 0) {
 		mtx_lock(&intr_table_lock);
 		intrcnt_updatename(isrc);
 		isrc->is_handlers++;
 		if (isrc->is_handlers == 1) {
 			isrc->is_pic->pic_enable_intr(isrc);
 			isrc->is_pic->pic_enable_source(isrc);
 		}
 		mtx_unlock(&intr_table_lock);
 	}
 	return (error);
 }
 
 int
 intr_remove_handler(void *cookie)
 {
 	struct intsrc *isrc;
 	int error;
 
 	isrc = intr_handler_source(cookie);
 	error = intr_event_remove_handler(cookie);
 	if (error == 0) {
 		mtx_lock(&intr_table_lock);
 		isrc->is_handlers--;
 		if (isrc->is_handlers == 0) {
 			isrc->is_pic->pic_disable_source(isrc, PIC_NO_EOI);
 			isrc->is_pic->pic_disable_intr(isrc);
 		}
 		intrcnt_updatename(isrc);
 		mtx_unlock(&intr_table_lock);
 	}
 	return (error);
 }
 
 int
 intr_config_intr(int vector, enum intr_trigger trig, enum intr_polarity pol)
 {
 	struct intsrc *isrc;
 
 	isrc = intr_lookup_source(vector);
 	if (isrc == NULL)
 		return (EINVAL);
 	return (isrc->is_pic->pic_config_intr(isrc, trig, pol));
 }
 
 static void
 intr_disable_src(void *arg)
 {
 	struct intsrc *isrc;
 
 	isrc = arg;
 	isrc->is_pic->pic_disable_source(isrc, PIC_EOI);
 }
 
 void
 intr_execute_handlers(struct intsrc *isrc, struct trapframe *frame)
 {
 	struct intr_event *ie;
 	int vector;
 
 	/*
 	 * We count software interrupts when we process them.  The
 	 * code here follows previous practice, but there's an
 	 * argument for counting hardware interrupts when they're
 	 * processed too.
 	 */
 	(*isrc->is_count)++;
 	PCPU_INC(cnt.v_intr);
 
 	ie = isrc->is_event;
 
 	/*
 	 * XXX: We assume that IRQ 0 is only used for the ISA timer
 	 * device (clk).
 	 */
 	vector = isrc->is_pic->pic_vector(isrc);
 	if (vector == 0)
 		clkintr_pending = 1;
 
 	/*
 	 * For stray interrupts, mask and EOI the source, bump the
 	 * stray count, and log the condition.
 	 */
 	if (intr_event_handle(ie, frame) != 0) {
 		isrc->is_pic->pic_disable_source(isrc, PIC_EOI);
 		(*isrc->is_straycount)++;
 		if (*isrc->is_straycount < MAX_STRAY_LOG)
 			log(LOG_ERR, "stray irq%d\n", vector);
 		else if (*isrc->is_straycount == MAX_STRAY_LOG)
 			log(LOG_CRIT,
 			    "too many stray irq %d's: not logging anymore\n",
 			    vector);
 	}
 }
 
 void
 intr_resume(void)
 {
 	struct pic *pic;
 
 #ifndef DEV_ATPIC
 	atpic_reset();
 #endif
 	mtx_lock(&intr_table_lock);
 	STAILQ_FOREACH(pic, &pics, pics) {
 		if (pic->pic_resume != NULL)
 			pic->pic_resume(pic);
 	}
 	mtx_unlock(&intr_table_lock);
 }
 
 void
 intr_suspend(void)
 {
 	struct pic *pic;
 
 	mtx_lock(&intr_table_lock);
 	STAILQ_FOREACH(pic, &pics, pics) {
 		if (pic->pic_suspend != NULL)
 			pic->pic_suspend(pic);
 	}
 	mtx_unlock(&intr_table_lock);
 }
 
 static int
 intr_assign_cpu(void *arg, u_char cpu)
 {
 #ifdef SMP
 	struct intsrc *isrc;
 	int error;
 
 	/*
 	 * Don't do anything during early boot.  We will pick up the
 	 * assignment once the APs are started.
 	 */
 	if (assign_cpu && cpu != NOCPU) {
 		isrc = arg;
 		mtx_lock(&intr_table_lock);
 		error = isrc->is_pic->pic_assign_cpu(isrc, cpu_apic_ids[cpu]);
 		mtx_unlock(&intr_table_lock);
 	} else
 		error = 0;
 	return (error);
 #else
 	return (EOPNOTSUPP);
 #endif
 }
 
 static void
 intrcnt_setname(const char *name, int index)
 {
 
 	snprintf(intrnames + (MAXCOMLEN + 1) * index, MAXCOMLEN + 1, "%-*s",
 	    MAXCOMLEN, name);
 }
 
 static void
 intrcnt_updatename(struct intsrc *is)
 {
 
 	intrcnt_setname(is->is_event->ie_fullname, is->is_index);
 }
 
 static void
 intrcnt_register(struct intsrc *is)
 {
 	char straystr[MAXCOMLEN + 1];
 
 	KASSERT(is->is_event != NULL, ("%s: isrc with no event", __func__));
 	mtx_lock_spin(&intrcnt_lock);
 	is->is_index = intrcnt_index;
 	intrcnt_index += 2;
 	snprintf(straystr, MAXCOMLEN + 1, "stray irq%d",
 	    is->is_pic->pic_vector(is));
 	intrcnt_updatename(is);
 	is->is_count = &intrcnt[is->is_index];
 	intrcnt_setname(straystr, is->is_index + 1);
 	is->is_straycount = &intrcnt[is->is_index + 1];
 	mtx_unlock_spin(&intrcnt_lock);
 }
 
 void
 intrcnt_add(const char *name, u_long **countp)
 {
 
 	mtx_lock_spin(&intrcnt_lock);
 	*countp = &intrcnt[intrcnt_index];
 	intrcnt_setname(name, intrcnt_index);
 	intrcnt_index++;
 	mtx_unlock_spin(&intrcnt_lock);
 }
 
 static void
 intr_init(void *dummy __unused)
 {
 
 	intrcnt_setname("???", 0);
 	intrcnt_index = 1;
 	STAILQ_INIT(&pics);
 	mtx_init(&intr_table_lock, "intr sources", NULL, MTX_DEF);
 	mtx_init(&intrcnt_lock, "intrcnt", NULL, MTX_SPIN);
 }
 SYSINIT(intr_init, SI_SUB_INTR, SI_ORDER_FIRST, intr_init, NULL);
 
 #ifndef DEV_ATPIC
 /* Initialize the two 8259A's to a known-good shutdown state. */
 void
 atpic_reset(void)
 {
 
 	outb(IO_ICU1, ICW1_RESET | ICW1_IC4);
 	outb(IO_ICU1 + ICU_IMR_OFFSET, IDT_IO_INTS);
 	outb(IO_ICU1 + ICU_IMR_OFFSET, 1 << 2);
 	outb(IO_ICU1 + ICU_IMR_OFFSET, ICW4_8086);
 	outb(IO_ICU1 + ICU_IMR_OFFSET, 0xff);
 	outb(IO_ICU1, OCW3_SEL | OCW3_RR);
 
 	outb(IO_ICU2, ICW1_RESET | ICW1_IC4);
 	outb(IO_ICU2 + ICU_IMR_OFFSET, IDT_IO_INTS + 8);
 	outb(IO_ICU2 + ICU_IMR_OFFSET, 2);
 	outb(IO_ICU2 + ICU_IMR_OFFSET, ICW4_8086);
 	outb(IO_ICU2 + ICU_IMR_OFFSET, 0xff);
 	outb(IO_ICU2, OCW3_SEL | OCW3_RR);
 }
 #endif
 
 /* Add a description to an active interrupt handler. */
 int
 intr_describe(u_int vector, void *ih, const char *descr)
 {
 	struct intsrc *isrc;
 	int error;
 
 	isrc = intr_lookup_source(vector);
 	if (isrc == NULL)
 		return (EINVAL);
 	error = intr_event_describe_handler(isrc->is_event, ih, descr);
 	if (error)
 		return (error);
 	intrcnt_updatename(isrc);
 	return (0);
 }
 
 #ifdef DDB
 /*
  * Dump data about interrupt handlers
  */
 DB_SHOW_COMMAND(irqs, db_show_irqs)
 {
 	struct intsrc **isrc;
 	int i, verbose;
 
 	if (strcmp(modif, "v") == 0)
 		verbose = 1;
 	else
 		verbose = 0;
 	isrc = interrupt_sources;
 	for (i = 0; i < NUM_IO_INTS && !db_pager_quit; i++, isrc++)
 		if (*isrc != NULL)
 			db_dump_intr_event((*isrc)->is_event, verbose);
 }
 #endif
 
 #ifdef SMP
 /*
  * Support for balancing interrupt sources across CPUs.  For now we just
  * allocate CPUs round-robin.
  */
 
-/* The BSP is always a valid target. */
-static cpumask_t intr_cpus = (1 << 0);
+static cpuset_t intr_cpus;
 static int current_cpu;
 
 /*
  * Return the CPU that the next interrupt source should use.  For now
  * this just returns the next local APIC according to round-robin.
  */
 u_int
 intr_next_cpu(void)
 {
 	u_int apic_id;
 
 	/* Leave all interrupts on the BSP during boot. */
 	if (!assign_cpu)
 		return (PCPU_GET(apic_id));
 
 	mtx_lock_spin(&icu_lock);
 	apic_id = cpu_apic_ids[current_cpu];
 	do {
 		current_cpu++;
 		if (current_cpu > mp_maxid)
 			current_cpu = 0;
-	} while (!(intr_cpus & (1 << current_cpu)));
+	} while (!CPU_ISSET(current_cpu, &intr_cpus));
 	mtx_unlock_spin(&icu_lock);
 	return (apic_id);
 }
 
 /* Attempt to bind the specified IRQ to the specified CPU. */
 int
 intr_bind(u_int vector, u_char cpu)
 {
 	struct intsrc *isrc;
 
 	isrc = intr_lookup_source(vector);
 	if (isrc == NULL)
 		return (EINVAL);
 	return (intr_event_bind(isrc->is_event, cpu));
 }
 
 /*
  * Add a CPU to our mask of valid CPUs that can be destinations of
  * interrupts.
  */
 void
 intr_add_cpu(u_int cpu)
 {
 
 	if (cpu >= MAXCPU)
 		panic("%s: Invalid CPU ID", __func__);
 	if (bootverbose)
 		printf("INTR: Adding local APIC %d as a target\n",
 		    cpu_apic_ids[cpu]);
 
-	intr_cpus |= (1 << cpu);
+	CPU_SET(cpu, &intr_cpus);
 }
 
 /*
  * Distribute all the interrupt sources among the available CPUs once the
  * AP's have been launched.
  */
 static void
 intr_shuffle_irqs(void *arg __unused)
 {
 	struct intsrc *isrc;
 	int i;
+
+	/* The BSP is always a valid target. */
+	CPU_SET(0, &intr_cpus);
 
 	/* Don't bother on UP. */
 	if (mp_ncpus == 1)
 		return;
 
 	/* Round-robin assign a CPU to each enabled source. */
 	mtx_lock(&intr_table_lock);
 	assign_cpu = 1;
 	for (i = 0; i < NUM_IO_INTS; i++) {
 		isrc = interrupt_sources[i];
 		if (isrc != NULL && isrc->is_handlers > 0) {
 			/*
 			 * If this event is already bound to a CPU,
 			 * then assign the source to that CPU instead
 			 * of picking one via round-robin.  Note that
 			 * this is careful to only advance the
 			 * round-robin if the CPU assignment succeeds.
 			 */
 			if (isrc->is_event->ie_cpu != NOCPU)
 				(void)isrc->is_pic->pic_assign_cpu(isrc,
 				    cpu_apic_ids[isrc->is_event->ie_cpu]);
 			else if (isrc->is_pic->pic_assign_cpu(isrc,
 				cpu_apic_ids[current_cpu]) == 0)
 				(void)intr_next_cpu();
 
 		}
 	}
 	mtx_unlock(&intr_table_lock);
 }
 SYSINIT(intr_shuffle_irqs, SI_SUB_SMP, SI_ORDER_SECOND, intr_shuffle_irqs,
     NULL);
 #else
 /*
  * Always route interrupts to the current processor in the UP case.
  */
 u_int
 intr_next_cpu(void)
 {
 
 	return (PCPU_GET(apic_id));
 }
 #endif
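
Note on the cpumask_t to cpuset_t conversion in the hunks above: the old code
tested and set CPUs with integer expressions such as (1 << cpu), while the new
code uses the CPU_SET, CPU_ISSET, CPU_SETOF, CPU_OR and CPU_NAND macros on a
multi-word set. The following is a minimal userland sketch of those semantics,
not the kernel's <sys/cpuset.h>. Every demo_* and DEMO_* name is hypothetical
and exists only for illustration. It assumes that the "set of" operation clears
the set before setting a single bit, and that "nand" computes dst &= ~src, which
matches how the diff replaces (all_cpus & ~PCPU_GET(cpumask)). The remark about
CPU IDs beyond one machine word is a likely motivation, not something this diff
states.

```c
#include <assert.h>
#include <limits.h>
#include <string.h>

/* Hypothetical stand-ins for MAXCPU and the cpuset word layout. */
#define DEMO_MAXCPU	128
#define DEMO_WORD_BITS	(sizeof(unsigned long) * CHAR_BIT)
#define DEMO_WORDS	((DEMO_MAXCPU + DEMO_WORD_BITS - 1) / DEMO_WORD_BITS)

typedef struct {
	unsigned long bits[DEMO_WORDS];
} demo_cpuset_t;

/* Clear every CPU in the set (models CPU_ZERO). */
static void
demo_cpu_zero(demo_cpuset_t *s)
{
	memset(s, 0, sizeof(*s));
}

/* Add CPU n to the set (models CPU_SET); other members are kept. */
static void
demo_cpu_set(unsigned n, demo_cpuset_t *s)
{
	s->bits[n / DEMO_WORD_BITS] |= 1UL << (n % DEMO_WORD_BITS);
}

/* Test membership of CPU n (models CPU_ISSET). */
static int
demo_cpu_isset(unsigned n, const demo_cpuset_t *s)
{
	return (((s->bits[n / DEMO_WORD_BITS] >> (n % DEMO_WORD_BITS)) & 1UL) != 0);
}

/* Make the set contain only CPU n (assumed CPU_SETOF behaviour). */
static void
demo_cpu_setof(unsigned n, demo_cpuset_t *s)
{
	demo_cpu_zero(s);
	demo_cpu_set(n, s);
}

/* dst |= src (models CPU_OR). */
static void
demo_cpu_or(demo_cpuset_t *dst, const demo_cpuset_t *src)
{
	for (size_t i = 0; i < DEMO_WORDS; i++)
		dst->bits[i] |= src->bits[i];
}

/* dst &= ~src (models CPU_NAND as used for other_cpus above). */
static void
demo_cpu_nand(demo_cpuset_t *dst, const demo_cpuset_t *src)
{
	for (size_t i = 0; i < DEMO_WORDS; i++)
		dst->bits[i] &= ~src->bits[i];
}

/*
 * Round-robin step shaped like the loop in intr_next_cpu().  The set must
 * contain at least one CPU in [0, maxid], otherwise this never terminates.
 */
static unsigned
demo_next_cpu(unsigned current, unsigned maxid, const demo_cpuset_t *valid)
{
	do {
		current++;
		if (current > maxid)
			current = 0;
	} while (!demo_cpu_isset(current, valid));
	return (current);
}

/* Build an "other CPUs" set the way init_secondary() does in the diff. */
int
demo_selfcheck(void)
{
	demo_cpuset_t all, self, other;

	demo_cpu_zero(&all);
	demo_cpu_set(0, &all);
	demo_cpu_set(1, &all);
	demo_cpu_set(65, &all);		/* beyond a 64-bit word */
	demo_cpu_setof(1, &self);
	other = all;
	demo_cpu_nand(&other, &self);
	assert(demo_cpu_isset(0, &other) && !demo_cpu_isset(1, &other));
	assert(demo_cpu_isset(65, &other));
	demo_cpu_or(&self, &other);
	assert(demo_cpu_isset(65, &self));
	assert(demo_next_cpu(1, DEMO_MAXCPU - 1, &all) == 65);
	return (0);
}
```

In this sketch, the "set of" helper discards earlier members while the plain
"set" helper keeps them, so the choice between the two matters whenever a set
has already been populated by earlier calls.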
Index: head/sys/amd64/amd64/mp_machdep.c
===================================================================
--- head/sys/amd64/amd64/mp_machdep.c	(revision 222812)
+++ head/sys/amd64/amd64/mp_machdep.c	(revision 222813)
@@ -1,1654 +1,1672 @@
 /*-
  * Copyright (c) 1996, by Steve Passe
  * Copyright (c) 2003, by Peter Wemm
  * All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
  * 1. Redistributions of source code must retain the above copyright
  *    notice, this list of conditions and the following disclaimer.
  * 2. The name of the developer may NOT be used to endorse or promote products
  *    derived from this software without specific prior written permission.
  *
  * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
  * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
  * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
  * ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
  * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
  * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
  * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
  * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
  * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
  * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
  * SUCH DAMAGE.
  */
 
 #include <sys/cdefs.h>
 __FBSDID("$FreeBSD$");
 
 #include "opt_cpu.h"
 #include "opt_kstack_pages.h"
 #include "opt_mp_watchdog.h"
 #include "opt_sched.h"
 #include "opt_smp.h"
 
 #include <sys/param.h>
 #include <sys/systm.h>
 #include <sys/bus.h>
+#include <sys/cpuset.h>
 #ifdef GPROF 
 #include <sys/gmon.h>
 #endif
 #include <sys/kernel.h>
 #include <sys/ktr.h>
 #include <sys/lock.h>
 #include <sys/malloc.h>
 #include <sys/memrange.h>
 #include <sys/mutex.h>
 #include <sys/pcpu.h>
 #include <sys/proc.h>
 #include <sys/sched.h>
 #include <sys/smp.h>
 #include <sys/sysctl.h>
 
 #include <vm/vm.h>
 #include <vm/vm_param.h>
 #include <vm/pmap.h>
 #include <vm/vm_kern.h>
 #include <vm/vm_extern.h>
 
 #include <x86/apicreg.h>
 #include <machine/clock.h>
 #include <machine/cputypes.h>
 #include <machine/cpufunc.h>
 #include <x86/mca.h>
 #include <machine/md_var.h>
 #include <machine/mp_watchdog.h>
 #include <machine/pcb.h>
 #include <machine/psl.h>
 #include <machine/smp.h>
 #include <machine/specialreg.h>
 #include <machine/tss.h>
 
 #define WARMBOOT_TARGET		0
 #define WARMBOOT_OFF		(KERNBASE + 0x0467)
 #define WARMBOOT_SEG		(KERNBASE + 0x0469)
 
 #define CMOS_REG		(0x70)
 #define CMOS_DATA		(0x71)
 #define BIOS_RESET		(0x0f)
 #define BIOS_WARM		(0x0a)
 
 /* lock region used by kernel profiling */
 int	mcount_lock;
 
 int	mp_naps;		/* # of Applications processors */
 int	boot_cpu_id = -1;	/* designated BSP */
 
 extern  struct pcpu __pcpu[];
 
 /* AP uses this during bootstrap.  Do not staticize.  */
 char *bootSTK;
 static int bootAP;
 
 /* Free these after use */
 void *bootstacks[MAXCPU];
 
 /* Temporary variables for init_secondary()  */
 char *doublefault_stack;
 char *nmi_stack;
 void *dpcpu;
 
 struct pcb stoppcbs[MAXCPU];
 struct pcb **susppcbs = NULL;
 
 /* Variables needed for SMP tlb shootdown. */
 vm_offset_t smp_tlb_addr1;
 vm_offset_t smp_tlb_addr2;
 volatile int smp_tlb_wait;
 
 #ifdef COUNT_IPIS
 /* Interrupt counts. */
 static u_long *ipi_preempt_counts[MAXCPU];
 static u_long *ipi_ast_counts[MAXCPU];
 u_long *ipi_invltlb_counts[MAXCPU];
 u_long *ipi_invlrng_counts[MAXCPU];
 u_long *ipi_invlpg_counts[MAXCPU];
 u_long *ipi_invlcache_counts[MAXCPU];
 u_long *ipi_rendezvous_counts[MAXCPU];
 static u_long *ipi_hardclock_counts[MAXCPU];
 #endif
 
 extern inthand_t IDTVEC(fast_syscall), IDTVEC(fast_syscall32);
 
 /*
  * Local data and functions.
  */
 
-static volatile cpumask_t ipi_nmi_pending;
+static volatile cpuset_t ipi_nmi_pending;
 
 /* used to hold the AP's until we are ready to release them */
 static struct mtx ap_boot_mtx;
 
 /* Set to 1 once we're ready to let the APs out of the pen. */
 static volatile int aps_ready = 0;
 
 /*
  * Store data from cpu_add() until later in the boot when we actually setup
  * the APs.
  */
 struct cpu_info {
 	int	cpu_present:1;
 	int	cpu_bsp:1;
 	int	cpu_disabled:1;
 	int	cpu_hyperthread:1;
 } static cpu_info[MAX_APIC_ID + 1];
 int cpu_apic_ids[MAXCPU];
 int apic_cpuids[MAX_APIC_ID + 1];
 
 /* Holds pending bitmap based IPIs per CPU */
 static volatile u_int cpu_ipi_pending[MAXCPU];
 
 static u_int boot_address;
 static int cpu_logical;			/* logical cpus per core */
 static int cpu_cores;			/* cores per package */
 
 static void	assign_cpu_ids(void);
 static void	set_interrupt_apic_ids(void);
 static int	start_all_aps(void);
 static int	start_ap(int apic_id);
 static void	release_aps(void *dummy);
 
 static int	hlt_logical_cpus;
 static u_int	hyperthreading_cpus;	/* logical cpus sharing L1 cache */
-static cpumask_t	hyperthreading_cpus_mask;
+static cpuset_t	hyperthreading_cpus_mask;
 static int	hyperthreading_allowed = 1;
 static struct	sysctl_ctx_list logical_cpu_clist;
 static u_int	bootMP_size;
 
 static void
 mem_range_AP_init(void)
 {
 	if (mem_range_softc.mr_op && mem_range_softc.mr_op->initAP)
 		mem_range_softc.mr_op->initAP(&mem_range_softc);
 }
 
 static void
 topo_probe_amd(void)
 {
 	int core_id_bits;
 	int id;
 
 	/* AMD processors do not support HTT. */
 	cpu_logical = 1;
 
 	if ((amd_feature2 & AMDID2_CMP) == 0) {
 		cpu_cores = 1;
 		return;
 	}
 
 	core_id_bits = (cpu_procinfo2 & AMDID_COREID_SIZE) >>
 	    AMDID_COREID_SIZE_SHIFT;
 	if (core_id_bits == 0) {
 		cpu_cores = (cpu_procinfo2 & AMDID_CMP_CORES) + 1;
 		return;
 	}
 
 	/* Fam 10h and newer should get here. */
 	for (id = 0; id <= MAX_APIC_ID; id++) {
 		/* Check logical CPU availability. */
 		if (!cpu_info[id].cpu_present || cpu_info[id].cpu_disabled)
 			continue;
 		/* Check if logical CPU has the same package ID. */
 		if ((id >> core_id_bits) != (boot_cpu_id >> core_id_bits))
 			continue;
 		cpu_cores++;
 	}
 }
 
 /*
  * Round up to the next power of two, if necessary, and then
  * take log2.
  * Returns -1 if argument is zero.
  */
 static __inline int
 mask_width(u_int x)
 {
 
 	return (fls(x << (1 - powerof2(x))) - 1);
 }
 
 static void
 topo_probe_0x4(void)
 {
 	u_int p[4];
 	int pkg_id_bits;
 	int core_id_bits;
 	int max_cores;
 	int max_logical;
 	int id;
 
 	/* Both zero and one here mean one logical processor per package. */
 	max_logical = (cpu_feature & CPUID_HTT) != 0 ?
 	    (cpu_procinfo & CPUID_HTT_CORES) >> 16 : 1;
 	if (max_logical <= 1)
 		return;
 
 	/*
 	 * Because of uniformity assumption we examine only
 	 * those logical processors that belong to the same
 	 * package as BSP.  Further, we count number of
 	 * logical processors that belong to the same core
 	 * as BSP thus deducing number of threads per core.
 	 */
 	if (cpu_high >= 0x4) {
 		cpuid_count(0x04, 0, p);
 		max_cores = ((p[0] >> 26) & 0x3f) + 1;
 	} else
 		max_cores = 1;
 	core_id_bits = mask_width(max_logical/max_cores);
 	if (core_id_bits < 0)
 		return;
 	pkg_id_bits = core_id_bits + mask_width(max_cores);
 
 	for (id = 0; id <= MAX_APIC_ID; id++) {
 		/* Check logical CPU availability. */
 		if (!cpu_info[id].cpu_present || cpu_info[id].cpu_disabled)
 			continue;
 		/* Check if logical CPU has the same package ID. */
 		if ((id >> pkg_id_bits) != (boot_cpu_id >> pkg_id_bits))
 			continue;
 		cpu_cores++;
 		/* Check if logical CPU has the same package and core IDs. */
 		if ((id >> core_id_bits) == (boot_cpu_id >> core_id_bits))
 			cpu_logical++;
 	}
 
 	KASSERT(cpu_cores >= 1 && cpu_logical >= 1,
 	    ("topo_probe_0x4 couldn't find BSP"));
 
 	cpu_cores /= cpu_logical;
 	hyperthreading_cpus = cpu_logical;
 }
 
 static void
 topo_probe_0xb(void)
 {
 	u_int p[4];
 	int bits;
 	int cnt;
 	int i;
 	int logical;
 	int type;
 	int x;
 
 	/* We only support three levels for now. */
 	for (i = 0; i < 3; i++) {
 		cpuid_count(0x0b, i, p);
 
 		/* Fall back if CPU leaf 11 doesn't really exist. */
 		if (i == 0 && p[1] == 0) {
 			topo_probe_0x4();
 			return;
 		}
 
 		bits = p[0] & 0x1f;
 		logical = p[1] &= 0xffff;
 		type = (p[2] >> 8) & 0xff;
 		if (type == 0 || logical == 0)
 			break;
 		/*
 		 * Because of uniformity assumption we examine only
 		 * those logical processors that belong to the same
 		 * package as BSP.
 		 */
 		for (cnt = 0, x = 0; x <= MAX_APIC_ID; x++) {
 			if (!cpu_info[x].cpu_present ||
 			    cpu_info[x].cpu_disabled)
 				continue;
 			if (x >> bits == boot_cpu_id >> bits)
 				cnt++;
 		}
 		if (type == CPUID_TYPE_SMT)
 			cpu_logical = cnt;
 		else if (type == CPUID_TYPE_CORE)
 			cpu_cores = cnt;
 	}
 	if (cpu_logical == 0)
 		cpu_logical = 1;
 	cpu_cores /= cpu_logical;
 }
 
 /*
  * Both topology discovery code and code that consumes topology
  * information assume top-down uniformity of the topology.
  * That is, all physical packages must be identical and each
  * core in a package must have the same number of threads.
  * Topology information is queried only on BSP, on which this
  * code runs and for which it can query CPUID information.
  * Then topology is extrapolated on all packages using the
  * uniformity assumption.
  */
 static void
 topo_probe(void)
 {
 	static int cpu_topo_probed = 0;
 
 	if (cpu_topo_probed)
 		return;
 
-	logical_cpus_mask = 0;
+	CPU_ZERO(&logical_cpus_mask);
 	if (mp_ncpus <= 1)
 		cpu_cores = cpu_logical = 1;
 	else if (cpu_vendor_id == CPU_VENDOR_AMD)
 		topo_probe_amd();
 	else if (cpu_vendor_id == CPU_VENDOR_INTEL) {
 		/*
 		 * See Intel(R) 64 Architecture Processor
 		 * Topology Enumeration article for details.
 		 *
 		 * Note that 0x1 <= cpu_high < 4 case should be
 		 * compatible with topo_probe_0x4() logic when
 		 * CPUID.1:EBX[23:16] > 0 (cpu_cores will be 1)
 		 * or it should trigger the fallback otherwise.
 		 */
 		if (cpu_high >= 0xb)
 			topo_probe_0xb();
 		else if (cpu_high >= 0x1)
 			topo_probe_0x4();
 	}
 
 	/*
 	 * Fallback: assume each logical CPU is in separate
 	 * physical package.  That is, no multi-core, no SMT.
 	 */
 	if (cpu_cores == 0 || cpu_logical == 0)
 		cpu_cores = cpu_logical = 1;
 	cpu_topo_probed = 1;
 }
 
 struct cpu_group *
 cpu_topo(void)
 {
 	int cg_flags;
 
 	/*
 	 * Determine whether any threading flags are
 	 * necessary.
 	 */
 	topo_probe();
 	if (cpu_logical > 1 && hyperthreading_cpus)
 		cg_flags = CG_FLAG_HTT;
 	else if (cpu_logical > 1)
 		cg_flags = CG_FLAG_SMT;
 	else
 		cg_flags = 0;
 	if (mp_ncpus % (cpu_cores * cpu_logical) != 0) {
 		printf("WARNING: Non-uniform processors.\n");
 		printf("WARNING: Using suboptimal topology.\n");
 		return (smp_topo_none());
 	}
 	/*
 	 * No multi-core or hyper-threaded.
 	 */
 	if (cpu_logical * cpu_cores == 1)
 		return (smp_topo_none());
 	/*
 	 * Only HTT no multi-core.
 	 */
 	if (cpu_logical > 1 && cpu_cores == 1)
 		return (smp_topo_1level(CG_SHARE_L1, cpu_logical, cg_flags));
 	/*
 	 * Only multi-core no HTT.
 	 */
 	if (cpu_cores > 1 && cpu_logical == 1)
 		return (smp_topo_1level(CG_SHARE_L2, cpu_cores, cg_flags));
 	/*
 	 * Both HTT and multi-core.
 	 */
 	return (smp_topo_2level(CG_SHARE_L2, cpu_cores,
 	    CG_SHARE_L1, cpu_logical, cg_flags));
 }
 
 /*
  * Calculate usable address in base memory for AP trampoline code.
  */
 u_int
 mp_bootaddress(u_int basemem)
 {
 
 	bootMP_size = mptramp_end - mptramp_start;
 	boot_address = trunc_page(basemem * 1024); /* round down to 4k boundary */
 	if (((basemem * 1024) - boot_address) < bootMP_size)
 		boot_address -= PAGE_SIZE;	/* not enough, lower by 4k */
 	/* 3 levels of page table pages */
 	mptramp_pagetables = boot_address - (PAGE_SIZE * 3);
 
 	return mptramp_pagetables;
 }
 
 void
 cpu_add(u_int apic_id, char boot_cpu)
 {
 
 	if (apic_id > MAX_APIC_ID) {
 		panic("SMP: APIC ID %d too high", apic_id);
 		return;
 	}
 	KASSERT(cpu_info[apic_id].cpu_present == 0, ("CPU %d added twice",
 	    apic_id));
 	cpu_info[apic_id].cpu_present = 1;
 	if (boot_cpu) {
 		KASSERT(boot_cpu_id == -1,
 		    ("CPU %d claims to be BSP, but CPU %d already is", apic_id,
 		    boot_cpu_id));
 		boot_cpu_id = apic_id;
 		cpu_info[apic_id].cpu_bsp = 1;
 	}
 	if (mp_ncpus < MAXCPU) {
 		mp_ncpus++;
 		mp_maxid = mp_ncpus - 1;
 	}
 	if (bootverbose)
 		printf("SMP: Added CPU %d (%s)\n", apic_id, boot_cpu ? "BSP" :
 		    "AP");
 }
 
 void
 cpu_mp_setmaxid(void)
 {
 
 	/*
 	 * mp_maxid should be already set by calls to cpu_add().
 	 * Just sanity check its value here.
 	 */
 	if (mp_ncpus == 0)
 		KASSERT(mp_maxid == 0,
 		    ("%s: mp_ncpus is zero, but mp_maxid is not", __func__));
 	else if (mp_ncpus == 1)
 		mp_maxid = 0;
 	else
 		KASSERT(mp_maxid >= mp_ncpus - 1,
 		    ("%s: counters out of sync: max %d, count %d", __func__,
 			mp_maxid, mp_ncpus));
 }
 
 int
 cpu_mp_probe(void)
 {
 
 	/*
 	 * Always record BSP in CPU map so that the mbuf init code works
 	 * correctly.
 	 */
-	all_cpus = 1;
+	CPU_SETOF(0, &all_cpus);
 	if (mp_ncpus == 0) {
 		/*
 		 * No CPUs were found, so this must be a UP system.  Setup
 		 * the variables to represent a system with a single CPU
 		 * with an id of 0.
 		 */
 		mp_ncpus = 1;
 		return (0);
 	}
 
 	/* At least one CPU was found. */
 	if (mp_ncpus == 1) {
 		/*
 		 * One CPU was found, so this must be a UP system with
 		 * an I/O APIC.
 		 */
 		mp_maxid = 0;
 		return (0);
 	}
 
 	/* At least two CPUs were found. */
 	return (1);
 }
 
 /*
  * Initialize the IPI handlers and start up the AP's.
  */
 void
 cpu_mp_start(void)
 {
 	int i;
 
 	/* Initialize the logical ID to APIC ID table. */
 	for (i = 0; i < MAXCPU; i++) {
 		cpu_apic_ids[i] = -1;
 		cpu_ipi_pending[i] = 0;
 	}
 
 	/* Install an inter-CPU IPI for TLB invalidation */
 	setidt(IPI_INVLTLB, IDTVEC(invltlb), SDT_SYSIGT, SEL_KPL, 0);
 	setidt(IPI_INVLPG, IDTVEC(invlpg), SDT_SYSIGT, SEL_KPL, 0);
 	setidt(IPI_INVLRNG, IDTVEC(invlrng), SDT_SYSIGT, SEL_KPL, 0);
 
 	/* Install an inter-CPU IPI for cache invalidation. */
 	setidt(IPI_INVLCACHE, IDTVEC(invlcache), SDT_SYSIGT, SEL_KPL, 0);
 
 	/* Install an inter-CPU IPI for all-CPU rendezvous */
 	setidt(IPI_RENDEZVOUS, IDTVEC(rendezvous), SDT_SYSIGT, SEL_KPL, 0);
 
 	/* Install generic inter-CPU IPI handler */
 	setidt(IPI_BITMAP_VECTOR, IDTVEC(ipi_intr_bitmap_handler),
 	       SDT_SYSIGT, SEL_KPL, 0);
 
 	/* Install an inter-CPU IPI for CPU stop/restart */
 	setidt(IPI_STOP, IDTVEC(cpustop), SDT_SYSIGT, SEL_KPL, 0);
 
 	/* Install an inter-CPU IPI for CPU suspend/resume */
 	setidt(IPI_SUSPEND, IDTVEC(cpususpend), SDT_SYSIGT, SEL_KPL, 0);
 
 	/* Set boot_cpu_id if needed. */
 	if (boot_cpu_id == -1) {
 		boot_cpu_id = PCPU_GET(apic_id);
 		cpu_info[boot_cpu_id].cpu_bsp = 1;
 	} else
 		KASSERT(boot_cpu_id == PCPU_GET(apic_id),
 		    ("BSP's APIC ID doesn't match boot_cpu_id"));
 
 	/* Probe logical/physical core configuration. */
 	topo_probe();
 
 	assign_cpu_ids();
 
 	/* Start each Application Processor */
 	start_all_aps();
 
 	set_interrupt_apic_ids();
 }
 
 
 /*
  * Print various information about the SMP system hardware and setup.
  */
 void
 cpu_mp_announce(void)
 {
 	const char *hyperthread;
 	int i;
 
 	printf("FreeBSD/SMP: %d package(s) x %d core(s)",
 	    mp_ncpus / (cpu_cores * cpu_logical), cpu_cores);
 	if (hyperthreading_cpus > 1)
 	    printf(" x %d HTT threads", cpu_logical);
 	else if (cpu_logical > 1)
 	    printf(" x %d SMT threads", cpu_logical);
 	printf("\n");
 
 	/* List active CPUs first. */
 	printf(" cpu0 (BSP): APIC ID: %2d\n", boot_cpu_id);
 	for (i = 1; i < mp_ncpus; i++) {
 		if (cpu_info[cpu_apic_ids[i]].cpu_hyperthread)
 			hyperthread = "/HT";
 		else
 			hyperthread = "";
 		printf(" cpu%d (AP%s): APIC ID: %2d\n", i, hyperthread,
 		    cpu_apic_ids[i]);
 	}
 
 	/* List disabled CPUs last. */
 	for (i = 0; i <= MAX_APIC_ID; i++) {
 		if (!cpu_info[i].cpu_present || !cpu_info[i].cpu_disabled)
 			continue;
 		if (cpu_info[i].cpu_hyperthread)
 			hyperthread = "/HT";
 		else
 			hyperthread = "";
 		printf("  cpu (AP%s): APIC ID: %2d (disabled)\n", hyperthread,
 		    i);
 	}
 }
 
 /*
  * AP CPU's call this to initialize themselves.
  */
 void
 init_secondary(void)
 {
+	cpuset_t tcpuset, tallcpus;
 	struct pcpu *pc;
 	struct nmi_pcpu *np;
 	u_int64_t msr, cr0;
 	int cpu, gsel_tss, x;
 	struct region_descriptor ap_gdt;
 
 	/* Set by the startup code for us to use */
 	cpu = bootAP;
 
 	/* Init tss */
 	common_tss[cpu] = common_tss[0];
 	common_tss[cpu].tss_rsp0 = 0;   /* not used until after switch */
 	common_tss[cpu].tss_iobase = sizeof(struct amd64tss) +
 	    IOPAGES * PAGE_SIZE;
 	common_tss[cpu].tss_ist1 = (long)&doublefault_stack[PAGE_SIZE];
 
 	/* The NMI stack runs on IST2. */
 	np = ((struct nmi_pcpu *) &nmi_stack[PAGE_SIZE]) - 1;
 	common_tss[cpu].tss_ist2 = (long) np;
 
 	/* Prepare private GDT */
 	gdt_segs[GPROC0_SEL].ssd_base = (long) &common_tss[cpu];
 	for (x = 0; x < NGDT; x++) {
 		if (x != GPROC0_SEL && x != (GPROC0_SEL + 1) &&
 		    x != GUSERLDT_SEL && x != (GUSERLDT_SEL + 1))
 			ssdtosd(&gdt_segs[x], &gdt[NGDT * cpu + x]);
 	}
 	ssdtosyssd(&gdt_segs[GPROC0_SEL],
 	    (struct system_segment_descriptor *)&gdt[NGDT * cpu + GPROC0_SEL]);
 	ap_gdt.rd_limit = NGDT * sizeof(gdt[0]) - 1;
 	ap_gdt.rd_base =  (long) &gdt[NGDT * cpu];
 	lgdt(&ap_gdt);			/* does magic intra-segment return */
 
 	/* Get per-cpu data */
 	pc = &__pcpu[cpu];
 
 	/* prime data page for it to use */
 	pcpu_init(pc, cpu, sizeof(struct pcpu));
 	dpcpu_init(dpcpu, cpu);
 	pc->pc_apic_id = cpu_apic_ids[cpu];
 	pc->pc_prvspace = pc;
 	pc->pc_curthread = 0;
 	pc->pc_tssp = &common_tss[cpu];
 	pc->pc_commontssp = &common_tss[cpu];
 	pc->pc_rsp0 = 0;
 	pc->pc_tss = (struct system_segment_descriptor *)&gdt[NGDT * cpu +
 	    GPROC0_SEL];
 	pc->pc_fs32p = &gdt[NGDT * cpu + GUFS32_SEL];
 	pc->pc_gs32p = &gdt[NGDT * cpu + GUGS32_SEL];
 	pc->pc_ldt = (struct system_segment_descriptor *)&gdt[NGDT * cpu +
 	    GUSERLDT_SEL];
 
 	/* Save the per-cpu pointer for use by the NMI handler. */
 	np->np_pcpu = (register_t) pc;
 
 	wrmsr(MSR_FSBASE, 0);		/* User value */
 	wrmsr(MSR_GSBASE, (u_int64_t)pc);
 	wrmsr(MSR_KGSBASE, (u_int64_t)pc);	/* XXX User value while we're in the kernel */
 
 	lidt(&r_idt);
 
 	gsel_tss = GSEL(GPROC0_SEL, SEL_KPL);
 	ltr(gsel_tss);
 
 	/*
 	 * Set to a known state:
 	 * Set by mpboot.s: CR0_PG, CR0_PE
 	 * Set by cpu_setregs: CR0_NE, CR0_MP, CR0_TS, CR0_WP, CR0_AM
 	 */
 	cr0 = rcr0();
 	cr0 &= ~(CR0_CD | CR0_NW | CR0_EM);
 	load_cr0(cr0);
 
 	/* Set up the fast syscall stuff */
 	msr = rdmsr(MSR_EFER) | EFER_SCE;
 	wrmsr(MSR_EFER, msr);
 	wrmsr(MSR_LSTAR, (u_int64_t)IDTVEC(fast_syscall));
 	wrmsr(MSR_CSTAR, (u_int64_t)IDTVEC(fast_syscall32));
 	msr = ((u_int64_t)GSEL(GCODE_SEL, SEL_KPL) << 32) |
 	      ((u_int64_t)GSEL(GUCODE32_SEL, SEL_UPL) << 48);
 	wrmsr(MSR_STAR, msr);
 	wrmsr(MSR_SF_MASK, PSL_NT|PSL_T|PSL_I|PSL_C|PSL_D);
 
 	/* Disable local APIC just to be sure. */
 	lapic_disable();
 
 	/* signal our startup to the BSP. */
 	mp_naps++;
 
 	/* Spin until the BSP releases the AP's. */
 	while (!aps_ready)
 		ia32_pause();
 
 	/* Initialize the PAT MSR. */
 	pmap_init_pat();
 
 	/* set up CPU registers and state */
 	cpu_setregs();
 
 	/* set up SSE/NX registers */
 	initializecpu();
 
 	/* set up FPU state on the AP */
 	fpuinit();
 
 	/* A quick check from sanity claus */
 	if (PCPU_GET(apic_id) != lapic_id()) {
 		printf("SMP: cpuid = %d\n", PCPU_GET(cpuid));
 		printf("SMP: actual apic_id = %d\n", lapic_id());
 		printf("SMP: correct apic_id = %d\n", PCPU_GET(apic_id));
 		panic("cpuid mismatch! boom!!");
 	}
 
 	/* Initialize curthread. */
 	KASSERT(PCPU_GET(idlethread) != NULL, ("no idle thread"));
 	PCPU_SET(curthread, PCPU_GET(idlethread));
 
 	mca_init();
 
 	mtx_lock_spin(&ap_boot_mtx);
 
 	/* Init local apic for irq's */
 	lapic_setup(1);
 
 	/* Set memory range attributes for this CPU to match the BSP */
 	mem_range_AP_init();
 
 	smp_cpus++;
 
 	CTR1(KTR_SMP, "SMP: AP CPU #%d Launched", PCPU_GET(cpuid));
 	printf("SMP: AP CPU #%d Launched!\n", PCPU_GET(cpuid));
+	tcpuset = PCPU_GET(cpumask);
 
 	/* Determine if we are a logical CPU. */
 	/* XXX Calculation depends on cpu_logical being a power of 2, e.g. 2 */
 	if (cpu_logical > 1 && PCPU_GET(apic_id) % cpu_logical != 0)
-		logical_cpus_mask |= PCPU_GET(cpumask);
-	
+		CPU_OR(&logical_cpus_mask, &tcpuset);
+
 	/* Determine if we are a hyperthread. */
 	if (hyperthreading_cpus > 1 &&
 	    PCPU_GET(apic_id) % hyperthreading_cpus != 0)
-		hyperthreading_cpus_mask |= PCPU_GET(cpumask);
+		CPU_OR(&hyperthreading_cpus_mask, &tcpuset);
 
 	/* Build our map of 'other' CPUs. */
-	PCPU_SET(other_cpus, all_cpus & ~PCPU_GET(cpumask));
+	tallcpus = all_cpus;
+	CPU_NAND(&tallcpus, &tcpuset);
+	PCPU_SET(other_cpus, tallcpus);
 
 	if (bootverbose)
 		lapic_dump("AP");
 
 	if (smp_cpus == mp_ncpus) {
 		/* enable IPI's, tlb shootdown, freezes etc */
 		atomic_store_rel_int(&smp_started, 1);
 		smp_active = 1;	 /* historic */
 	}
 
 	/*
 	 * Enable global pages TLB extension
 	 * This also implicitly flushes the TLB 
 	 */
 
 	load_cr4(rcr4() | CR4_PGE);
 	load_ds(_udatasel);
 	load_es(_udatasel);
 	load_fs(_ufssel);
 	mtx_unlock_spin(&ap_boot_mtx);
 
 	/* Wait until all the AP's are up. */
 	while (smp_started == 0)
 		ia32_pause();
 
 	/* Start per-CPU event timers. */
 	cpu_initclocks_ap();
 
 	sched_throw(NULL);
 
 	panic("scheduler returned us to %s", __func__);
 	/* NOTREACHED */
 }
 
 /*******************************************************************
  * local functions and data
  */
 
 /*
  * We tell the I/O APIC code about all the CPUs we want to receive
  * interrupts.  If we don't want certain CPUs to receive IRQs we
  * can simply not tell the I/O APIC code about them in this function.
  * We also do not tell it about the BSP since it tells itself about
  * the BSP internally to work with UP kernels and on UP machines.
  */
 static void
 set_interrupt_apic_ids(void)
 {
 	u_int i, apic_id;
 
 	for (i = 0; i < MAXCPU; i++) {
 		apic_id = cpu_apic_ids[i];
 		if (apic_id == -1)
 			continue;
 		if (cpu_info[apic_id].cpu_bsp)
 			continue;
 		if (cpu_info[apic_id].cpu_disabled)
 			continue;
 
 		/* Don't let hyperthreads service interrupts. */
 		if (hyperthreading_cpus > 1 &&
 		    apic_id % hyperthreading_cpus != 0)
 			continue;
 
 		intr_add_cpu(i);
 	}
 }
 
 /*
  * Assign logical CPU IDs to local APICs.
  */
 static void
 assign_cpu_ids(void)
 {
 	u_int i;
 
 	TUNABLE_INT_FETCH("machdep.hyperthreading_allowed",
 	    &hyperthreading_allowed);
 
 	/* Check for explicitly disabled CPUs. */
 	for (i = 0; i <= MAX_APIC_ID; i++) {
 		if (!cpu_info[i].cpu_present || cpu_info[i].cpu_bsp)
 			continue;
 
 		if (hyperthreading_cpus > 1 && i % hyperthreading_cpus != 0) {
 			cpu_info[i].cpu_hyperthread = 1;
 #if defined(SCHED_ULE)
 			/*
 			 * Don't use HT CPU if it has been disabled by a
 			 * tunable.
 			 */
 			if (hyperthreading_allowed == 0) {
 				cpu_info[i].cpu_disabled = 1;
 				continue;
 			}
 #endif
 		}
 
 		/* Don't use this CPU if it has been disabled by a tunable. */
 		if (resource_disabled("lapic", i)) {
 			cpu_info[i].cpu_disabled = 1;
 			continue;
 		}
 	}
 
 	/*
 	 * Assign CPU IDs to local APIC IDs and disable any CPUs
 	 * beyond MAXCPU.  CPU 0 is always assigned to the BSP.
 	 *
 	 * To minimize confusion for userland, we attempt to number
 	 * CPUs such that all threads and cores in a package are
 	 * grouped together.  For now we assume that the BSP is always
 	 * the first thread in a package and just start adding APs
 	 * starting with the BSP's APIC ID.
 	 */
 	mp_ncpus = 1;
 	cpu_apic_ids[0] = boot_cpu_id;
 	apic_cpuids[boot_cpu_id] = 0;
 	for (i = boot_cpu_id + 1; i != boot_cpu_id;
 	     i == MAX_APIC_ID ? i = 0 : i++) {
 		if (!cpu_info[i].cpu_present || cpu_info[i].cpu_bsp ||
 		    cpu_info[i].cpu_disabled)
 			continue;
 
 		if (mp_ncpus < MAXCPU) {
 			cpu_apic_ids[mp_ncpus] = i;
 			apic_cpuids[i] = mp_ncpus;
 			mp_ncpus++;
 		} else
 			cpu_info[i].cpu_disabled = 1;
 	}
 	KASSERT(mp_maxid >= mp_ncpus - 1,
 	    ("%s: counters out of sync: max %d, count %d", __func__, mp_maxid,
 	    mp_ncpus));		
 }
 
 /*
  * start each AP in our list
  */
 static int
 start_all_aps(void)
 {
+	cpuset_t tallcpus, tcpuset;
 	vm_offset_t va = boot_address + KERNBASE;
 	u_int64_t *pt4, *pt3, *pt2;
 	u_int32_t mpbioswarmvec;
 	int apic_id, cpu, i;
 	u_char mpbiosreason;
 
 	mtx_init(&ap_boot_mtx, "ap boot", NULL, MTX_SPIN);
 
 	/* install the AP 1st level boot code */
 	pmap_kenter(va, boot_address);
 	pmap_invalidate_page(kernel_pmap, va);
 	bcopy(mptramp_start, (void *)va, bootMP_size);
 
 	/* Locate the page tables, they'll be below the trampoline */
 	pt4 = (u_int64_t *)(uintptr_t)(mptramp_pagetables + KERNBASE);
 	pt3 = pt4 + (PAGE_SIZE) / sizeof(u_int64_t);
 	pt2 = pt3 + (PAGE_SIZE) / sizeof(u_int64_t);
 
 	/* Create the initial 1GB replicated page tables */
 	for (i = 0; i < 512; i++) {
 		/* Each slot of the level 4 pages points to the same level 3 page */
 		pt4[i] = (u_int64_t)(uintptr_t)(mptramp_pagetables + PAGE_SIZE);
 		pt4[i] |= PG_V | PG_RW | PG_U;
 
 		/* Each slot of the level 3 pages points to the same level 2 page */
 		pt3[i] = (u_int64_t)(uintptr_t)(mptramp_pagetables + (2 * PAGE_SIZE));
 		pt3[i] |= PG_V | PG_RW | PG_U;
 
 		/* The level 2 page slots are mapped with 2MB pages for 1GB. */
 		pt2[i] = i * (2 * 1024 * 1024);
 		pt2[i] |= PG_V | PG_RW | PG_PS | PG_U;
 	}
 
 	/* save the current value of the warm-start vector */
 	mpbioswarmvec = *((u_int32_t *) WARMBOOT_OFF);
 	outb(CMOS_REG, BIOS_RESET);
 	mpbiosreason = inb(CMOS_DATA);
 
 	/* setup a vector to our boot code */
 	*((volatile u_short *) WARMBOOT_OFF) = WARMBOOT_TARGET;
 	*((volatile u_short *) WARMBOOT_SEG) = (boot_address >> 4);
 	outb(CMOS_REG, BIOS_RESET);
 	outb(CMOS_DATA, BIOS_WARM);	/* 'warm-start' */
 
 	/* start each AP */
 	for (cpu = 1; cpu < mp_ncpus; cpu++) {
 		apic_id = cpu_apic_ids[cpu];
 
 		/* allocate and set up an idle stack data page */
 		bootstacks[cpu] = (void *)kmem_alloc(kernel_map, KSTACK_PAGES * PAGE_SIZE);
 		doublefault_stack = (char *)kmem_alloc(kernel_map, PAGE_SIZE);
 		nmi_stack = (char *)kmem_alloc(kernel_map, PAGE_SIZE);
 		dpcpu = (void *)kmem_alloc(kernel_map, DPCPU_SIZE);
 
 		bootSTK = (char *)bootstacks[cpu] + KSTACK_PAGES * PAGE_SIZE - 8;
 		bootAP = cpu;
 
 		/* attempt to start the Application Processor */
 		if (!start_ap(apic_id)) {
 			/* restore the warmstart vector */
 			*(u_int32_t *) WARMBOOT_OFF = mpbioswarmvec;
 			panic("AP #%d (PHY# %d) failed!", cpu, apic_id);
 		}
 
-		all_cpus |= (1 << cpu);		/* record AP in CPU map */
+		CPU_SET(cpu, &all_cpus);	/* record AP in CPU map */
 	}
 
 	/* build our map of 'other' CPUs */
-	PCPU_SET(other_cpus, all_cpus & ~PCPU_GET(cpumask));
+	tallcpus = all_cpus;
+	tcpuset = PCPU_GET(cpumask);
+	CPU_NAND(&tallcpus, &tcpuset);
+	PCPU_SET(other_cpus, tallcpus);
 
 	/* restore the warmstart vector */
 	*(u_int32_t *) WARMBOOT_OFF = mpbioswarmvec;
 
 	outb(CMOS_REG, BIOS_RESET);
 	outb(CMOS_DATA, mpbiosreason);
 
 	/* number of APs actually started */
 	return mp_naps;
 }
 
 
 /*
  * This function starts the AP (application processor) identified
  * by the APIC ID 'physicalCpu'.  It does quite a "song and dance"
  * to accomplish this.  This is necessary because of the nuances
  * of the different hardware we might encounter.  It isn't pretty,
  * but it seems to work.
  */
 static int
 start_ap(int apic_id)
 {
 	int vector, ms;
 	int cpus;
 
 	/* calculate the vector */
 	vector = (boot_address >> 12) & 0xff;
 
 	/* used as a watchpoint to signal AP startup */
 	cpus = mp_naps;
 
 	/*
 	 * first we do an INIT/RESET IPI this INIT IPI might be run, reseting
 	 * and running the target CPU. OR this INIT IPI might be latched (P5
 	 * bug), CPU waiting for STARTUP IPI. OR this INIT IPI might be
 	 * ignored.
 	 */
 
 	/* do an INIT IPI: assert RESET */
 	lapic_ipi_raw(APIC_DEST_DESTFLD | APIC_TRIGMOD_EDGE |
 	    APIC_LEVEL_ASSERT | APIC_DESTMODE_PHY | APIC_DELMODE_INIT, apic_id);
 
 	/* wait for pending status end */
 	lapic_ipi_wait(-1);
 
 	/* do an INIT IPI: deassert RESET */
 	lapic_ipi_raw(APIC_DEST_ALLESELF | APIC_TRIGMOD_LEVEL |
 	    APIC_LEVEL_DEASSERT | APIC_DESTMODE_PHY | APIC_DELMODE_INIT, 0);
 
 	/* wait for pending status end */
 	DELAY(10000);		/* wait ~10mS */
 	lapic_ipi_wait(-1);
 
 	/*
 	 * next we do a STARTUP IPI: the previous INIT IPI might still be
 	 * latched, (P5 bug) this 1st STARTUP would then terminate
 	 * immediately, and the previously started INIT IPI would continue. OR
 	 * the previous INIT IPI has already run. and this STARTUP IPI will
 	 * run. OR the previous INIT IPI was ignored. and this STARTUP IPI
 	 * will run.
 	 */
 
 	/* do a STARTUP IPI */
 	lapic_ipi_raw(APIC_DEST_DESTFLD | APIC_TRIGMOD_EDGE |
 	    APIC_LEVEL_DEASSERT | APIC_DESTMODE_PHY | APIC_DELMODE_STARTUP |
 	    vector, apic_id);
 	lapic_ipi_wait(-1);
 	DELAY(200);		/* wait ~200uS */
 
 	/*
 	 * finally we do a 2nd STARTUP IPI: this 2nd STARTUP IPI should run IF
 	 * the previous STARTUP IPI was cancelled by a latched INIT IPI. OR
 	 * this STARTUP IPI will be ignored, as only ONE STARTUP IPI is
 	 * recognized after hardware RESET or INIT IPI.
 	 */
 
 	lapic_ipi_raw(APIC_DEST_DESTFLD | APIC_TRIGMOD_EDGE |
 	    APIC_LEVEL_DEASSERT | APIC_DESTMODE_PHY | APIC_DELMODE_STARTUP |
 	    vector, apic_id);
 	lapic_ipi_wait(-1);
 	DELAY(200);		/* wait ~200uS */
 
 	/* Wait up to 5 seconds for it to start. */
 	for (ms = 0; ms < 5000; ms++) {
 		if (mp_naps > cpus)
 			return 1;	/* return SUCCESS */
 		DELAY(1000);
 	}
 	return 0;		/* return FAILURE */
 }
 
 #ifdef COUNT_XINVLTLB_HITS
 u_int xhits_gbl[MAXCPU];
 u_int xhits_pg[MAXCPU];
 u_int xhits_rng[MAXCPU];
 SYSCTL_NODE(_debug, OID_AUTO, xhits, CTLFLAG_RW, 0, "");
 SYSCTL_OPAQUE(_debug_xhits, OID_AUTO, global, CTLFLAG_RW, &xhits_gbl,
     sizeof(xhits_gbl), "IU", "");
 SYSCTL_OPAQUE(_debug_xhits, OID_AUTO, page, CTLFLAG_RW, &xhits_pg,
     sizeof(xhits_pg), "IU", "");
 SYSCTL_OPAQUE(_debug_xhits, OID_AUTO, range, CTLFLAG_RW, &xhits_rng,
     sizeof(xhits_rng), "IU", "");
 
 u_int ipi_global;
 u_int ipi_page;
 u_int ipi_range;
 u_int ipi_range_size;
 SYSCTL_UINT(_debug_xhits, OID_AUTO, ipi_global, CTLFLAG_RW, &ipi_global, 0, "");
 SYSCTL_UINT(_debug_xhits, OID_AUTO, ipi_page, CTLFLAG_RW, &ipi_page, 0, "");
 SYSCTL_UINT(_debug_xhits, OID_AUTO, ipi_range, CTLFLAG_RW, &ipi_range, 0, "");
 SYSCTL_UINT(_debug_xhits, OID_AUTO, ipi_range_size, CTLFLAG_RW,
     &ipi_range_size, 0, "");
 
 u_int ipi_masked_global;
 u_int ipi_masked_page;
 u_int ipi_masked_range;
 u_int ipi_masked_range_size;
 SYSCTL_UINT(_debug_xhits, OID_AUTO, ipi_masked_global, CTLFLAG_RW,
     &ipi_masked_global, 0, "");
 SYSCTL_UINT(_debug_xhits, OID_AUTO, ipi_masked_page, CTLFLAG_RW,
     &ipi_masked_page, 0, "");
 SYSCTL_UINT(_debug_xhits, OID_AUTO, ipi_masked_range, CTLFLAG_RW,
     &ipi_masked_range, 0, "");
 SYSCTL_UINT(_debug_xhits, OID_AUTO, ipi_masked_range_size, CTLFLAG_RW,
     &ipi_masked_range_size, 0, "");
 #endif /* COUNT_XINVLTLB_HITS */
 
 /*
+ * Send an IPI to the specified CPU, handling the bitmap logic.
+ */
+static void
+ipi_send_cpu(int cpu, u_int ipi)
+{
+	u_int bitmap, old_pending, new_pending;
+
+	KASSERT(cpu_apic_ids[cpu] != -1, ("IPI to non-existent CPU %d", cpu));
+
+	if (IPI_IS_BITMAPED(ipi)) {
+		bitmap = 1 << ipi;
+		ipi = IPI_BITMAP_VECTOR;
+		do {
+			old_pending = cpu_ipi_pending[cpu];
+			new_pending = old_pending | bitmap;
+		} while (!atomic_cmpset_int(&cpu_ipi_pending[cpu],
+		    old_pending, new_pending));
+		if (old_pending)
+			return;
+	}
+	lapic_ipi_vectored(ipi, cpu_apic_ids[cpu]);
+}
+
+/*
  * Flush the TLB on all other CPU's
  */
 static void
 smp_tlb_shootdown(u_int vector, vm_offset_t addr1, vm_offset_t addr2)
 {
 	u_int ncpu;
 
 	ncpu = mp_ncpus - 1;	/* does not shootdown self */
 	if (ncpu < 1)
 		return;		/* no other cpus */
 	if (!(read_rflags() & PSL_I))
 		panic("%s: interrupts disabled", __func__);
 	mtx_lock_spin(&smp_ipi_mtx);
 	smp_tlb_addr1 = addr1;
 	smp_tlb_addr2 = addr2;
 	atomic_store_rel_int(&smp_tlb_wait, 0);
 	ipi_all_but_self(vector);
 	while (smp_tlb_wait < ncpu)
 		ia32_pause();
 	mtx_unlock_spin(&smp_ipi_mtx);
 }
 
 static void
-smp_targeted_tlb_shootdown(cpumask_t mask, u_int vector, vm_offset_t addr1, vm_offset_t addr2)
+smp_targeted_tlb_shootdown(cpuset_t mask, u_int vector, vm_offset_t addr1, vm_offset_t addr2)
 {
-	int ncpu, othercpus;
+	int cpu, ncpu, othercpus;
 
 	othercpus = mp_ncpus - 1;
-	if (mask == (cpumask_t)-1) {
-		ncpu = othercpus;
-		if (ncpu < 1)
+	if (CPU_ISFULLSET(&mask)) {
+		if (othercpus < 1)
 			return;
 	} else {
-		mask &= ~PCPU_GET(cpumask);
-		if (mask == 0)
+		sched_pin();
+		CPU_NAND(&mask, PCPU_PTR(cpumask));
+		sched_unpin();
+		if (CPU_EMPTY(&mask))
 			return;
-		ncpu = bitcount32(mask);
-		if (ncpu > othercpus) {
-			/* XXX this should be a panic offence */
-			printf("SMP: tlb shootdown to %d other cpus (only have %d)\n",
-			    ncpu, othercpus);
-			ncpu = othercpus;
-		}
-		/* XXX should be a panic, implied by mask == 0 above */
-		if (ncpu < 1)
-			return;
 	}
 	if (!(read_rflags() & PSL_I))
 		panic("%s: interrupts disabled", __func__);
 	mtx_lock_spin(&smp_ipi_mtx);
 	smp_tlb_addr1 = addr1;
 	smp_tlb_addr2 = addr2;
 	atomic_store_rel_int(&smp_tlb_wait, 0);
-	if (mask == (cpumask_t)-1)
+	if (CPU_ISFULLSET(&mask)) {
+		ncpu = othercpus;
 		ipi_all_but_self(vector);
-	else
-		ipi_selected(mask, vector);
+	} else {
+		ncpu = 0;
+		while ((cpu = cpusetobj_ffs(&mask)) != 0) {
+			cpu--;
+			CPU_CLR(cpu, &mask);
+			CTR3(KTR_SMP, "%s: cpu: %d ipi: %x", __func__,
+			    cpu, vector);
+			ipi_send_cpu(cpu, vector);
+			ncpu++;
+		}
+	}
 	while (smp_tlb_wait < ncpu)
 		ia32_pause();
 	mtx_unlock_spin(&smp_ipi_mtx);
 }
 
-/*
- * Send an IPI to specified CPU handling the bitmap logic.
- */
-static void
-ipi_send_cpu(int cpu, u_int ipi)
-{
-	u_int bitmap, old_pending, new_pending;
-
-	KASSERT(cpu_apic_ids[cpu] != -1, ("IPI to non-existent CPU %d", cpu));
-
-	if (IPI_IS_BITMAPED(ipi)) {
-		bitmap = 1 << ipi;
-		ipi = IPI_BITMAP_VECTOR;
-		do {
-			old_pending = cpu_ipi_pending[cpu];
-			new_pending = old_pending | bitmap;
-		} while  (!atomic_cmpset_int(&cpu_ipi_pending[cpu],
-		    old_pending, new_pending)); 
-		if (old_pending)
-			return;
-	}
-	lapic_ipi_vectored(ipi, cpu_apic_ids[cpu]);
-}
-
 void
 smp_cache_flush(void)
 {
 
 	if (smp_started)
 		smp_tlb_shootdown(IPI_INVLCACHE, 0, 0);
 }
 
 void
 smp_invltlb(void)
 {
 
 	if (smp_started) {
 		smp_tlb_shootdown(IPI_INVLTLB, 0, 0);
 #ifdef COUNT_XINVLTLB_HITS
 		ipi_global++;
 #endif
 	}
 }
 
 void
 smp_invlpg(vm_offset_t addr)
 {
 
 	if (smp_started) {
 		smp_tlb_shootdown(IPI_INVLPG, addr, 0);
 #ifdef COUNT_XINVLTLB_HITS
 		ipi_page++;
 #endif
 	}
 }
 
 void
 smp_invlpg_range(vm_offset_t addr1, vm_offset_t addr2)
 {
 
 	if (smp_started) {
 		smp_tlb_shootdown(IPI_INVLRNG, addr1, addr2);
 #ifdef COUNT_XINVLTLB_HITS
 		ipi_range++;
 		ipi_range_size += (addr2 - addr1) / PAGE_SIZE;
 #endif
 	}
 }
 
 void
-smp_masked_invltlb(cpumask_t mask)
+smp_masked_invltlb(cpuset_t mask)
 {
 
 	if (smp_started) {
 		smp_targeted_tlb_shootdown(mask, IPI_INVLTLB, 0, 0);
 #ifdef COUNT_XINVLTLB_HITS
 		ipi_masked_global++;
 #endif
 	}
 }
 
 void
-smp_masked_invlpg(cpumask_t mask, vm_offset_t addr)
+smp_masked_invlpg(cpuset_t mask, vm_offset_t addr)
 {
 
 	if (smp_started) {
 		smp_targeted_tlb_shootdown(mask, IPI_INVLPG, addr, 0);
 #ifdef COUNT_XINVLTLB_HITS
 		ipi_masked_page++;
 #endif
 	}
 }
 
 void
-smp_masked_invlpg_range(cpumask_t mask, vm_offset_t addr1, vm_offset_t addr2)
+smp_masked_invlpg_range(cpuset_t mask, vm_offset_t addr1, vm_offset_t addr2)
 {
 
 	if (smp_started) {
 		smp_targeted_tlb_shootdown(mask, IPI_INVLRNG, addr1, addr2);
 #ifdef COUNT_XINVLTLB_HITS
 		ipi_masked_range++;
 		ipi_masked_range_size += (addr2 - addr1) / PAGE_SIZE;
 #endif
 	}
 }
 
 void
 ipi_bitmap_handler(struct trapframe frame)
 {
 	struct trapframe *oldframe;
 	struct thread *td;
 	int cpu = PCPU_GET(cpuid);
 	u_int ipi_bitmap;
 
 	critical_enter();
 	td = curthread;
 	td->td_intr_nesting_level++;
 	oldframe = td->td_intr_frame;
 	td->td_intr_frame = &frame;
 	ipi_bitmap = atomic_readandclear_int(&cpu_ipi_pending[cpu]);
 	if (ipi_bitmap & (1 << IPI_PREEMPT)) {
 #ifdef COUNT_IPIS
 		(*ipi_preempt_counts[cpu])++;
 #endif
 		sched_preempt(td);
 	}
 	if (ipi_bitmap & (1 << IPI_AST)) {
 #ifdef COUNT_IPIS
 		(*ipi_ast_counts[cpu])++;
 #endif
 		/* Nothing to do for AST */
 	}
 	if (ipi_bitmap & (1 << IPI_HARDCLOCK)) {
 #ifdef COUNT_IPIS
 		(*ipi_hardclock_counts[cpu])++;
 #endif
 		hardclockintr();
 	}
 	td->td_intr_frame = oldframe;
 	td->td_intr_nesting_level--;
 	critical_exit();
 }
 
 /*
  * send an IPI to a set of cpus.
  */
 void
-ipi_selected(cpumask_t cpus, u_int ipi)
+ipi_selected(cpuset_t cpus, u_int ipi)
 {
 	int cpu;
 
 	/*
 	 * IPI_STOP_HARD maps to a NMI and the trap handler needs a bit
 	 * of help in order to understand what is the source.
 	 * Set the mask of receiving CPUs for this purpose.
 	 */
 	if (ipi == IPI_STOP_HARD)
-		atomic_set_int(&ipi_nmi_pending, cpus);
+		CPU_OR_ATOMIC(&ipi_nmi_pending, &cpus);
 
-	CTR3(KTR_SMP, "%s: cpus: %x ipi: %x", __func__, cpus, ipi);
-	while ((cpu = ffs(cpus)) != 0) {
+	while ((cpu = cpusetobj_ffs(&cpus)) != 0) {
 		cpu--;
-		cpus &= ~(1 << cpu);
+		CPU_CLR(cpu, &cpus);
+		CTR3(KTR_SMP, "%s: cpu: %d ipi: %x", __func__, cpu, ipi);
 		ipi_send_cpu(cpu, ipi);
 	}
 }
 
 /*
  * send an IPI to a specific CPU.
  */
 void
 ipi_cpu(int cpu, u_int ipi)
 {
 
 	/*
 	 * IPI_STOP_HARD maps to a NMI and the trap handler needs a bit
 	 * of help in order to understand what is the source.
 	 * Set the mask of receiving CPUs for this purpose.
 	 */
 	if (ipi == IPI_STOP_HARD)
-		atomic_set_int(&ipi_nmi_pending, 1 << cpu);
+		CPU_SET_ATOMIC(cpu, &ipi_nmi_pending);
 
 	CTR3(KTR_SMP, "%s: cpu: %d ipi: %x", __func__, cpu, ipi);
 	ipi_send_cpu(cpu, ipi);
 }
 
 /*
  * send an IPI to all CPUs EXCEPT myself
  */
 void
 ipi_all_but_self(u_int ipi)
 {
 
+	sched_pin();
 	if (IPI_IS_BITMAPED(ipi)) {
 		ipi_selected(PCPU_GET(other_cpus), ipi);
+		sched_unpin();
 		return;
 	}
 
 	/*
 	 * IPI_STOP_HARD maps to a NMI and the trap handler needs a bit
 	 * of help in order to understand what is the source.
 	 * Set the mask of receiving CPUs for this purpose.
 	 */
 	if (ipi == IPI_STOP_HARD)
-		atomic_set_int(&ipi_nmi_pending, PCPU_GET(other_cpus));
+		CPU_OR_ATOMIC(&ipi_nmi_pending, PCPU_PTR(other_cpus));
+	sched_unpin();
 
 	CTR2(KTR_SMP, "%s: ipi: %x", __func__, ipi);
 	lapic_ipi_vectored(ipi, APIC_IPI_DEST_OTHERS);
 }
 
 int
 ipi_nmi_handler()
 {
-	cpumask_t cpumask;
+	cpuset_t cpumask;
 
 	/*
 	 * As long as there is not a simple way to know about a NMI's
 	 * source, if the bitmask for the current CPU is present in
 	 * the global pending bitword an IPI_STOP_HARD has been issued
 	 * and should be handled.
 	 */
+	sched_pin();
 	cpumask = PCPU_GET(cpumask);
-	if ((ipi_nmi_pending & cpumask) == 0)
+	sched_unpin();
+	if (!CPU_OVERLAP(&ipi_nmi_pending, &cpumask))
 		return (1);
 
-	atomic_clear_int(&ipi_nmi_pending, cpumask);
+	CPU_NAND_ATOMIC(&ipi_nmi_pending, &cpumask);
 	cpustop_handler();
 	return (0);
 }
      
 /*
  * Handle an IPI_STOP by saving our current context and spinning until we
  * are resumed.
  */
 void
 cpustop_handler(void)
 {
-	cpumask_t cpumask;
+	cpuset_t cpumask;
 	u_int cpu;
 
+	sched_pin();
 	cpu = PCPU_GET(cpuid);
 	cpumask = PCPU_GET(cpumask);
+	sched_unpin();
 
 	savectx(&stoppcbs[cpu]);
 
 	/* Indicate that we are stopped */
-	atomic_set_int(&stopped_cpus, cpumask);
+	CPU_OR_ATOMIC(&stopped_cpus, &cpumask);
 
 	/* Wait for restart */
-	while (!(started_cpus & cpumask))
+	while (!CPU_OVERLAP(&started_cpus, &cpumask))
 	    ia32_pause();
 
-	atomic_clear_int(&started_cpus, cpumask);
-	atomic_clear_int(&stopped_cpus, cpumask);
+	CPU_NAND_ATOMIC(&started_cpus, &cpumask);
+	CPU_NAND_ATOMIC(&stopped_cpus, &cpumask);
 
 	if (cpu == 0 && cpustop_restartfunc != NULL) {
 		cpustop_restartfunc();
 		cpustop_restartfunc = NULL;
 	}
 }
 
 /*
  * Handle an IPI_SUSPEND by saving our current context and spinning until we
  * are resumed.
  */
 void
 cpususpend_handler(void)
 {
-	cpumask_t cpumask;
+	cpuset_t cpumask;
 	register_t cr3, rf;
 	u_int cpu;
 
 	cpu = PCPU_GET(cpuid);
 	cpumask = PCPU_GET(cpumask);
 
 	rf = intr_disable();
 	cr3 = rcr3();
 
 	if (savectx(susppcbs[cpu])) {
 		wbinvd();
-		atomic_set_int(&stopped_cpus, cpumask);
+		CPU_OR_ATOMIC(&stopped_cpus, &cpumask);
 	} else {
 		pmap_init_pat();
 		PCPU_SET(switchtime, 0);
 		PCPU_SET(switchticks, ticks);
 	}
 
 	/* Wait for resume */
-	while (!(started_cpus & cpumask))
+	while (!CPU_OVERLAP(&started_cpus, &cpumask))
 		ia32_pause();
 
-	atomic_clear_int(&started_cpus, cpumask);
-	atomic_clear_int(&stopped_cpus, cpumask);
+	CPU_NAND_ATOMIC(&started_cpus, &cpumask);
+	CPU_NAND_ATOMIC(&stopped_cpus, &cpumask);
 
 	/* Restore CR3 and enable interrupts */
 	load_cr3(cr3);
 	mca_resume();
 	lapic_setup(0);
 	intr_restore(rf);
 }
 
 /*
  * This is called once the rest of the system is up and running and we're
  * ready to let the AP's out of the pen.
  */
 static void
 release_aps(void *dummy __unused)
 {
 
 	if (mp_ncpus == 1) 
 		return;
 	atomic_store_rel_int(&aps_ready, 1);
 	while (smp_started == 0)
 		ia32_pause();
 }
 SYSINIT(start_aps, SI_SUB_SMP, SI_ORDER_FIRST, release_aps, NULL);
 
 static int
 sysctl_hlt_cpus(SYSCTL_HANDLER_ARGS)
 {
-	cpumask_t mask;
+	cpuset_t mask;
 	int error;
 
 	mask = hlt_cpus_mask;
-	error = sysctl_handle_int(oidp, &mask, 0, req);
+	error = sysctl_handle_opaque(oidp, &mask, sizeof(mask), req);
 	if (error || !req->newptr)
 		return (error);
 
-	if (logical_cpus_mask != 0 &&
-	    (mask & logical_cpus_mask) == logical_cpus_mask)
+	if (!CPU_EMPTY(&logical_cpus_mask) &&
+	    CPU_SUBSET(&mask, &logical_cpus_mask))
 		hlt_logical_cpus = 1;
 	else
 		hlt_logical_cpus = 0;
 
 	if (! hyperthreading_allowed)
-		mask |= hyperthreading_cpus_mask;
+		CPU_OR(&mask, &hyperthreading_cpus_mask);
 
-	if ((mask & all_cpus) == all_cpus)
-		mask &= ~(1<<0);
+	if (CPU_SUBSET(&mask, &all_cpus))
+		CPU_CLR(0, &mask);
 	hlt_cpus_mask = mask;
 	return (error);
 }
-SYSCTL_PROC(_machdep, OID_AUTO, hlt_cpus, CTLTYPE_INT|CTLFLAG_RW,
-    0, 0, sysctl_hlt_cpus, "IU",
+SYSCTL_PROC(_machdep, OID_AUTO, hlt_cpus,
+    CTLTYPE_STRUCT | CTLFLAG_RW | CTLFLAG_MPSAFE, 0, 0, sysctl_hlt_cpus, "S",
     "Bitmap of CPUs to halt.  101 (binary) will halt CPUs 0 and 2.");
 
 static int
 sysctl_hlt_logical_cpus(SYSCTL_HANDLER_ARGS)
 {
 	int disable, error;
 
 	disable = hlt_logical_cpus;
 	error = sysctl_handle_int(oidp, &disable, 0, req);
 	if (error || !req->newptr)
 		return (error);
 
 	if (disable)
-		hlt_cpus_mask |= logical_cpus_mask;
+		CPU_OR(&hlt_cpus_mask, &logical_cpus_mask);
 	else
-		hlt_cpus_mask &= ~logical_cpus_mask;
+		CPU_NAND(&hlt_cpus_mask, &logical_cpus_mask);
 
 	if (! hyperthreading_allowed)
-		hlt_cpus_mask |= hyperthreading_cpus_mask;
+		CPU_OR(&hlt_cpus_mask, &hyperthreading_cpus_mask);
 
-	if ((hlt_cpus_mask & all_cpus) == all_cpus)
-		hlt_cpus_mask &= ~(1<<0);
+	if (CPU_SUBSET(&hlt_cpus_mask, &all_cpus))
+		CPU_CLR(0, &hlt_cpus_mask);
 
 	hlt_logical_cpus = disable;
 	return (error);
 }
 
 static int
 sysctl_hyperthreading_allowed(SYSCTL_HANDLER_ARGS)
 {
 	int allowed, error;
 
 	allowed = hyperthreading_allowed;
 	error = sysctl_handle_int(oidp, &allowed, 0, req);
 	if (error || !req->newptr)
 		return (error);
 
 #ifdef SCHED_ULE
 	/*
 	 * SCHED_ULE doesn't allow enabling/disabling HT cores at
 	 * run-time.
 	 */
 	if (allowed != hyperthreading_allowed)
 		return (ENOTSUP);
 	return (error);
 #endif
 
 	if (allowed)
-		hlt_cpus_mask &= ~hyperthreading_cpus_mask;
+		CPU_NAND(&hlt_cpus_mask, &hyperthreading_cpus_mask);
 	else
-		hlt_cpus_mask |= hyperthreading_cpus_mask;
+		CPU_OR(&hlt_cpus_mask, &hyperthreading_cpus_mask);
 
-	if (logical_cpus_mask != 0 &&
-	    (hlt_cpus_mask & logical_cpus_mask) == logical_cpus_mask)
+	if (!CPU_EMPTY(&logical_cpus_mask) &&
+	    CPU_SUBSET(&hlt_cpus_mask, &logical_cpus_mask))
 		hlt_logical_cpus = 1;
 	else
 		hlt_logical_cpus = 0;
 
-	if ((hlt_cpus_mask & all_cpus) == all_cpus)
-		hlt_cpus_mask &= ~(1<<0);
+	if (CPU_SUBSET(&hlt_cpus_mask, &all_cpus))
+		CPU_CLR(0, &hlt_cpus_mask);
 
 	hyperthreading_allowed = allowed;
 	return (error);
 }
 
 static void
 cpu_hlt_setup(void *dummy __unused)
 {
 
-	if (logical_cpus_mask != 0) {
+	if (!CPU_EMPTY(&logical_cpus_mask)) {
 		TUNABLE_INT_FETCH("machdep.hlt_logical_cpus",
 		    &hlt_logical_cpus);
 		sysctl_ctx_init(&logical_cpu_clist);
 		SYSCTL_ADD_PROC(&logical_cpu_clist,
 		    SYSCTL_STATIC_CHILDREN(_machdep), OID_AUTO,
 		    "hlt_logical_cpus", CTLTYPE_INT|CTLFLAG_RW, 0, 0,
 		    sysctl_hlt_logical_cpus, "IU", "");
 		SYSCTL_ADD_UINT(&logical_cpu_clist,
 		    SYSCTL_STATIC_CHILDREN(_machdep), OID_AUTO,
 		    "logical_cpus_mask", CTLTYPE_INT|CTLFLAG_RD,
 		    &logical_cpus_mask, 0, "");
 
 		if (hlt_logical_cpus)
-			hlt_cpus_mask |= logical_cpus_mask;
+			CPU_OR(&hlt_cpus_mask, &logical_cpus_mask);
 
 		/*
 		 * If necessary for security purposes, force
 		 * hyperthreading off, regardless of the value
 		 * of hlt_logical_cpus.
 		 */
-		if (hyperthreading_cpus_mask) {
+		if (!CPU_EMPTY(&hyperthreading_cpus_mask)) {
 			SYSCTL_ADD_PROC(&logical_cpu_clist,
 			    SYSCTL_STATIC_CHILDREN(_machdep), OID_AUTO,
 			    "hyperthreading_allowed", CTLTYPE_INT|CTLFLAG_RW,
 			    0, 0, sysctl_hyperthreading_allowed, "IU", "");
 			if (! hyperthreading_allowed)
-				hlt_cpus_mask |= hyperthreading_cpus_mask;
+				CPU_OR(&hlt_cpus_mask,
+				    &hyperthreading_cpus_mask);
 		}
 	}
 }
 SYSINIT(cpu_hlt, SI_SUB_SMP, SI_ORDER_ANY, cpu_hlt_setup, NULL);
 
 int
 mp_grab_cpu_hlt(void)
 {
-	cpumask_t mask;
+	cpuset_t mask;
 #ifdef MP_WATCHDOG
 	u_int cpuid;
 #endif
 	int retval;
 
 	mask = PCPU_GET(cpumask);
 #ifdef MP_WATCHDOG
 	cpuid = PCPU_GET(cpuid);
 	ap_watchdog(cpuid);
 #endif
 
 	retval = 0;
-	while (mask & hlt_cpus_mask) {
+	while (CPU_OVERLAP(&mask, &hlt_cpus_mask)) {
 		retval = 1;
 		__asm __volatile("sti; hlt" : : : "memory");
 	}
 	return (retval);
 }
 
 #ifdef COUNT_IPIS
 /*
  * Setup interrupt counters for IPI handlers.
  */
 static void
 mp_ipi_intrcnt(void *dummy)
 {
 	char buf[64];
 	int i;
 
 	CPU_FOREACH(i) {
 		snprintf(buf, sizeof(buf), "cpu%d:invltlb", i);
 		intrcnt_add(buf, &ipi_invltlb_counts[i]);
 		snprintf(buf, sizeof(buf), "cpu%d:invlrng", i);
 		intrcnt_add(buf, &ipi_invlrng_counts[i]);
 		snprintf(buf, sizeof(buf), "cpu%d:invlpg", i);
 		intrcnt_add(buf, &ipi_invlpg_counts[i]);
 		snprintf(buf, sizeof(buf), "cpu%d:preempt", i);
 		intrcnt_add(buf, &ipi_preempt_counts[i]);
 		snprintf(buf, sizeof(buf), "cpu%d:ast", i);
 		intrcnt_add(buf, &ipi_ast_counts[i]);
 		snprintf(buf, sizeof(buf), "cpu%d:rendezvous", i);
 		intrcnt_add(buf, &ipi_rendezvous_counts[i]);
 		snprintf(buf, sizeof(buf), "cpu%d:hardclock", i);
 		intrcnt_add(buf, &ipi_hardclock_counts[i]);
 	}
 }
 SYSINIT(mp_ipi_intrcnt, SI_SUB_INTR, SI_ORDER_MIDDLE, mp_ipi_intrcnt, NULL);
 #endif
 
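Reviewer note on the mp_machdep.c changes above. The recurring pattern in this file is that a CPU mask which used to fit in one integer (and was walked with ffs() and a shift) becomes a multi-word set that is walked with a find-first-set helper and a per-bit clear, with the current CPU removed by an and-not step. The following is only a userland sketch of that pattern, written to make the loop shape easy to test outside the kernel. Every name in it (toy_cpuset_t, toy_ffs, toy_nand, toy_visit_others, TOY_MAXCPU) is hypothetical and is not the sys/cpuset.h interface; the word size and the 128-CPU limit are assumptions chosen for illustration.

```c
#include <assert.h>
#include <limits.h>
#include <string.h>

/* Hypothetical userland model of a multi-word CPU set; not sys/cpuset.h. */
#define	TOY_MAXCPU	128
#define	TOY_WORD_BITS	((int)(sizeof(unsigned long) * CHAR_BIT))
#define	TOY_NWORDS	((TOY_MAXCPU + TOY_WORD_BITS - 1) / TOY_WORD_BITS)

typedef struct {
	unsigned long bits[TOY_NWORDS];
} toy_cpuset_t;

int	toy_visit_others(toy_cpuset_t targets, int self, int *out, int max);

static void
toy_zero(toy_cpuset_t *s)
{
	memset(s, 0, sizeof(*s));
}

static void
toy_set(int cpu, toy_cpuset_t *s)
{
	assert(cpu >= 0 && cpu < TOY_MAXCPU);
	s->bits[cpu / TOY_WORD_BITS] |= 1UL << (cpu % TOY_WORD_BITS);
}

static void
toy_clr(int cpu, toy_cpuset_t *s)
{
	assert(cpu >= 0 && cpu < TOY_MAXCPU);
	s->bits[cpu / TOY_WORD_BITS] &= ~(1UL << (cpu % TOY_WORD_BITS));
}

static int
toy_isset(int cpu, const toy_cpuset_t *s)
{
	return (((s->bits[cpu / TOY_WORD_BITS] >> (cpu % TOY_WORD_BITS)) &
	    1UL) != 0);
}

/* d &= ~s, one word at a time (the and-not step). */
static void
toy_nand(toy_cpuset_t *d, const toy_cpuset_t *s)
{
	int i;

	for (i = 0; i < TOY_NWORDS; i++)
		d->bits[i] &= ~s->bits[i];
}

/* 1-based index of the lowest set bit, or 0 if the set is empty. */
static int
toy_ffs(const toy_cpuset_t *s)
{
	int b, i;

	for (i = 0; i < TOY_NWORDS; i++) {
		if (s->bits[i] == 0)
			continue;
		for (b = 0; b < TOY_WORD_BITS; b++)
			if (s->bits[i] & (1UL << b))
				return (i * TOY_WORD_BITS + b + 1);
	}
	return (0);
}

/*
 * Drop 'self' from a by-value copy of the set, then visit the remaining
 * CPUs in ascending order, recording each in out[].  Returns the count.
 */
int
toy_visit_others(toy_cpuset_t targets, int self, int *out, int max)
{
	toy_cpuset_t me;
	int cpu, n;

	toy_zero(&me);
	toy_set(self, &me);
	toy_nand(&targets, &me);
	assert(!toy_isset(self, &targets));

	n = 0;
	while ((cpu = toy_ffs(&targets)) != 0) {
		cpu--;			/* convert to a 0-based CPU id */
		toy_clr(cpu, &targets);
		assert(n < max);
		out[n++] = cpu;
	}
	return (n);
}
```

Because the set is passed by value, the loop consumes its own copy and the caller's set is left intact, which mirrors how the mask arguments are used in the functions above. Counting visited CPUs inside the loop is also what lets a caller wait for exactly that many acknowledgements without a separate population count.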
Index: head/sys/amd64/amd64/pmap.c
===================================================================
--- head/sys/amd64/amd64/pmap.c	(revision 222812)
+++ head/sys/amd64/amd64/pmap.c	(revision 222813)
@@ -1,5130 +1,5144 @@
 /*-
  * Copyright (c) 1991 Regents of the University of California.
  * All rights reserved.
  * Copyright (c) 1994 John S. Dyson
  * All rights reserved.
  * Copyright (c) 1994 David Greenman
  * All rights reserved.
  * Copyright (c) 2003 Peter Wemm
  * All rights reserved.
  * Copyright (c) 2005-2010 Alan L. Cox <alc@cs.rice.edu>
  * All rights reserved.
  *
  * This code is derived from software contributed to Berkeley by
  * the Systems Programming Group of the University of Utah Computer
  * Science Department and William Jolitz of UUNET Technologies Inc.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
  * 1. Redistributions of source code must retain the above copyright
  *    notice, this list of conditions and the following disclaimer.
  * 2. Redistributions in binary form must reproduce the above copyright
  *    notice, this list of conditions and the following disclaimer in the
  *    documentation and/or other materials provided with the distribution.
  * 3. All advertising materials mentioning features or use of this software
  *    must display the following acknowledgement:
  *	This product includes software developed by the University of
  *	California, Berkeley and its contributors.
  * 4. Neither the name of the University nor the names of its contributors
  *    may be used to endorse or promote products derived from this software
  *    without specific prior written permission.
  *
  * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
  * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
  * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
  * ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
  * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
  * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
  * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
  * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
  * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
  * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
  * SUCH DAMAGE.
  *
  *	from:	@(#)pmap.c	7.7 (Berkeley)	5/12/91
  */
 /*-
  * Copyright (c) 2003 Networks Associates Technology, Inc.
  * All rights reserved.
  *
  * This software was developed for the FreeBSD Project by Jake Burkholder,
  * Safeport Network Services, and Network Associates Laboratories, the
  * Security Research Division of Network Associates, Inc. under
  * DARPA/SPAWAR contract N66001-01-C-8035 ("CBOSS"), as part of the DARPA
  * CHATS research program.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
  * 1. Redistributions of source code must retain the above copyright
  *    notice, this list of conditions and the following disclaimer.
  * 2. Redistributions in binary form must reproduce the above copyright
  *    notice, this list of conditions and the following disclaimer in the
  *    documentation and/or other materials provided with the distribution.
  *
  * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
  * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
  * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
  * ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
  * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
  * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
  * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
  * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
  * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
  * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
  * SUCH DAMAGE.
  */
 
 #include <sys/cdefs.h>
 __FBSDID("$FreeBSD$");
 
 /*
  *	Manages physical address maps.
  *
  *	In addition to hardware address maps, this
  *	module is called upon to provide software-use-only
  *	maps which may or may not be stored in the same
  *	form as hardware maps.  These pseudo-maps are
  *	used to store intermediate results from copy
  *	operations to and from address spaces.
  *
  *	Since the information managed by this module is
  *	also stored by the logical address mapping module,
  *	this module may throw away valid virtual-to-physical
  *	mappings at almost any time.  However, invalidations
  *	of virtual-to-physical mappings must be done as
  *	requested.
  *
  *	In order to cope with hardware architectures which
  *	make virtual-to-physical map invalidates expensive,
  *	this module may delay invalidate or reduced protection
  *	operations until such time as they are actually
  *	necessary.  This module is given full information as
  *	to which processors are currently using which maps,
  *	and to when physical maps must be made correct.
  */
 
 #include "opt_pmap.h"
 #include "opt_vm.h"
 
 #include <sys/param.h>
 #include <sys/systm.h>
 #include <sys/kernel.h>
 #include <sys/ktr.h>
 #include <sys/lock.h>
 #include <sys/malloc.h>
 #include <sys/mman.h>
 #include <sys/mutex.h>
 #include <sys/proc.h>
 #include <sys/sx.h>
 #include <sys/vmmeter.h>
 #include <sys/sched.h>
 #include <sys/sysctl.h>
 #ifdef SMP
 #include <sys/smp.h>
+#else
+#include <sys/cpuset.h>
 #endif
 
 #include <vm/vm.h>
 #include <vm/vm_param.h>
 #include <vm/vm_kern.h>
 #include <vm/vm_page.h>
 #include <vm/vm_map.h>
 #include <vm/vm_object.h>
 #include <vm/vm_extern.h>
 #include <vm/vm_pageout.h>
 #include <vm/vm_pager.h>
 #include <vm/vm_reserv.h>
 #include <vm/uma.h>
 
 #include <machine/cpu.h>
 #include <machine/cputypes.h>
 #include <machine/md_var.h>
 #include <machine/pcb.h>
 #include <machine/specialreg.h>
 #ifdef SMP
 #include <machine/smp.h>
 #endif
 
 #ifndef PMAP_SHPGPERPROC
 #define PMAP_SHPGPERPROC 200
 #endif
 
 #if !defined(DIAGNOSTIC)
 #ifdef __GNUC_GNU_INLINE__
 #define PMAP_INLINE	__attribute__((__gnu_inline__)) inline
 #else
 #define PMAP_INLINE	extern inline
 #endif
 #else
 #define PMAP_INLINE
 #endif
 
 #define PV_STATS
 #ifdef PV_STATS
 #define PV_STAT(x)	do { x ; } while (0)
 #else
 #define PV_STAT(x)	do { } while (0)
 #endif
 
 #define	pa_index(pa)	((pa) >> PDRSHIFT)
 #define	pa_to_pvh(pa)	(&pv_table[pa_index(pa)])
 
 struct pmap kernel_pmap_store;
 
 vm_offset_t virtual_avail;	/* VA of first avail page (after kernel bss) */
 vm_offset_t virtual_end;	/* VA of last avail page (end of kernel AS) */
 
 static int ndmpdp;
 static vm_paddr_t dmaplimit;
 vm_offset_t kernel_vm_end = VM_MIN_KERNEL_ADDRESS;
 pt_entry_t pg_nx;
 
 SYSCTL_NODE(_vm, OID_AUTO, pmap, CTLFLAG_RD, 0, "VM/pmap parameters");
 
 static int pat_works = 1;
 SYSCTL_INT(_vm_pmap, OID_AUTO, pat_works, CTLFLAG_RD, &pat_works, 1,
     "Is page attribute table fully functional?");
 
 static int pg_ps_enabled = 1;
 SYSCTL_INT(_vm_pmap, OID_AUTO, pg_ps_enabled, CTLFLAG_RDTUN, &pg_ps_enabled, 0,
     "Are large page mappings enabled?");
 
 #define	PAT_INDEX_SIZE	8
 static int pat_index[PAT_INDEX_SIZE];	/* cache mode to PAT index conversion */
 
 static u_int64_t	KPTphys;	/* phys addr of kernel level 1 */
 static u_int64_t	KPDphys;	/* phys addr of kernel level 2 */
 u_int64_t		KPDPphys;	/* phys addr of kernel level 3 */
 u_int64_t		KPML4phys;	/* phys addr of kernel level 4 */
 
 static u_int64_t	DMPDphys;	/* phys addr of direct mapped level 2 */
 static u_int64_t	DMPDPphys;	/* phys addr of direct mapped level 3 */
 
 /*
  * Data for the pv entry allocation mechanism
  */
 static int pv_entry_count = 0, pv_entry_max = 0, pv_entry_high_water = 0;
 static struct md_page *pv_table;
 static int shpgperproc = PMAP_SHPGPERPROC;
 
 /*
  * All those kernel PT submaps that BSD is so fond of
  */
 pt_entry_t *CMAP1 = 0;
 caddr_t CADDR1 = 0;
 
 /*
  * Crashdump maps.
  */
 static caddr_t crashdumpmap;
 
 static void	free_pv_entry(pmap_t pmap, pv_entry_t pv);
 static pv_entry_t get_pv_entry(pmap_t locked_pmap, int try);
 static void	pmap_pv_demote_pde(pmap_t pmap, vm_offset_t va, vm_paddr_t pa);
 static boolean_t pmap_pv_insert_pde(pmap_t pmap, vm_offset_t va, vm_paddr_t pa);
 static void	pmap_pv_promote_pde(pmap_t pmap, vm_offset_t va, vm_paddr_t pa);
 static void	pmap_pvh_free(struct md_page *pvh, pmap_t pmap, vm_offset_t va);
 static pv_entry_t pmap_pvh_remove(struct md_page *pvh, pmap_t pmap,
 		    vm_offset_t va);
 static int	pmap_pvh_wired_mappings(struct md_page *pvh, int count);
 
 static int pmap_change_attr_locked(vm_offset_t va, vm_size_t size, int mode);
 static boolean_t pmap_demote_pde(pmap_t pmap, pd_entry_t *pde, vm_offset_t va);
 static boolean_t pmap_demote_pdpe(pmap_t pmap, pdp_entry_t *pdpe,
     vm_offset_t va);
 static boolean_t pmap_enter_pde(pmap_t pmap, vm_offset_t va, vm_page_t m,
     vm_prot_t prot);
 static vm_page_t pmap_enter_quick_locked(pmap_t pmap, vm_offset_t va,
     vm_page_t m, vm_prot_t prot, vm_page_t mpte);
 static void pmap_fill_ptp(pt_entry_t *firstpte, pt_entry_t newpte);
 static void pmap_insert_pt_page(pmap_t pmap, vm_page_t mpte);
 static boolean_t pmap_is_modified_pvh(struct md_page *pvh);
 static boolean_t pmap_is_referenced_pvh(struct md_page *pvh);
 static void pmap_kenter_attr(vm_offset_t va, vm_paddr_t pa, int mode);
 static vm_page_t pmap_lookup_pt_page(pmap_t pmap, vm_offset_t va);
 static void pmap_pde_attr(pd_entry_t *pde, int cache_bits);
 static void pmap_promote_pde(pmap_t pmap, pd_entry_t *pde, vm_offset_t va);
 static boolean_t pmap_protect_pde(pmap_t pmap, pd_entry_t *pde, vm_offset_t sva,
     vm_prot_t prot);
 static void pmap_pte_attr(pt_entry_t *pte, int cache_bits);
 static int pmap_remove_pde(pmap_t pmap, pd_entry_t *pdq, vm_offset_t sva,
 		vm_page_t *free);
 static int pmap_remove_pte(pmap_t pmap, pt_entry_t *ptq,
 		vm_offset_t sva, pd_entry_t ptepde, vm_page_t *free);
 static void pmap_remove_pt_page(pmap_t pmap, vm_page_t mpte);
 static void pmap_remove_page(pmap_t pmap, vm_offset_t va, pd_entry_t *pde,
     vm_page_t *free);
 static void pmap_remove_entry(struct pmap *pmap, vm_page_t m,
 		vm_offset_t va);
 static void pmap_insert_entry(pmap_t pmap, vm_offset_t va, vm_page_t m);
 static boolean_t pmap_try_insert_pv_entry(pmap_t pmap, vm_offset_t va,
     vm_page_t m);
 static void pmap_update_pde(pmap_t pmap, vm_offset_t va, pd_entry_t *pde,
     pd_entry_t newpde);
 static void pmap_update_pde_invalidate(vm_offset_t va, pd_entry_t newpde);
 
 static vm_page_t pmap_allocpde(pmap_t pmap, vm_offset_t va, int flags);
 static vm_page_t pmap_allocpte(pmap_t pmap, vm_offset_t va, int flags);
 
 static vm_page_t _pmap_allocpte(pmap_t pmap, vm_pindex_t ptepindex, int flags);
 static int _pmap_unwire_pte_hold(pmap_t pmap, vm_offset_t va, vm_page_t m,
                 vm_page_t* free);
 static int pmap_unuse_pt(pmap_t, vm_offset_t, pd_entry_t, vm_page_t *);
 static vm_offset_t pmap_kmem_choose(vm_offset_t addr);
 
 CTASSERT(1 << PDESHIFT == sizeof(pd_entry_t));
 CTASSERT(1 << PTESHIFT == sizeof(pt_entry_t));
 
 /*
  * Move the kernel virtual free pointer to the next
  * 2MB.  This is used to help improve performance
  * by using a large (2MB) page for much of the kernel
  * (.text, .data, .bss)
  */
 static vm_offset_t
 pmap_kmem_choose(vm_offset_t addr)
 {
 	vm_offset_t newaddr = addr;
 
 	newaddr = (addr + (NBPDR - 1)) & ~(NBPDR - 1);
 	return (newaddr);
 }
 
 /********************/
 /* Inline functions */
 /********************/
 
 /* Return a non-clipped PD index for a given VA */
 static __inline vm_pindex_t
 pmap_pde_pindex(vm_offset_t va)
 {
 	return (va >> PDRSHIFT);
 }
 
 
 /* Return various clipped indexes for a given VA */
 static __inline vm_pindex_t
 pmap_pte_index(vm_offset_t va)
 {
 
 	return ((va >> PAGE_SHIFT) & ((1ul << NPTEPGSHIFT) - 1));
 }
 
 static __inline vm_pindex_t
 pmap_pde_index(vm_offset_t va)
 {
 
 	return ((va >> PDRSHIFT) & ((1ul << NPDEPGSHIFT) - 1));
 }
 
 static __inline vm_pindex_t
 pmap_pdpe_index(vm_offset_t va)
 {
 
 	return ((va >> PDPSHIFT) & ((1ul << NPDPEPGSHIFT) - 1));
 }
 
 static __inline vm_pindex_t
 pmap_pml4e_index(vm_offset_t va)
 {
 
 	return ((va >> PML4SHIFT) & ((1ul << NPML4EPGSHIFT) - 1));
 }
 
 /* Return a pointer to the PML4 slot that corresponds to a VA */
 static __inline pml4_entry_t *
 pmap_pml4e(pmap_t pmap, vm_offset_t va)
 {
 
 	return (&pmap->pm_pml4[pmap_pml4e_index(va)]);
 }
 
 /* Return a pointer to the PDP slot that corresponds to a VA */
 static __inline pdp_entry_t *
 pmap_pml4e_to_pdpe(pml4_entry_t *pml4e, vm_offset_t va)
 {
 	pdp_entry_t *pdpe;
 
 	pdpe = (pdp_entry_t *)PHYS_TO_DMAP(*pml4e & PG_FRAME);
 	return (&pdpe[pmap_pdpe_index(va)]);
 }
 
 /* Return a pointer to the PDP slot that corresponds to a VA */
 static __inline pdp_entry_t *
 pmap_pdpe(pmap_t pmap, vm_offset_t va)
 {
 	pml4_entry_t *pml4e;
 
 	pml4e = pmap_pml4e(pmap, va);
 	if ((*pml4e & PG_V) == 0)
 		return (NULL);
 	return (pmap_pml4e_to_pdpe(pml4e, va));
 }
 
 /* Return a pointer to the PD slot that corresponds to a VA */
 static __inline pd_entry_t *
 pmap_pdpe_to_pde(pdp_entry_t *pdpe, vm_offset_t va)
 {
 	pd_entry_t *pde;
 
 	pde = (pd_entry_t *)PHYS_TO_DMAP(*pdpe & PG_FRAME);
 	return (&pde[pmap_pde_index(va)]);
 }
 
 /* Return a pointer to the PD slot that corresponds to a VA */
 static __inline pd_entry_t *
 pmap_pde(pmap_t pmap, vm_offset_t va)
 {
 	pdp_entry_t *pdpe;
 
 	pdpe = pmap_pdpe(pmap, va);
 	if (pdpe == NULL || (*pdpe & PG_V) == 0)
 		return (NULL);
 	return (pmap_pdpe_to_pde(pdpe, va));
 }
 
 /* Return a pointer to the PT slot that corresponds to a VA */
 static __inline pt_entry_t *
 pmap_pde_to_pte(pd_entry_t *pde, vm_offset_t va)
 {
 	pt_entry_t *pte;
 
 	pte = (pt_entry_t *)PHYS_TO_DMAP(*pde & PG_FRAME);
 	return (&pte[pmap_pte_index(va)]);
 }
 
 /* Return a pointer to the PT slot that corresponds to a VA */
 static __inline pt_entry_t *
 pmap_pte(pmap_t pmap, vm_offset_t va)
 {
 	pd_entry_t *pde;
 
 	pde = pmap_pde(pmap, va);
 	if (pde == NULL || (*pde & PG_V) == 0)
 		return (NULL);
 	if ((*pde & PG_PS) != 0)	/* compat with i386 pmap_pte() */
 		return ((pt_entry_t *)pde);
 	return (pmap_pde_to_pte(pde, va));
 }
 
 static __inline void
 pmap_resident_count_inc(pmap_t pmap, int count)
 {
 
 	PMAP_LOCK_ASSERT(pmap, MA_OWNED);
 	pmap->pm_stats.resident_count += count;
 }
 
 static __inline void
 pmap_resident_count_dec(pmap_t pmap, int count)
 {
 
 	PMAP_LOCK_ASSERT(pmap, MA_OWNED);
 	pmap->pm_stats.resident_count -= count;
 }
 
 PMAP_INLINE pt_entry_t *
 vtopte(vm_offset_t va)
 {
 	u_int64_t mask = ((1ul << (NPTEPGSHIFT + NPDEPGSHIFT + NPDPEPGSHIFT + NPML4EPGSHIFT)) - 1);
 
 	return (PTmap + ((va >> PAGE_SHIFT) & mask));
 }
 
 static __inline pd_entry_t *
 vtopde(vm_offset_t va)
 {
 	u_int64_t mask = ((1ul << (NPDEPGSHIFT + NPDPEPGSHIFT + NPML4EPGSHIFT)) - 1);
 
 	return (PDmap + ((va >> PDRSHIFT) & mask));
 }
 
 static u_int64_t
 allocpages(vm_paddr_t *firstaddr, int n)
 {
 	u_int64_t ret;
 
 	ret = *firstaddr;
 	bzero((void *)ret, n * PAGE_SIZE);
 	*firstaddr += n * PAGE_SIZE;
 	return (ret);
 }
 
 CTASSERT(powerof2(NDMPML4E));
 
 static void
 create_pagetables(vm_paddr_t *firstaddr)
 {
 	int i, j, ndm1g;
 
 	/* Allocate pages */
 	KPTphys = allocpages(firstaddr, NKPT);
 	KPML4phys = allocpages(firstaddr, 1);
 	KPDPphys = allocpages(firstaddr, NKPML4E);
 	KPDphys = allocpages(firstaddr, NKPDPE);
 
 	ndmpdp = (ptoa(Maxmem) + NBPDP - 1) >> PDPSHIFT;
 	if (ndmpdp < 4)		/* Minimum 4GB of dirmap */
 		ndmpdp = 4;
 	DMPDPphys = allocpages(firstaddr, NDMPML4E);
 	ndm1g = 0;
 	if ((amd_feature & AMDID_PAGE1GB) != 0)
 		ndm1g = ptoa(Maxmem) >> PDPSHIFT;
 	if (ndm1g < ndmpdp)
 		DMPDphys = allocpages(firstaddr, ndmpdp - ndm1g);
 	dmaplimit = (vm_paddr_t)ndmpdp << PDPSHIFT;
 
 	/* Fill in the underlying page table pages */
 	/* Read-only from zero to physfree */
 	/* XXX not fully used, underneath 2M pages */
 	for (i = 0; (i << PAGE_SHIFT) < *firstaddr; i++) {
 		((pt_entry_t *)KPTphys)[i] = i << PAGE_SHIFT;
 		((pt_entry_t *)KPTphys)[i] |= PG_RW | PG_V | PG_G;
 	}
 
 	/* Now map the page tables at their location within PTmap */
 	for (i = 0; i < NKPT; i++) {
 		((pd_entry_t *)KPDphys)[i] = KPTphys + (i << PAGE_SHIFT);
 		((pd_entry_t *)KPDphys)[i] |= PG_RW | PG_V;
 	}
 
 	/* Map from zero to end of allocations under 2M pages */
 	/* This replaces some of the KPTphys entries above */
 	for (i = 0; (i << PDRSHIFT) < *firstaddr; i++) {
 		((pd_entry_t *)KPDphys)[i] = i << PDRSHIFT;
 		((pd_entry_t *)KPDphys)[i] |= PG_RW | PG_V | PG_PS | PG_G;
 	}
 
 	/* And connect up the PD to the PDP */
 	for (i = 0; i < NKPDPE; i++) {
 		((pdp_entry_t *)KPDPphys)[i + KPDPI] = KPDphys +
 		    (i << PAGE_SHIFT);
 		((pdp_entry_t *)KPDPphys)[i + KPDPI] |= PG_RW | PG_V | PG_U;
 	}
 
 	/*
 	 * Now, set up the direct map region using 2MB and/or 1GB pages.  If
 	 * the end of physical memory is not aligned to a 1GB page boundary,
 	 * then the residual physical memory is mapped with 2MB pages.  Later,
 	 * if pmap_mapdev{_attr}() uses the direct map for non-write-back
 	 * memory, pmap_change_attr() will demote any 2MB or 1GB page mappings
 	 * that are partially used. 
 	 */
 	for (i = NPDEPG * ndm1g, j = 0; i < NPDEPG * ndmpdp; i++, j++) {
 		((pd_entry_t *)DMPDphys)[j] = (vm_paddr_t)i << PDRSHIFT;
 		/* Preset PG_M and PG_A because demotion expects it. */
 		((pd_entry_t *)DMPDphys)[j] |= PG_RW | PG_V | PG_PS | PG_G |
 		    PG_M | PG_A;
 	}
 	for (i = 0; i < ndm1g; i++) {
 		((pdp_entry_t *)DMPDPphys)[i] = (vm_paddr_t)i << PDPSHIFT;
 		/* Preset PG_M and PG_A because demotion expects it. */
 		((pdp_entry_t *)DMPDPphys)[i] |= PG_RW | PG_V | PG_PS | PG_G |
 		    PG_M | PG_A;
 	}
 	for (j = 0; i < ndmpdp; i++, j++) {
 		((pdp_entry_t *)DMPDPphys)[i] = DMPDphys + (j << PAGE_SHIFT);
 		((pdp_entry_t *)DMPDPphys)[i] |= PG_RW | PG_V | PG_U;
 	}
 
 	/* And recursively map PML4 to itself in order to get PTmap */
 	((pdp_entry_t *)KPML4phys)[PML4PML4I] = KPML4phys;
 	((pdp_entry_t *)KPML4phys)[PML4PML4I] |= PG_RW | PG_V | PG_U;
 
 	/* Connect the Direct Map slot(s) up to the PML4. */
 	for (i = 0; i < NDMPML4E; i++) {
 		((pdp_entry_t *)KPML4phys)[DMPML4I + i] = DMPDPphys +
 		    (i << PAGE_SHIFT);
 		((pdp_entry_t *)KPML4phys)[DMPML4I + i] |= PG_RW | PG_V | PG_U;
 	}
 
 	/* Connect the KVA slot up to the PML4 */
 	((pdp_entry_t *)KPML4phys)[KPML4I] = KPDPphys;
 	((pdp_entry_t *)KPML4phys)[KPML4I] |= PG_RW | PG_V | PG_U;
 }
 
 /*
  *	Bootstrap the system enough to run with virtual memory.
  *
  *	On amd64 this is called after mapping has already been enabled
  *	and just syncs the pmap module with what has already been done.
  *	[We can't call it easily with mapping off since the kernel is not
  *	mapped with PA == VA, hence we would have to relocate every address
  *	from the linked base (virtual) address "KERNBASE" to the actual
  *	(physical) address starting relative to 0]
  */
 void
 pmap_bootstrap(vm_paddr_t *firstaddr)
 {
 	vm_offset_t va;
 	pt_entry_t *pte, *unused;
 
 	/*
 	 * Create an initial set of page tables to run the kernel in.
 	 */
 	create_pagetables(firstaddr);
 
 	virtual_avail = (vm_offset_t) KERNBASE + *firstaddr;
 	virtual_avail = pmap_kmem_choose(virtual_avail);
 
 	virtual_end = VM_MAX_KERNEL_ADDRESS;
 
 
 	/* XXX do %cr0 as well */
 	load_cr4(rcr4() | CR4_PGE | CR4_PSE);
 	load_cr3(KPML4phys);
 
 	/*
 	 * Initialize the kernel pmap (which is statically allocated).
 	 */
 	PMAP_LOCK_INIT(kernel_pmap);
 	kernel_pmap->pm_pml4 = (pdp_entry_t *)PHYS_TO_DMAP(KPML4phys);
 	kernel_pmap->pm_root = NULL;
-	kernel_pmap->pm_active = -1;	/* don't allow deactivation */
+	CPU_FILL(&kernel_pmap->pm_active);	/* don't allow deactivation */
 	TAILQ_INIT(&kernel_pmap->pm_pvchunk);
 
 	/*
 	 * Reserve some special page table entries/VA space for temporary
 	 * mapping of pages.
 	 */
 #define	SYSMAP(c, p, v, n)	\
 	v = (c)va; va += ((n)*PAGE_SIZE); p = pte; pte += (n);
 
 	va = virtual_avail;
 	pte = vtopte(va);
 
 	/*
 	 * CMAP1 is only used for the memory test.
 	 */
 	SYSMAP(caddr_t, CMAP1, CADDR1, 1)
 
 	/*
 	 * Crashdump maps.
 	 */
 	SYSMAP(caddr_t, unused, crashdumpmap, MAXDUMPPGS)
 
 	virtual_avail = va;
 
 	/* Initialize the PAT MSR. */
 	pmap_init_pat();
 }
 
 /*
  * Setup the PAT MSR.
  */
 void
 pmap_init_pat(void)
 {
 	int pat_table[PAT_INDEX_SIZE];
 	uint64_t pat_msr;
 	u_long cr0, cr4;
 	int i;
 
 	/* Bail if this CPU doesn't implement PAT. */
 	if ((cpu_feature & CPUID_PAT) == 0)
 		panic("no PAT??");
 
 	/* Set default PAT index table. */
 	for (i = 0; i < PAT_INDEX_SIZE; i++)
 		pat_table[i] = -1;
 	pat_table[PAT_WRITE_BACK] = 0;
 	pat_table[PAT_WRITE_THROUGH] = 1;
 	pat_table[PAT_UNCACHEABLE] = 3;
 	pat_table[PAT_WRITE_COMBINING] = 3;
 	pat_table[PAT_WRITE_PROTECTED] = 3;
 	pat_table[PAT_UNCACHED] = 3;
 
 	/* Initialize default PAT entries. */
 	pat_msr = PAT_VALUE(0, PAT_WRITE_BACK) |
 	    PAT_VALUE(1, PAT_WRITE_THROUGH) |
 	    PAT_VALUE(2, PAT_UNCACHED) |
 	    PAT_VALUE(3, PAT_UNCACHEABLE) |
 	    PAT_VALUE(4, PAT_WRITE_BACK) |
 	    PAT_VALUE(5, PAT_WRITE_THROUGH) |
 	    PAT_VALUE(6, PAT_UNCACHED) |
 	    PAT_VALUE(7, PAT_UNCACHEABLE);
 
 	if (pat_works) {
 		/*
 		 * Leave the indices 0-3 at the default of WB, WT, UC-, and UC.
 		 * Program 5 and 6 as WP and WC.
 		 * Leave 4 and 7 as WB and UC.
 		 */
 		pat_msr &= ~(PAT_MASK(5) | PAT_MASK(6));
 		pat_msr |= PAT_VALUE(5, PAT_WRITE_PROTECTED) |
 		    PAT_VALUE(6, PAT_WRITE_COMBINING);
 		pat_table[PAT_UNCACHED] = 2;
 		pat_table[PAT_WRITE_PROTECTED] = 5;
 		pat_table[PAT_WRITE_COMBINING] = 6;
 	} else {
 		/*
 		 * Just replace PAT Index 2 with WC instead of UC-.
 		 */
 		pat_msr &= ~PAT_MASK(2);
 		pat_msr |= PAT_VALUE(2, PAT_WRITE_COMBINING);
 		pat_table[PAT_WRITE_COMBINING] = 2;
 	}
 
 	/* Disable PGE. */
 	cr4 = rcr4();
 	load_cr4(cr4 & ~CR4_PGE);
 
 	/* Disable caches (CD = 1, NW = 0). */
 	cr0 = rcr0();
 	load_cr0((cr0 & ~CR0_NW) | CR0_CD);
 
 	/* Flushes caches and TLBs. */
 	wbinvd();
 	invltlb();
 
 	/* Update PAT and index table. */
 	wrmsr(MSR_PAT, pat_msr);
 	for (i = 0; i < PAT_INDEX_SIZE; i++)
 		pat_index[i] = pat_table[i];
 
 	/* Flush caches and TLBs again. */
 	wbinvd();
 	invltlb();
 
 	/* Restore caches and PGE. */
 	load_cr0(cr0);
 	load_cr4(cr4);
 }
 
 /*
  *	Initialize a vm_page's machine-dependent fields.
  */
 void
 pmap_page_init(vm_page_t m)
 {
 
 	TAILQ_INIT(&m->md.pv_list);
 	m->md.pat_mode = PAT_WRITE_BACK;
 }
 
 /*
  *	Initialize the pmap module.
  *	Called by vm_init, to initialize any structures that the pmap
  *	system needs to map virtual memory.
  */
 void
 pmap_init(void)
 {
 	vm_page_t mpte;
 	vm_size_t s;
 	int i, pv_npg;
 
 	/*
 	 * Initialize the vm page array entries for the kernel pmap's
 	 * page table pages.
 	 */ 
 	for (i = 0; i < NKPT; i++) {
 		mpte = PHYS_TO_VM_PAGE(KPTphys + (i << PAGE_SHIFT));
 		KASSERT(mpte >= vm_page_array &&
 		    mpte < &vm_page_array[vm_page_array_size],
 		    ("pmap_init: page table page is out of range"));
 		mpte->pindex = pmap_pde_pindex(KERNBASE) + i;
 		mpte->phys_addr = KPTphys + (i << PAGE_SHIFT);
 	}
 
 	/*
 	 * Initialize the address space (zone) for the pv entries.  Set a
 	 * high water mark so that the system can recover from excessive
 	 * numbers of pv entries.
 	 */
 	TUNABLE_INT_FETCH("vm.pmap.shpgperproc", &shpgperproc);
 	pv_entry_max = shpgperproc * maxproc + cnt.v_page_count;
 	TUNABLE_INT_FETCH("vm.pmap.pv_entries", &pv_entry_max);
 	pv_entry_high_water = 9 * (pv_entry_max / 10);
 
 	/*
 	 * If the kernel is running in a virtual machine on an AMD Family 10h
 	 * processor, then it must assume that MCA is enabled by the virtual
 	 * machine monitor.
 	 */
 	if (vm_guest == VM_GUEST_VM && cpu_vendor_id == CPU_VENDOR_AMD &&
 	    CPUID_TO_FAMILY(cpu_id) == 0x10)
 		workaround_erratum383 = 1;
 
 	/*
 	 * Are large page mappings enabled?
 	 */
 	TUNABLE_INT_FETCH("vm.pmap.pg_ps_enabled", &pg_ps_enabled);
 	if (pg_ps_enabled) {
 		KASSERT(MAXPAGESIZES > 1 && pagesizes[1] == 0,
 		    ("pmap_init: can't assign to pagesizes[1]"));
 		pagesizes[1] = NBPDR;
 	}
 
 	/*
 	 * Calculate the size of the pv head table for superpages.
 	 */
 	for (i = 0; phys_avail[i + 1]; i += 2);
 	pv_npg = round_2mpage(phys_avail[(i - 2) + 1]) / NBPDR;
 
 	/*
 	 * Allocate memory for the pv head table for superpages.
 	 */
 	s = (vm_size_t)(pv_npg * sizeof(struct md_page));
 	s = round_page(s);
 	pv_table = (struct md_page *)kmem_alloc(kernel_map, s);
 	for (i = 0; i < pv_npg; i++)
 		TAILQ_INIT(&pv_table[i].pv_list);
 }
 
 static int
 pmap_pventry_proc(SYSCTL_HANDLER_ARGS)
 {
 	int error;
 
 	error = sysctl_handle_int(oidp, oidp->oid_arg1, oidp->oid_arg2, req);
 	if (error == 0 && req->newptr) {
 		shpgperproc = (pv_entry_max - cnt.v_page_count) / maxproc;
 		pv_entry_high_water = 9 * (pv_entry_max / 10);
 	}
 	return (error);
 }
 SYSCTL_PROC(_vm_pmap, OID_AUTO, pv_entry_max, CTLTYPE_INT|CTLFLAG_RW, 
     &pv_entry_max, 0, pmap_pventry_proc, "IU", "Max number of PV entries");
 
 static int
 pmap_shpgperproc_proc(SYSCTL_HANDLER_ARGS)
 {
 	int error;
 
 	error = sysctl_handle_int(oidp, oidp->oid_arg1, oidp->oid_arg2, req);
 	if (error == 0 && req->newptr) {
 		pv_entry_max = shpgperproc * maxproc + cnt.v_page_count;
 		pv_entry_high_water = 9 * (pv_entry_max / 10);
 	}
 	return (error);
 }
 SYSCTL_PROC(_vm_pmap, OID_AUTO, shpgperproc, CTLTYPE_INT|CTLFLAG_RW, 
     &shpgperproc, 0, pmap_shpgperproc_proc, "IU", "Page share factor per proc");
 
 SYSCTL_NODE(_vm_pmap, OID_AUTO, pde, CTLFLAG_RD, 0,
     "2MB page mapping counters");
 
 static u_long pmap_pde_demotions;
 SYSCTL_ULONG(_vm_pmap_pde, OID_AUTO, demotions, CTLFLAG_RD,
     &pmap_pde_demotions, 0, "2MB page demotions");
 
 static u_long pmap_pde_mappings;
 SYSCTL_ULONG(_vm_pmap_pde, OID_AUTO, mappings, CTLFLAG_RD,
     &pmap_pde_mappings, 0, "2MB page mappings");
 
 static u_long pmap_pde_p_failures;
 SYSCTL_ULONG(_vm_pmap_pde, OID_AUTO, p_failures, CTLFLAG_RD,
     &pmap_pde_p_failures, 0, "2MB page promotion failures");
 
 static u_long pmap_pde_promotions;
 SYSCTL_ULONG(_vm_pmap_pde, OID_AUTO, promotions, CTLFLAG_RD,
     &pmap_pde_promotions, 0, "2MB page promotions");
 
 SYSCTL_NODE(_vm_pmap, OID_AUTO, pdpe, CTLFLAG_RD, 0,
     "1GB page mapping counters");
 
 static u_long pmap_pdpe_demotions;
 SYSCTL_ULONG(_vm_pmap_pdpe, OID_AUTO, demotions, CTLFLAG_RD,
     &pmap_pdpe_demotions, 0, "1GB page demotions");
 
 /***************************************************
  * Low level helper routines.....
  ***************************************************/
 
 /*
  * Determine the appropriate bits to set in a PTE or PDE for a specified
  * caching mode.
  */
 static int
 pmap_cache_bits(int mode, boolean_t is_pde)
 {
 	int cache_bits, pat_flag, pat_idx;
 
 	if (mode < 0 || mode >= PAT_INDEX_SIZE || pat_index[mode] < 0)
 		panic("Unknown caching mode %d\n", mode);
 
 	/* The PAT bit is different for PTE's and PDE's. */
 	pat_flag = is_pde ? PG_PDE_PAT : PG_PTE_PAT;
 
 	/* Map the caching mode to a PAT index. */
 	pat_idx = pat_index[mode];
 
 	/* Map the 3-bit index value into the PAT, PCD, and PWT bits. */
 	cache_bits = 0;
 	if (pat_idx & 0x4)
 		cache_bits |= pat_flag;
 	if (pat_idx & 0x2)
 		cache_bits |= PG_NC_PCD;
 	if (pat_idx & 0x1)
 		cache_bits |= PG_NC_PWT;
 	return (cache_bits);
 }
 
 /*
  * After changing the page size for the specified virtual address in the page
  * table, flush the corresponding entries from the processor's TLB.  Only the
  * calling processor's TLB is affected.
  *
  * The calling thread must be pinned to a processor.
  */
 static void
 pmap_update_pde_invalidate(vm_offset_t va, pd_entry_t newpde)
 {
 	u_long cr4;
 
 	if ((newpde & PG_PS) == 0)
 		/* Demotion: flush a specific 2MB page mapping. */
 		invlpg(va);
 	else if ((newpde & PG_G) == 0)
 		/*
 		 * Promotion: flush every 4KB page mapping from the TLB
 		 * because there are too many to flush individually.
 		 */
 		invltlb();
 	else {
 		/*
 		 * Promotion: flush every 4KB page mapping from the TLB,
 		 * including any global (PG_G) mappings.
 		 */
 		cr4 = rcr4();
 		load_cr4(cr4 & ~CR4_PGE);
 		/*
 		 * Although preemption at this point could be detrimental to
 		 * performance, it would not lead to an error.  PG_G is simply
 		 * ignored if CR4.PGE is clear.  Moreover, in case this block
 		 * is re-entered, the load_cr4() either above or below will
 		 * modify CR4.PGE flushing the TLB.
 		 */
 		load_cr4(cr4 | CR4_PGE);
 	}
 }
 #ifdef SMP
 /*
  * For SMP, these functions have to use the IPI mechanism for coherence.
  *
  * N.B.: Before calling any of the following TLB invalidation functions,
  * the calling processor must ensure that all stores updating a non-
  * kernel page table are globally performed.  Otherwise, another
  * processor could cache an old, pre-update entry without being
  * invalidated.  This can happen one of two ways: (1) The pmap becomes
  * active on another processor after its pm_active field is checked by
  * one of the following functions but before a store updating the page
  * table is globally performed. (2) The pmap becomes active on another
  * processor before its pm_active field is checked but due to
  * speculative loads one of the following functions still reads the
  * pmap as inactive on the other processor.
  * 
  * The kernel page table is exempt because its pm_active field is
  * immutable.  The kernel page table is always active on every
  * processor.
  */
 void
 pmap_invalidate_page(pmap_t pmap, vm_offset_t va)
 {
-	cpumask_t cpumask, other_cpus;
+	cpuset_t cpumask, other_cpus;
 
 	sched_pin();
-	if (pmap == kernel_pmap || pmap->pm_active == all_cpus) {
+	if (pmap == kernel_pmap || !CPU_CMP(&pmap->pm_active, &all_cpus)) {
 		invlpg(va);
 		smp_invlpg(va);
 	} else {
 		cpumask = PCPU_GET(cpumask);
 		other_cpus = PCPU_GET(other_cpus);
-		if (pmap->pm_active & cpumask)
+		if (CPU_OVERLAP(&pmap->pm_active, &cpumask))
 			invlpg(va);
-		if (pmap->pm_active & other_cpus)
-			smp_masked_invlpg(pmap->pm_active & other_cpus, va);
+		CPU_AND(&other_cpus, &pmap->pm_active);
+		if (!CPU_EMPTY(&other_cpus))
+			smp_masked_invlpg(other_cpus, va);
 	}
 	sched_unpin();
 }
 
 void
 pmap_invalidate_range(pmap_t pmap, vm_offset_t sva, vm_offset_t eva)
 {
-	cpumask_t cpumask, other_cpus;
+	cpuset_t cpumask, other_cpus;
 	vm_offset_t addr;
 
 	sched_pin();
-	if (pmap == kernel_pmap || pmap->pm_active == all_cpus) {
+	if (pmap == kernel_pmap || !CPU_CMP(&pmap->pm_active, &all_cpus)) {
 		for (addr = sva; addr < eva; addr += PAGE_SIZE)
 			invlpg(addr);
 		smp_invlpg_range(sva, eva);
 	} else {
 		cpumask = PCPU_GET(cpumask);
 		other_cpus = PCPU_GET(other_cpus);
-		if (pmap->pm_active & cpumask)
+		if (CPU_OVERLAP(&pmap->pm_active, &cpumask))
 			for (addr = sva; addr < eva; addr += PAGE_SIZE)
 				invlpg(addr);
-		if (pmap->pm_active & other_cpus)
-			smp_masked_invlpg_range(pmap->pm_active & other_cpus,
-			    sva, eva);
+		CPU_AND(&other_cpus, &pmap->pm_active);
+		if (!CPU_EMPTY(&other_cpus))
+			smp_masked_invlpg_range(other_cpus, sva, eva);
 	}
 	sched_unpin();
 }
 
 void
 pmap_invalidate_all(pmap_t pmap)
 {
-	cpumask_t cpumask, other_cpus;
+	cpuset_t cpumask, other_cpus;
 
 	sched_pin();
-	if (pmap == kernel_pmap || pmap->pm_active == all_cpus) {
+	if (pmap == kernel_pmap || !CPU_CMP(&pmap->pm_active, &all_cpus)) {
 		invltlb();
 		smp_invltlb();
 	} else {
 		cpumask = PCPU_GET(cpumask);
 		other_cpus = PCPU_GET(other_cpus);
-		if (pmap->pm_active & cpumask)
+		if (CPU_OVERLAP(&pmap->pm_active, &cpumask))
 			invltlb();
-		if (pmap->pm_active & other_cpus)
-			smp_masked_invltlb(pmap->pm_active & other_cpus);
+		CPU_AND(&other_cpus, &pmap->pm_active);
+		if (!CPU_EMPTY(&other_cpus))
+			smp_masked_invltlb(other_cpus);
 	}
 	sched_unpin();
 }
 
 void
 pmap_invalidate_cache(void)
 {
 
 	sched_pin();
 	wbinvd();
 	smp_cache_flush();
 	sched_unpin();
 }
 
 struct pde_action {
-	cpumask_t store;	/* processor that updates the PDE */
-	cpumask_t invalidate;	/* processors that invalidate their TLB */
+	cpuset_t store;		/* processor that updates the PDE */
+	cpuset_t invalidate;	/* processors that invalidate their TLB */
 	vm_offset_t va;
 	pd_entry_t *pde;
 	pd_entry_t newpde;
 };
 
 static void
 pmap_update_pde_action(void *arg)
 {
 	struct pde_action *act = arg;
 
-	if (act->store == PCPU_GET(cpumask))
+	sched_pin();
+	if (!CPU_CMP(&act->store, PCPU_PTR(cpumask))) {
+		sched_unpin();
 		pde_store(act->pde, act->newpde);
+	} else
+		sched_unpin();
 }
 
 static void
 pmap_update_pde_teardown(void *arg)
 {
 	struct pde_action *act = arg;
 
-	if ((act->invalidate & PCPU_GET(cpumask)) != 0)
+	sched_pin();
+	if (CPU_OVERLAP(&act->invalidate, PCPU_PTR(cpumask))) {
+		sched_unpin();
 		pmap_update_pde_invalidate(act->va, act->newpde);
+	} else
+		sched_unpin();
 }
 
 /*
  * Change the page size for the specified virtual address in a way that
  * prevents any possibility of the TLB ever having two entries that map the
  * same virtual address using different page sizes.  This is the recommended
  * workaround for Erratum 383 on AMD Family 10h processors.  It prevents a
  * machine check exception for a TLB state that is improperly diagnosed as a
  * hardware error.
  */
 static void
 pmap_update_pde(pmap_t pmap, vm_offset_t va, pd_entry_t *pde, pd_entry_t newpde)
 {
 	struct pde_action act;
-	cpumask_t active, cpumask;
+	cpuset_t active, cpumask, other_cpus;
 
 	sched_pin();
 	cpumask = PCPU_GET(cpumask);
+	other_cpus = PCPU_GET(other_cpus);
 	if (pmap == kernel_pmap)
 		active = all_cpus;
 	else
 		active = pmap->pm_active;
-	if ((active & PCPU_GET(other_cpus)) != 0) {
+	if (CPU_OVERLAP(&active, &other_cpus)) { 
 		act.store = cpumask;
 		act.invalidate = active;
 		act.va = va;
 		act.pde = pde;
 		act.newpde = newpde;
-		smp_rendezvous_cpus(cpumask | active,
+		CPU_OR(&cpumask, &active);
+		smp_rendezvous_cpus(cpumask,
 		    smp_no_rendevous_barrier, pmap_update_pde_action,
 		    pmap_update_pde_teardown, &act);
 	} else {
 		pde_store(pde, newpde);
-		if ((active & cpumask) != 0)
+		if (CPU_OVERLAP(&active, &cpumask))
 			pmap_update_pde_invalidate(va, newpde);
 	}
 	sched_unpin();
 }
 #else /* !SMP */
 /*
  * Normal, non-SMP, invalidation functions.
  * We inline these within pmap.c for speed.
  */
 PMAP_INLINE void
 pmap_invalidate_page(pmap_t pmap, vm_offset_t va)
 {
 
-	if (pmap == kernel_pmap || pmap->pm_active)
+	if (pmap == kernel_pmap || !CPU_EMPTY(&pmap->pm_active))
 		invlpg(va);
 }
 
 PMAP_INLINE void
 pmap_invalidate_range(pmap_t pmap, vm_offset_t sva, vm_offset_t eva)
 {
 	vm_offset_t addr;
 
-	if (pmap == kernel_pmap || pmap->pm_active)
+	if (pmap == kernel_pmap || !CPU_EMPTY(&pmap->pm_active))
 		for (addr = sva; addr < eva; addr += PAGE_SIZE)
 			invlpg(addr);
 }
 
 PMAP_INLINE void
 pmap_invalidate_all(pmap_t pmap)
 {
 
-	if (pmap == kernel_pmap || pmap->pm_active)
+	if (pmap == kernel_pmap || !CPU_EMPTY(&pmap->pm_active))
 		invltlb();
 }
 
 PMAP_INLINE void
 pmap_invalidate_cache(void)
 {
 
 	wbinvd();
 }
 
 static void
 pmap_update_pde(pmap_t pmap, vm_offset_t va, pd_entry_t *pde, pd_entry_t newpde)
 {
 
 	pde_store(pde, newpde);
-	if (pmap == kernel_pmap || pmap->pm_active)
+	if (pmap == kernel_pmap || !CPU_EMPTY(&pmap->pm_active))
 		pmap_update_pde_invalidate(va, newpde);
 }
 #endif /* !SMP */
 
 #define PMAP_CLFLUSH_THRESHOLD   (2 * 1024 * 1024)
 
 void
 pmap_invalidate_cache_range(vm_offset_t sva, vm_offset_t eva)
 {
 
 	KASSERT((sva & PAGE_MASK) == 0,
 	    ("pmap_invalidate_cache_range: sva not page-aligned"));
 	KASSERT((eva & PAGE_MASK) == 0,
 	    ("pmap_invalidate_cache_range: eva not page-aligned"));
 
 	if (cpu_feature & CPUID_SS)
 		; /* If "Self Snoop" is supported, do nothing. */
 	else if ((cpu_feature & CPUID_CLFSH) != 0 &&
 	    eva - sva < PMAP_CLFLUSH_THRESHOLD) {
 
 		/*
 		 * Otherwise, do per-cache line flush.  Use the mfence
 		 * instruction to ensure that previous stores are
 		 * included in the write-back.  The processor
 		 * propagates flush to other processors in the cache
 		 * coherence domain.
 		 */
 		mfence();
 		for (; sva < eva; sva += cpu_clflush_line_size)
 			clflush(sva);
 		mfence();
 	} else {
 
 		/*
 		 * No targeted cache flush methods are supported by CPU,
 		 * or the supplied range is bigger than 2MB.
 		 * Globally invalidate cache.
 		 */
 		pmap_invalidate_cache();
 	}
 }
 
 /*
  * Remove the specified set of pages from the data and instruction caches.
  *
  * In contrast to pmap_invalidate_cache_range(), this function does not
  * rely on the CPU's self-snoop feature, because it is intended for use
  * when moving pages into a different cache domain.
  */
 void
 pmap_invalidate_cache_pages(vm_page_t *pages, int count)
 {
 	vm_offset_t daddr, eva;
 	int i;
 
 	if (count >= PMAP_CLFLUSH_THRESHOLD / PAGE_SIZE ||
 	    (cpu_feature & CPUID_CLFSH) == 0)
 		pmap_invalidate_cache();
 	else {
 		mfence();
 		for (i = 0; i < count; i++) {
 			daddr = PHYS_TO_DMAP(VM_PAGE_TO_PHYS(pages[i]));
 			eva = daddr + PAGE_SIZE;
 			for (; daddr < eva; daddr += cpu_clflush_line_size)
 				clflush(daddr);
 		}
 		mfence();
 	}
 }
 
 /*
  * Are we current address space or kernel?
  */
 static __inline int
 pmap_is_current(pmap_t pmap)
 {
 	return (pmap == kernel_pmap ||
 	    (pmap->pm_pml4[PML4PML4I] & PG_FRAME) == (PML4pml4e[0] & PG_FRAME));
 }
 
 /*
  *	Routine:	pmap_extract
  *	Function:
  *		Extract the physical page address associated
  *		with the given map/virtual_address pair.
  */
 vm_paddr_t 
 pmap_extract(pmap_t pmap, vm_offset_t va)
 {
 	pdp_entry_t *pdpe;
 	pd_entry_t *pde;
 	pt_entry_t *pte;
 	vm_paddr_t pa;
 
 	pa = 0;
 	PMAP_LOCK(pmap);
 	pdpe = pmap_pdpe(pmap, va);
 	if (pdpe != NULL && (*pdpe & PG_V) != 0) {
 		if ((*pdpe & PG_PS) != 0)
 			pa = (*pdpe & PG_PS_FRAME) | (va & PDPMASK);
 		else {
 			pde = pmap_pdpe_to_pde(pdpe, va);
 			if ((*pde & PG_V) != 0) {
 				if ((*pde & PG_PS) != 0) {
 					pa = (*pde & PG_PS_FRAME) |
 					    (va & PDRMASK);
 				} else {
 					pte = pmap_pde_to_pte(pde, va);
 					pa = (*pte & PG_FRAME) |
 					    (va & PAGE_MASK);
 				}
 			}
 		}
 	}
 	PMAP_UNLOCK(pmap);
 	return (pa);
 }
 
 /*
  *	Routine:	pmap_extract_and_hold
  *	Function:
  *		Atomically extract and hold the physical page
  *		with the given pmap and virtual address pair
  *		if that mapping permits the given protection.
  */
 vm_page_t
 pmap_extract_and_hold(pmap_t pmap, vm_offset_t va, vm_prot_t prot)
 {
 	pd_entry_t pde, *pdep;
 	pt_entry_t pte;
 	vm_paddr_t pa;
 	vm_page_t m;
 
 	pa = 0;
 	m = NULL;
 	PMAP_LOCK(pmap);
 retry:
 	pdep = pmap_pde(pmap, va);
 	if (pdep != NULL && (pde = *pdep)) {
 		if (pde & PG_PS) {
 			if ((pde & PG_RW) || (prot & VM_PROT_WRITE) == 0) {
 				if (vm_page_pa_tryrelock(pmap, (pde & PG_PS_FRAME) |
 				       (va & PDRMASK), &pa))
 					goto retry;
 				m = PHYS_TO_VM_PAGE((pde & PG_PS_FRAME) |
 				    (va & PDRMASK));
 				vm_page_hold(m);
 			}
 		} else {
 			pte = *pmap_pde_to_pte(pdep, va);
 			if ((pte & PG_V) &&
 			    ((pte & PG_RW) || (prot & VM_PROT_WRITE) == 0)) {
 				if (vm_page_pa_tryrelock(pmap, pte & PG_FRAME, &pa))
 					goto retry;
 				m = PHYS_TO_VM_PAGE(pte & PG_FRAME);
 				vm_page_hold(m);
 			}
 		}
 	}
 	PA_UNLOCK_COND(pa);
 	PMAP_UNLOCK(pmap);
 	return (m);
 }
 
 vm_paddr_t
 pmap_kextract(vm_offset_t va)
 {
 	pd_entry_t pde;
 	vm_paddr_t pa;
 
 	if (va >= DMAP_MIN_ADDRESS && va < DMAP_MAX_ADDRESS) {
 		pa = DMAP_TO_PHYS(va);
 	} else {
 		pde = *vtopde(va);
 		if (pde & PG_PS) {
 			pa = (pde & PG_PS_FRAME) | (va & PDRMASK);
 		} else {
 			/*
 			 * Beware of a concurrent promotion that changes the
 			 * PDE at this point!  For example, vtopte() must not
 			 * be used to access the PTE because it would use the
 			 * new PDE.  It is, however, safe to use the old PDE
 			 * because the page table page is preserved by the
 			 * promotion.
 			 */
 			pa = *pmap_pde_to_pte(&pde, va);
 			pa = (pa & PG_FRAME) | (va & PAGE_MASK);
 		}
 	}
 	return (pa);
 }
 
 /***************************************************
  * Low level mapping routines.....
  ***************************************************/
 
 /*
  * Add a wired page to the kva.
  * Note: not SMP coherent.
  */
 PMAP_INLINE void 
 pmap_kenter(vm_offset_t va, vm_paddr_t pa)
 {
 	pt_entry_t *pte;
 
 	pte = vtopte(va);
 	pte_store(pte, pa | PG_RW | PG_V | PG_G);
 }
 
 static __inline void
 pmap_kenter_attr(vm_offset_t va, vm_paddr_t pa, int mode)
 {
 	pt_entry_t *pte;
 
 	pte = vtopte(va);
 	pte_store(pte, pa | PG_RW | PG_V | PG_G | pmap_cache_bits(mode, 0));
 }
 
 /*
  * Remove a page from the kernel pagetables.
  * Note: not SMP coherent.
  */
 PMAP_INLINE void
 pmap_kremove(vm_offset_t va)
 {
 	pt_entry_t *pte;
 
 	pte = vtopte(va);
 	pte_clear(pte);
 }
 
 /*
  *	Used to map a range of physical addresses into kernel
  *	virtual address space.
  *
  *	The value passed in '*virt' is a suggested virtual address for
  *	the mapping. Architectures which can support a direct-mapped
  *	physical to virtual region can return the appropriate address
  *	within that region, leaving '*virt' unchanged. Other
  *	architectures should map the pages starting at '*virt' and
  *	update '*virt' with the first usable address after the mapped
  *	region.
  */
 vm_offset_t
 pmap_map(vm_offset_t *virt, vm_paddr_t start, vm_paddr_t end, int prot)
 {
 	return PHYS_TO_DMAP(start);
 }
 
 
 /*
  * Add a list of wired pages to the kva.
  * This routine is only used for temporary
  * kernel mappings that do not need to have
  * page modification or references recorded.
  * Note that old mappings are simply written
  * over.  The page *must* be wired.
  * Note: SMP coherent.  Uses a ranged shootdown IPI.
  */
 void
 pmap_qenter(vm_offset_t sva, vm_page_t *ma, int count)
 {
 	pt_entry_t *endpte, oldpte, pa, *pte;
 	vm_page_t m;
 
 	oldpte = 0;
 	pte = vtopte(sva);
 	endpte = pte + count;
 	while (pte < endpte) {
 		m = *ma++;
 		pa = VM_PAGE_TO_PHYS(m) | pmap_cache_bits(m->md.pat_mode, 0);
 		if ((*pte & (PG_FRAME | PG_PTE_CACHE)) != pa) {
 			oldpte |= *pte;
 			pte_store(pte, pa | PG_G | PG_RW | PG_V);
 		}
 		pte++;
 	}
 	if (__predict_false((oldpte & PG_V) != 0))
 		pmap_invalidate_range(kernel_pmap, sva, sva + count *
 		    PAGE_SIZE);
 }
 
 /*
  * This routine tears out page mappings from the
  * kernel -- it is meant only for temporary mappings.
  * Note: SMP coherent.  Uses a ranged shootdown IPI.
  */
 void
 pmap_qremove(vm_offset_t sva, int count)
 {
 	vm_offset_t va;
 
 	va = sva;
 	while (count-- > 0) {
 		pmap_kremove(va);
 		va += PAGE_SIZE;
 	}
 	pmap_invalidate_range(kernel_pmap, sva, va);
 }
 
 /***************************************************
  * Page table page management routines.....
  ***************************************************/
 static __inline void
 pmap_free_zero_pages(vm_page_t free)
 {
 	vm_page_t m;
 
 	while (free != NULL) {
 		m = free;
 		free = m->right;
 		/* Preserve the page's PG_ZERO setting. */
 		vm_page_free_toq(m);
 	}
 }
 
 /*
  * Schedule the specified unused page table page to be freed.  Specifically,
  * add the page to the specified list of pages that will be released to the
  * physical memory manager after the TLB has been updated.
  */
 static __inline void
 pmap_add_delayed_free_list(vm_page_t m, vm_page_t *free, boolean_t set_PG_ZERO)
 {
 
 	if (set_PG_ZERO)
 		m->flags |= PG_ZERO;
 	else
 		m->flags &= ~PG_ZERO;
 	m->right = *free;
 	*free = m;
 }
 	
 /*
  * Inserts the specified page table page into the specified pmap's collection
  * of idle page table pages.  Each of a pmap's page table pages is responsible
  * for mapping a distinct range of virtual addresses.  The pmap's collection is
  * ordered by this virtual address range.
  */
 static void
 pmap_insert_pt_page(pmap_t pmap, vm_page_t mpte)
 {
 	vm_page_t root;
 
 	PMAP_LOCK_ASSERT(pmap, MA_OWNED);
 	root = pmap->pm_root;
 	if (root == NULL) {
 		mpte->left = NULL;
 		mpte->right = NULL;
 	} else {
 		root = vm_page_splay(mpte->pindex, root);
 		if (mpte->pindex < root->pindex) {
 			mpte->left = root->left;
 			mpte->right = root;
 			root->left = NULL;
 		} else if (mpte->pindex == root->pindex)
 			panic("pmap_insert_pt_page: pindex already inserted");
 		else {
 			mpte->right = root->right;
 			mpte->left = root;
 			root->right = NULL;
 		}
 	}
 	pmap->pm_root = mpte;
 }
 
 /*
  * Looks for a page table page mapping the specified virtual address in the
  * specified pmap's collection of idle page table pages.  Returns NULL if there
  * is no page table page corresponding to the specified virtual address.
  */
 static vm_page_t
 pmap_lookup_pt_page(pmap_t pmap, vm_offset_t va)
 {
 	vm_page_t mpte;
 	vm_pindex_t pindex = pmap_pde_pindex(va);
 
 	PMAP_LOCK_ASSERT(pmap, MA_OWNED);
 	if ((mpte = pmap->pm_root) != NULL && mpte->pindex != pindex) {
 		mpte = vm_page_splay(pindex, mpte);
 		if ((pmap->pm_root = mpte)->pindex != pindex)
 			mpte = NULL;
 	}
 	return (mpte);
 }
 
 /*
  * Removes the specified page table page from the specified pmap's collection
  * of idle page table pages.  The specified page table page must be a member of
  * the pmap's collection.
  */
 static void
 pmap_remove_pt_page(pmap_t pmap, vm_page_t mpte)
 {
 	vm_page_t root;
 
 	PMAP_LOCK_ASSERT(pmap, MA_OWNED);
 	if (mpte != pmap->pm_root) {
 		root = vm_page_splay(mpte->pindex, pmap->pm_root);
 		KASSERT(mpte == root,
 		    ("pmap_remove_pt_page: mpte %p is missing from pmap %p",
 		    mpte, pmap));
 	}
 	if (mpte->left == NULL)
 		root = mpte->right;
 	else {
 		root = vm_page_splay(mpte->pindex, mpte->left);
 		root->right = mpte->right;
 	}
 	pmap->pm_root = root;
 }
 
 /*
  * This routine drops a reference on a page table page by decrementing its
  * wire count; if the count reaches zero, the page is unmapped and freed.
  */
 static __inline int
 pmap_unwire_pte_hold(pmap_t pmap, vm_offset_t va, vm_page_t m, vm_page_t *free)
 {
 
 	--m->wire_count;
 	if (m->wire_count == 0)
 		return (_pmap_unwire_pte_hold(pmap, va, m, free));
 	else
 		return (0);
 }
 
 static int 
 _pmap_unwire_pte_hold(pmap_t pmap, vm_offset_t va, vm_page_t m, 
     vm_page_t *free)
 {
 
 	PMAP_LOCK_ASSERT(pmap, MA_OWNED);
 	/*
 	 * unmap the page table page
 	 */
 	if (m->pindex >= (NUPDE + NUPDPE)) {
 		/* PDP page */
 		pml4_entry_t *pml4;
 		pml4 = pmap_pml4e(pmap, va);
 		*pml4 = 0;
 	} else if (m->pindex >= NUPDE) {
 		/* PD page */
 		pdp_entry_t *pdp;
 		pdp = pmap_pdpe(pmap, va);
 		*pdp = 0;
 	} else {
 		/* PTE page */
 		pd_entry_t *pd;
 		pd = pmap_pde(pmap, va);
 		*pd = 0;
 	}
 	pmap_resident_count_dec(pmap, 1);
 	if (m->pindex < NUPDE) {
 		/* We just released a PT, unhold the matching PD */
 		vm_page_t pdpg;
 
 		pdpg = PHYS_TO_VM_PAGE(*pmap_pdpe(pmap, va) & PG_FRAME);
 		pmap_unwire_pte_hold(pmap, va, pdpg, free);
 	}
 	if (m->pindex >= NUPDE && m->pindex < (NUPDE + NUPDPE)) {
 		/* We just released a PD, unhold the matching PDP */
 		vm_page_t pdppg;
 
 		pdppg = PHYS_TO_VM_PAGE(*pmap_pml4e(pmap, va) & PG_FRAME);
 		pmap_unwire_pte_hold(pmap, va, pdppg, free);
 	}
 
 	/*
 	 * This is a release store so that the ordinary store unmapping
 	 * the page table page is globally performed before TLB shoot-
 	 * down is begun.
 	 */
 	atomic_subtract_rel_int(&cnt.v_wire_count, 1);
 
 	/* 
 	 * Put page on a list so that it is released after
 	 * *ALL* TLB shootdown is done
 	 */
 	pmap_add_delayed_free_list(m, free, TRUE);
 	
 	return (1);
 }
 
 /*
  * After removing a page table entry, this routine is used to
  * conditionally free the page, and manage the hold/wire counts.
  */
 static int
 pmap_unuse_pt(pmap_t pmap, vm_offset_t va, pd_entry_t ptepde, vm_page_t *free)
 {
 	vm_page_t mpte;
 
 	if (va >= VM_MAXUSER_ADDRESS)
 		return (0);
 	KASSERT(ptepde != 0, ("pmap_unuse_pt: ptepde != 0"));
 	mpte = PHYS_TO_VM_PAGE(ptepde & PG_FRAME);
 	return (pmap_unwire_pte_hold(pmap, va, mpte, free));
 }
 
 void
 pmap_pinit0(pmap_t pmap)
 {
 
 	PMAP_LOCK_INIT(pmap);
 	pmap->pm_pml4 = (pml4_entry_t *)PHYS_TO_DMAP(KPML4phys);
 	pmap->pm_root = NULL;
-	pmap->pm_active = 0;
+	CPU_ZERO(&pmap->pm_active);
 	PCPU_SET(curpmap, pmap);
 	TAILQ_INIT(&pmap->pm_pvchunk);
 	bzero(&pmap->pm_stats, sizeof pmap->pm_stats);
 }
 
 /*
  * Initialize a preallocated and zeroed pmap structure,
  * such as one in a vmspace structure.
  */
 int
 pmap_pinit(pmap_t pmap)
 {
 	vm_page_t pml4pg;
 	static vm_pindex_t color;
 	int i;
 
 	PMAP_LOCK_INIT(pmap);
 
 	/*
 	 * allocate the PML4 (top-level page table) page
 	 */
 	while ((pml4pg = vm_page_alloc(NULL, color++, VM_ALLOC_NOOBJ |
 	    VM_ALLOC_NORMAL | VM_ALLOC_WIRED | VM_ALLOC_ZERO)) == NULL)
 		VM_WAIT;
 
 	pmap->pm_pml4 = (pml4_entry_t *)PHYS_TO_DMAP(VM_PAGE_TO_PHYS(pml4pg));
 
 	if ((pml4pg->flags & PG_ZERO) == 0)
 		pagezero(pmap->pm_pml4);
 
 	/* Wire in kernel global address entries. */
 	pmap->pm_pml4[KPML4I] = KPDPphys | PG_RW | PG_V | PG_U;
 	for (i = 0; i < NDMPML4E; i++) {
 		pmap->pm_pml4[DMPML4I + i] = (DMPDPphys + (i << PAGE_SHIFT)) |
 		    PG_RW | PG_V | PG_U;
 	}
 
 	/* install self-referential address mapping entry(s) */
 	pmap->pm_pml4[PML4PML4I] = VM_PAGE_TO_PHYS(pml4pg) | PG_V | PG_RW | PG_A | PG_M;
 
 	pmap->pm_root = NULL;
-	pmap->pm_active = 0;
+	CPU_ZERO(&pmap->pm_active);
 	TAILQ_INIT(&pmap->pm_pvchunk);
 	bzero(&pmap->pm_stats, sizeof pmap->pm_stats);
 
 	return (1);
 }
 
 /*
  * this routine is called if the page table page is not
  * mapped correctly.
  *
  * Note: If a page allocation fails at page table level two or three,
  * one or two pages may be held during the wait, only to be released
  * afterwards.  This conservative approach is easily argued to avoid
  * race conditions.
  */
 static vm_page_t
 _pmap_allocpte(pmap_t pmap, vm_pindex_t ptepindex, int flags)
 {
 	vm_page_t m, pdppg, pdpg;
 
 	KASSERT((flags & (M_NOWAIT | M_WAITOK)) == M_NOWAIT ||
 	    (flags & (M_NOWAIT | M_WAITOK)) == M_WAITOK,
 	    ("_pmap_allocpte: flags is neither M_NOWAIT nor M_WAITOK"));
 
 	PMAP_LOCK_ASSERT(pmap, MA_OWNED);
 	/*
 	 * Allocate a page table page.
 	 */
 	if ((m = vm_page_alloc(NULL, ptepindex, VM_ALLOC_NOOBJ |
 	    VM_ALLOC_WIRED | VM_ALLOC_ZERO)) == NULL) {
 		if (flags & M_WAITOK) {
 			PMAP_UNLOCK(pmap);
 			vm_page_unlock_queues();
 			VM_WAIT;
 			vm_page_lock_queues();
 			PMAP_LOCK(pmap);
 		}
 
 		/*
 		 * Indicate the need to retry.  While waiting, the page table
 		 * page may have been allocated.
 		 */
 		return (NULL);
 	}
 	if ((m->flags & PG_ZERO) == 0)
 		pmap_zero_page(m);
 
 	/*
 	 * Map the pagetable page into the process address space, if
 	 * it isn't already there.
 	 */
 
 	if (ptepindex >= (NUPDE + NUPDPE)) {
 		pml4_entry_t *pml4;
 		vm_pindex_t pml4index;
 
 		/* Wire up a new PDPE page */
 		pml4index = ptepindex - (NUPDE + NUPDPE);
 		pml4 = &pmap->pm_pml4[pml4index];
 		*pml4 = VM_PAGE_TO_PHYS(m) | PG_U | PG_RW | PG_V | PG_A | PG_M;
 
 	} else if (ptepindex >= NUPDE) {
 		vm_pindex_t pml4index;
 		vm_pindex_t pdpindex;
 		pml4_entry_t *pml4;
 		pdp_entry_t *pdp;
 
 		/* Wire up a new PDE page */
 		pdpindex = ptepindex - NUPDE;
 		pml4index = pdpindex >> NPML4EPGSHIFT;
 
 		pml4 = &pmap->pm_pml4[pml4index];
 		if ((*pml4 & PG_V) == 0) {
 			/* Have to allocate a new pdp, recurse */
 			if (_pmap_allocpte(pmap, NUPDE + NUPDPE + pml4index,
 			    flags) == NULL) {
 				--m->wire_count;
 				atomic_subtract_int(&cnt.v_wire_count, 1);
 				vm_page_free_zero(m);
 				return (NULL);
 			}
 		} else {
 			/* Add reference to pdp page */
 			pdppg = PHYS_TO_VM_PAGE(*pml4 & PG_FRAME);
 			pdppg->wire_count++;
 		}
 		pdp = (pdp_entry_t *)PHYS_TO_DMAP(*pml4 & PG_FRAME);
 
 		/* Now find the pdp page */
 		pdp = &pdp[pdpindex & ((1ul << NPDPEPGSHIFT) - 1)];
 		*pdp = VM_PAGE_TO_PHYS(m) | PG_U | PG_RW | PG_V | PG_A | PG_M;
 
 	} else {
 		vm_pindex_t pml4index;
 		vm_pindex_t pdpindex;
 		pml4_entry_t *pml4;
 		pdp_entry_t *pdp;
 		pd_entry_t *pd;
 
 		/* Wire up a new PTE page */
 		pdpindex = ptepindex >> NPDPEPGSHIFT;
 		pml4index = pdpindex >> NPML4EPGSHIFT;
 
 		/* First, find the pdp and check that it's valid. */
 		pml4 = &pmap->pm_pml4[pml4index];
 		if ((*pml4 & PG_V) == 0) {
 			/* Have to allocate a new pd, recurse */
 			if (_pmap_allocpte(pmap, NUPDE + pdpindex,
 			    flags) == NULL) {
 				--m->wire_count;
 				atomic_subtract_int(&cnt.v_wire_count, 1);
 				vm_page_free_zero(m);
 				return (NULL);
 			}
 			pdp = (pdp_entry_t *)PHYS_TO_DMAP(*pml4 & PG_FRAME);
 			pdp = &pdp[pdpindex & ((1ul << NPDPEPGSHIFT) - 1)];
 		} else {
 			pdp = (pdp_entry_t *)PHYS_TO_DMAP(*pml4 & PG_FRAME);
 			pdp = &pdp[pdpindex & ((1ul << NPDPEPGSHIFT) - 1)];
 			if ((*pdp & PG_V) == 0) {
 				/* Have to allocate a new pd, recurse */
 				if (_pmap_allocpte(pmap, NUPDE + pdpindex,
 				    flags) == NULL) {
 					--m->wire_count;
 					atomic_subtract_int(&cnt.v_wire_count,
 					    1);
 					vm_page_free_zero(m);
 					return (NULL);
 				}
 			} else {
 				/* Add reference to the pd page */
 				pdpg = PHYS_TO_VM_PAGE(*pdp & PG_FRAME);
 				pdpg->wire_count++;
 			}
 		}
 		pd = (pd_entry_t *)PHYS_TO_DMAP(*pdp & PG_FRAME);
 
 		/* Now we know where the page directory page is */
 		pd = &pd[ptepindex & ((1ul << NPDEPGSHIFT) - 1)];
 		*pd = VM_PAGE_TO_PHYS(m) | PG_U | PG_RW | PG_V | PG_A | PG_M;
 	}
 
 	pmap_resident_count_inc(pmap, 1);
 
 	return (m);
 }
 
 static vm_page_t
 pmap_allocpde(pmap_t pmap, vm_offset_t va, int flags)
 {
 	vm_pindex_t pdpindex, ptepindex;
 	pdp_entry_t *pdpe;
 	vm_page_t pdpg;
 
 	KASSERT((flags & (M_NOWAIT | M_WAITOK)) == M_NOWAIT ||
 	    (flags & (M_NOWAIT | M_WAITOK)) == M_WAITOK,
 	    ("pmap_allocpde: flags is neither M_NOWAIT nor M_WAITOK"));
 retry:
 	pdpe = pmap_pdpe(pmap, va);
 	if (pdpe != NULL && (*pdpe & PG_V) != 0) {
 		/* Add a reference to the pd page. */
 		pdpg = PHYS_TO_VM_PAGE(*pdpe & PG_FRAME);
 		pdpg->wire_count++;
 	} else {
 		/* Allocate a pd page. */
 		ptepindex = pmap_pde_pindex(va);
 		pdpindex = ptepindex >> NPDPEPGSHIFT;
 		pdpg = _pmap_allocpte(pmap, NUPDE + pdpindex, flags);
 		if (pdpg == NULL && (flags & M_WAITOK))
 			goto retry;
 	}
 	return (pdpg);
 }
 
 static vm_page_t
 pmap_allocpte(pmap_t pmap, vm_offset_t va, int flags)
 {
 	vm_pindex_t ptepindex;
 	pd_entry_t *pd;
 	vm_page_t m;
 
 	KASSERT((flags & (M_NOWAIT | M_WAITOK)) == M_NOWAIT ||
 	    (flags & (M_NOWAIT | M_WAITOK)) == M_WAITOK,
 	    ("pmap_allocpte: flags is neither M_NOWAIT nor M_WAITOK"));
 
 	/*
 	 * Calculate pagetable page index
 	 */
 	ptepindex = pmap_pde_pindex(va);
 retry:
 	/*
 	 * Get the page directory entry
 	 */
 	pd = pmap_pde(pmap, va);
 
 	/*
 	 * This supports switching from a 2MB page to a
 	 * normal 4K page.
 	 */
 	if (pd != NULL && (*pd & (PG_PS | PG_V)) == (PG_PS | PG_V)) {
 		if (!pmap_demote_pde(pmap, pd, va)) {
 			/*
 			 * Invalidation of the 2MB page mapping may have caused
 			 * the deallocation of the underlying PD page.
 			 */
 			pd = NULL;
 		}
 	}
 
 	/*
 	 * If the page table page is mapped, we just add a reference
 	 * to it by incrementing its wire count.
 	 */
 	if (pd != NULL && (*pd & PG_V) != 0) {
 		m = PHYS_TO_VM_PAGE(*pd & PG_FRAME);
 		m->wire_count++;
 	} else {
 		/*
 		 * Here if the pte page isn't mapped, or if it has been
 		 * deallocated.
 		 */
 		m = _pmap_allocpte(pmap, ptepindex, flags);
 		if (m == NULL && (flags & M_WAITOK))
 			goto retry;
 	}
 	return (m);
 }
 
 
 /***************************************************
  * Pmap allocation/deallocation routines.
  ***************************************************/
 
 /*
  * Release any resources held by the given physical map.
  * Called when a pmap initialized by pmap_pinit is being released.
  * Should only be called if the map contains no valid mappings.
  */
 void
 pmap_release(pmap_t pmap)
 {
 	vm_page_t m;
 	int i;
 
 	KASSERT(pmap->pm_stats.resident_count == 0,
 	    ("pmap_release: pmap resident count %ld != 0",
 	    pmap->pm_stats.resident_count));
 	KASSERT(pmap->pm_root == NULL,
 	    ("pmap_release: pmap has reserved page table page(s)"));
 
 	m = PHYS_TO_VM_PAGE(pmap->pm_pml4[PML4PML4I] & PG_FRAME);
 
 	pmap->pm_pml4[KPML4I] = 0;	/* KVA */
 	for (i = 0; i < NDMPML4E; i++)	/* Direct Map */
 		pmap->pm_pml4[DMPML4I + i] = 0;
 	pmap->pm_pml4[PML4PML4I] = 0;	/* Recursive Mapping */
 
 	m->wire_count--;
 	atomic_subtract_int(&cnt.v_wire_count, 1);
 	vm_page_free_zero(m);
 	PMAP_LOCK_DESTROY(pmap);
 }
 
 static int
 kvm_size(SYSCTL_HANDLER_ARGS)
 {
 	unsigned long ksize = VM_MAX_KERNEL_ADDRESS - VM_MIN_KERNEL_ADDRESS;
 
 	return sysctl_handle_long(oidp, &ksize, 0, req);
 }
 SYSCTL_PROC(_vm, OID_AUTO, kvm_size, CTLTYPE_LONG|CTLFLAG_RD, 
     0, 0, kvm_size, "LU", "Size of KVM");
 
 static int
 kvm_free(SYSCTL_HANDLER_ARGS)
 {
 	unsigned long kfree = VM_MAX_KERNEL_ADDRESS - kernel_vm_end;
 
 	return sysctl_handle_long(oidp, &kfree, 0, req);
 }
 SYSCTL_PROC(_vm, OID_AUTO, kvm_free, CTLTYPE_LONG|CTLFLAG_RD, 
     0, 0, kvm_free, "LU", "Amount of KVM free");
 
 /*
  * grow the number of kernel page table entries, if needed
  */
 void
 pmap_growkernel(vm_offset_t addr)
 {
 	vm_paddr_t paddr;
 	vm_page_t nkpg;
 	pd_entry_t *pde, newpdir;
 	pdp_entry_t *pdpe;
 
 	mtx_assert(&kernel_map->system_mtx, MA_OWNED);
 
 	/*
 	 * Return if "addr" is within the range of kernel page table pages
 	 * that were preallocated during pmap bootstrap.  Moreover, leave
 	 * "kernel_vm_end" and the kernel page table as they were.
 	 *
 	 * The correctness of this action is based on the following
 	 * argument: vm_map_findspace() allocates contiguous ranges of the
 	 * kernel virtual address space.  It calls this function if a range
 	 * ends after "kernel_vm_end".  If the kernel is mapped between
 	 * "kernel_vm_end" and "addr", then the range cannot begin at
 	 * "kernel_vm_end".  In fact, its beginning address cannot be less
 	 * than the kernel.  Thus, there is no immediate need to allocate
 	 * any new kernel page table pages between "kernel_vm_end" and
 	 * "KERNBASE".
 	 */
 	if (KERNBASE < addr && addr <= KERNBASE + NKPT * NBPDR)
 		return;
 
 	addr = roundup2(addr, NBPDR);
 	if (addr - 1 >= kernel_map->max_offset)
 		addr = kernel_map->max_offset;
 	while (kernel_vm_end < addr) {
 		pdpe = pmap_pdpe(kernel_pmap, kernel_vm_end);
 		if ((*pdpe & PG_V) == 0) {
 			/* We need a new PDP entry */
 			nkpg = vm_page_alloc(NULL, kernel_vm_end >> PDPSHIFT,
 			    VM_ALLOC_INTERRUPT | VM_ALLOC_NOOBJ |
 			    VM_ALLOC_WIRED | VM_ALLOC_ZERO);
 			if (nkpg == NULL)
 				panic("pmap_growkernel: no memory to grow kernel");
 			if ((nkpg->flags & PG_ZERO) == 0)
 				pmap_zero_page(nkpg);
 			paddr = VM_PAGE_TO_PHYS(nkpg);
 			*pdpe = (pdp_entry_t)
 				(paddr | PG_V | PG_RW | PG_A | PG_M);
 			continue; /* try again */
 		}
 		pde = pmap_pdpe_to_pde(pdpe, kernel_vm_end);
 		if ((*pde & PG_V) != 0) {
 			kernel_vm_end = (kernel_vm_end + NBPDR) & ~PDRMASK;
 			if (kernel_vm_end - 1 >= kernel_map->max_offset) {
 				kernel_vm_end = kernel_map->max_offset;
 				break;                       
 			}
 			continue;
 		}
 
 		nkpg = vm_page_alloc(NULL, pmap_pde_pindex(kernel_vm_end),
 		    VM_ALLOC_INTERRUPT | VM_ALLOC_NOOBJ | VM_ALLOC_WIRED |
 		    VM_ALLOC_ZERO);
 		if (nkpg == NULL)
 			panic("pmap_growkernel: no memory to grow kernel");
 		if ((nkpg->flags & PG_ZERO) == 0)
 			pmap_zero_page(nkpg);
 		paddr = VM_PAGE_TO_PHYS(nkpg);
 		newpdir = (pd_entry_t) (paddr | PG_V | PG_RW | PG_A | PG_M);
 		pde_store(pde, newpdir);
 
 		kernel_vm_end = (kernel_vm_end + NBPDR) & ~PDRMASK;
 		if (kernel_vm_end - 1 >= kernel_map->max_offset) {
 			kernel_vm_end = kernel_map->max_offset;
 			break;                       
 		}
 	}
 }
 
 
 /***************************************************
  * page management routines.
  ***************************************************/
 
 CTASSERT(sizeof(struct pv_chunk) == PAGE_SIZE);
 CTASSERT(_NPCM == 3);
 CTASSERT(_NPCPV == 168);
 
 static __inline struct pv_chunk *
 pv_to_chunk(pv_entry_t pv)
 {
 
 	return (struct pv_chunk *)((uintptr_t)pv & ~(uintptr_t)PAGE_MASK);
 }
 
 #define PV_PMAP(pv) (pv_to_chunk(pv)->pc_pmap)
 
 #define	PC_FREE0	0xfffffffffffffffful
 #define	PC_FREE1	0xfffffffffffffffful
 #define	PC_FREE2	0x000000fffffffffful
 
 static uint64_t pc_freemask[_NPCM] = { PC_FREE0, PC_FREE1, PC_FREE2 };
 
 SYSCTL_INT(_vm_pmap, OID_AUTO, pv_entry_count, CTLFLAG_RD, &pv_entry_count, 0,
 	"Current number of pv entries");
 
 #ifdef PV_STATS
 static int pc_chunk_count, pc_chunk_allocs, pc_chunk_frees, pc_chunk_tryfail;
 
 SYSCTL_INT(_vm_pmap, OID_AUTO, pc_chunk_count, CTLFLAG_RD, &pc_chunk_count, 0,
 	"Current number of pv entry chunks");
 SYSCTL_INT(_vm_pmap, OID_AUTO, pc_chunk_allocs, CTLFLAG_RD, &pc_chunk_allocs, 0,
 	"Current number of pv entry chunks allocated");
 SYSCTL_INT(_vm_pmap, OID_AUTO, pc_chunk_frees, CTLFLAG_RD, &pc_chunk_frees, 0,
 	"Current number of pv entry chunks frees");
 SYSCTL_INT(_vm_pmap, OID_AUTO, pc_chunk_tryfail, CTLFLAG_RD, &pc_chunk_tryfail, 0,
 	"Number of times tried to get a chunk page but failed.");
 
 static long pv_entry_frees, pv_entry_allocs;
 static int pv_entry_spare;
 
 SYSCTL_LONG(_vm_pmap, OID_AUTO, pv_entry_frees, CTLFLAG_RD, &pv_entry_frees, 0,
 	"Current number of pv entry frees");
 SYSCTL_LONG(_vm_pmap, OID_AUTO, pv_entry_allocs, CTLFLAG_RD, &pv_entry_allocs, 0,
 	"Current number of pv entry allocs");
 SYSCTL_INT(_vm_pmap, OID_AUTO, pv_entry_spare, CTLFLAG_RD, &pv_entry_spare, 0,
 	"Current number of spare pv entries");
 
 static int pmap_collect_inactive, pmap_collect_active;
 
 SYSCTL_INT(_vm_pmap, OID_AUTO, pmap_collect_inactive, CTLFLAG_RD, &pmap_collect_inactive, 0,
 	"Current number of times pmap_collect called on inactive queue");
 SYSCTL_INT(_vm_pmap, OID_AUTO, pmap_collect_active, CTLFLAG_RD, &pmap_collect_active, 0,
 	"Current number of times pmap_collect called on active queue");
 #endif
 
 /*
  * We are in a serious low memory condition.  Resort to
  * drastic measures to free some pages so we can allocate
  * another pv entry chunk.  This is normally called to
  * unmap inactive pages, and if necessary, active pages.
  *
  * We do not, however, unmap 2mpages because subsequent accesses will
  * allocate per-page pv entries until repromotion occurs, thereby
  * exacerbating the shortage of free pv entries.
  */
 static void
 pmap_collect(pmap_t locked_pmap, struct vpgqueues *vpq)
 {
 	pd_entry_t *pde;
 	pmap_t pmap;
 	pt_entry_t *pte, tpte;
 	pv_entry_t next_pv, pv;
 	vm_offset_t va;
 	vm_page_t m, free;
 
 	TAILQ_FOREACH(m, &vpq->pl, pageq) {
 		if (m->hold_count || m->busy)
 			continue;
 		TAILQ_FOREACH_SAFE(pv, &m->md.pv_list, pv_list, next_pv) {
 			va = pv->pv_va;
 			pmap = PV_PMAP(pv);
 			/* Avoid deadlock and lock recursion. */
 			if (pmap > locked_pmap)
 				PMAP_LOCK(pmap);
 			else if (pmap != locked_pmap && !PMAP_TRYLOCK(pmap))
 				continue;
 			pmap_resident_count_dec(pmap, 1);
 			pde = pmap_pde(pmap, va);
 			KASSERT((*pde & PG_PS) == 0, ("pmap_collect: found"
 			    " a 2mpage in page %p's pv list", m));
 			pte = pmap_pde_to_pte(pde, va);
 			tpte = pte_load_clear(pte);
 			KASSERT((tpte & PG_W) == 0,
 			    ("pmap_collect: wired pte %#lx", tpte));
 			if (tpte & PG_A)
 				vm_page_flag_set(m, PG_REFERENCED);
 			if ((tpte & (PG_M | PG_RW)) == (PG_M | PG_RW))
 				vm_page_dirty(m);
 			free = NULL;
 			pmap_unuse_pt(pmap, va, *pde, &free);
 			pmap_invalidate_page(pmap, va);
 			pmap_free_zero_pages(free);
 			TAILQ_REMOVE(&m->md.pv_list, pv, pv_list);
 			free_pv_entry(pmap, pv);
 			if (pmap != locked_pmap)
 				PMAP_UNLOCK(pmap);
 		}
 		if (TAILQ_EMPTY(&m->md.pv_list) &&
 		    TAILQ_EMPTY(&pa_to_pvh(VM_PAGE_TO_PHYS(m))->pv_list))
 			vm_page_flag_clear(m, PG_WRITEABLE);
 	}
 }
 
 
 /*
  * free the pv_entry back to the free list
  */
 static void
 free_pv_entry(pmap_t pmap, pv_entry_t pv)
 {
 	vm_page_t m;
 	struct pv_chunk *pc;
 	int idx, field, bit;
 
 	mtx_assert(&vm_page_queue_mtx, MA_OWNED);
 	PMAP_LOCK_ASSERT(pmap, MA_OWNED);
 	PV_STAT(pv_entry_frees++);
 	PV_STAT(pv_entry_spare++);
 	pv_entry_count--;
 	pc = pv_to_chunk(pv);
 	idx = pv - &pc->pc_pventry[0];
 	field = idx / 64;
 	bit = idx % 64;
 	pc->pc_map[field] |= 1ul << bit;
 	/* move to head of list */
 	TAILQ_REMOVE(&pmap->pm_pvchunk, pc, pc_list);
 	if (pc->pc_map[0] != PC_FREE0 || pc->pc_map[1] != PC_FREE1 ||
 	    pc->pc_map[2] != PC_FREE2) {
 		TAILQ_INSERT_HEAD(&pmap->pm_pvchunk, pc, pc_list);
 		return;
 	}
 	PV_STAT(pv_entry_spare -= _NPCPV);
 	PV_STAT(pc_chunk_count--);
 	PV_STAT(pc_chunk_frees++);
 	/* entire chunk is free, return it */
 	m = PHYS_TO_VM_PAGE(DMAP_TO_PHYS((vm_offset_t)pc));
 	dump_drop_page(m->phys_addr);
 	vm_page_unwire(m, 0);
 	vm_page_free(m);
 }
 
 /*
  * get a new pv_entry, allocating a block from the system
  * when needed.
  */
 static pv_entry_t
 get_pv_entry(pmap_t pmap, int try)
 {
 	static const struct timeval printinterval = { 60, 0 };
 	static struct timeval lastprint;
 	static vm_pindex_t colour;
 	struct vpgqueues *pq;
 	int bit, field;
 	pv_entry_t pv;
 	struct pv_chunk *pc;
 	vm_page_t m;
 
 	PMAP_LOCK_ASSERT(pmap, MA_OWNED);
 	mtx_assert(&vm_page_queue_mtx, MA_OWNED);
 	PV_STAT(pv_entry_allocs++);
 	pv_entry_count++;
 	if (pv_entry_count > pv_entry_high_water)
 		if (ratecheck(&lastprint, &printinterval))
 			printf("Approaching the limit on PV entries, consider "
 			    "increasing either the vm.pmap.shpgperproc or the "
 			    "vm.pmap.pv_entry_max sysctl.\n");
 	pq = NULL;
 retry:
 	pc = TAILQ_FIRST(&pmap->pm_pvchunk);
 	if (pc != NULL) {
 		for (field = 0; field < _NPCM; field++) {
 			if (pc->pc_map[field]) {
 				bit = bsfq(pc->pc_map[field]);
 				break;
 			}
 		}
 		if (field < _NPCM) {
 			pv = &pc->pc_pventry[field * 64 + bit];
 			pc->pc_map[field] &= ~(1ul << bit);
 			/* If this was the last item, move it to tail */
 			if (pc->pc_map[0] == 0 && pc->pc_map[1] == 0 &&
 			    pc->pc_map[2] == 0) {
 				TAILQ_REMOVE(&pmap->pm_pvchunk, pc, pc_list);
 				TAILQ_INSERT_TAIL(&pmap->pm_pvchunk, pc, pc_list);
 			}
 			PV_STAT(pv_entry_spare--);
 			return (pv);
 		}
 	}
 	/* No free items, allocate another chunk */
 	m = vm_page_alloc(NULL, colour, (pq == &vm_page_queues[PQ_ACTIVE] ?
 	    VM_ALLOC_SYSTEM : VM_ALLOC_NORMAL) | VM_ALLOC_NOOBJ |
 	    VM_ALLOC_WIRED);
 	if (m == NULL) {
 		if (try) {
 			pv_entry_count--;
 			PV_STAT(pc_chunk_tryfail++);
 			return (NULL);
 		}
 		/*
 		 * Reclaim pv entries: At first, destroy mappings to inactive
 		 * pages.  After that, if a pv chunk entry is still needed,
 		 * destroy mappings to active pages.
 		 */
 		if (pq == NULL) {
 			PV_STAT(pmap_collect_inactive++);
 			pq = &vm_page_queues[PQ_INACTIVE];
 		} else if (pq == &vm_page_queues[PQ_INACTIVE]) {
 			PV_STAT(pmap_collect_active++);
 			pq = &vm_page_queues[PQ_ACTIVE];
 		} else
 			panic("get_pv_entry: increase vm.pmap.shpgperproc");
 		pmap_collect(pmap, pq);
 		goto retry;
 	}
 	PV_STAT(pc_chunk_count++);
 	PV_STAT(pc_chunk_allocs++);
 	colour++;
 	dump_add_page(m->phys_addr);
 	pc = (void *)PHYS_TO_DMAP(m->phys_addr);
 	pc->pc_pmap = pmap;
 	pc->pc_map[0] = PC_FREE0 & ~1ul;	/* preallocated bit 0 */
 	pc->pc_map[1] = PC_FREE1;
 	pc->pc_map[2] = PC_FREE2;
 	pv = &pc->pc_pventry[0];
 	TAILQ_INSERT_HEAD(&pmap->pm_pvchunk, pc, pc_list);
 	PV_STAT(pv_entry_spare += _NPCPV - 1);
 	return (pv);
 }
 
 /*
  * First find and then remove the pv entry for the specified pmap and virtual
  * address from the specified pv list.  Returns the pv entry if found and NULL
  * otherwise.  This operation can be performed on pv lists for either 4KB or
  * 2MB page mappings.
  */
 static __inline pv_entry_t
 pmap_pvh_remove(struct md_page *pvh, pmap_t pmap, vm_offset_t va)
 {
 	pv_entry_t pv;
 
 	mtx_assert(&vm_page_queue_mtx, MA_OWNED);
 	TAILQ_FOREACH(pv, &pvh->pv_list, pv_list) {
 		if (pmap == PV_PMAP(pv) && va == pv->pv_va) {
 			TAILQ_REMOVE(&pvh->pv_list, pv, pv_list);
 			break;
 		}
 	}
 	return (pv);
 }
 
 /*
  * After demotion from a 2MB page mapping to 512 4KB page mappings,
  * destroy the pv entry for the 2MB page mapping and reinstantiate the pv
  * entries for each of the 4KB page mappings.
  */
 static void
 pmap_pv_demote_pde(pmap_t pmap, vm_offset_t va, vm_paddr_t pa)
 {
 	struct md_page *pvh;
 	pv_entry_t pv;
 	vm_offset_t va_last;
 	vm_page_t m;
 
 	mtx_assert(&vm_page_queue_mtx, MA_OWNED);
 	KASSERT((pa & PDRMASK) == 0,
 	    ("pmap_pv_demote_pde: pa is not 2mpage aligned"));
 
 	/*
 	 * Transfer the 2mpage's pv entry for this mapping to the first
 	 * page's pv list.
 	 */
 	pvh = pa_to_pvh(pa);
 	va = trunc_2mpage(va);
 	pv = pmap_pvh_remove(pvh, pmap, va);
 	KASSERT(pv != NULL, ("pmap_pv_demote_pde: pv not found"));
 	m = PHYS_TO_VM_PAGE(pa);
 	TAILQ_INSERT_TAIL(&m->md.pv_list, pv, pv_list);
 	/* Instantiate the remaining NPTEPG - 1 pv entries. */
 	va_last = va + NBPDR - PAGE_SIZE;
 	do {
 		m++;
 		KASSERT((m->flags & (PG_FICTITIOUS | PG_UNMANAGED)) == 0,
 		    ("pmap_pv_demote_pde: page %p is not managed", m));
 		va += PAGE_SIZE;
 		pmap_insert_entry(pmap, va, m);
 	} while (va < va_last);
 }
 
 /*
  * After promotion from 512 4KB page mappings to a single 2MB page mapping,
  * replace the many pv entries for the 4KB page mappings by a single pv entry
  * for the 2MB page mapping.
  */
 static void
 pmap_pv_promote_pde(pmap_t pmap, vm_offset_t va, vm_paddr_t pa)
 {
 	struct md_page *pvh;
 	pv_entry_t pv;
 	vm_offset_t va_last;
 	vm_page_t m;
 
 	mtx_assert(&vm_page_queue_mtx, MA_OWNED);
 	KASSERT((pa & PDRMASK) == 0,
 	    ("pmap_pv_promote_pde: pa is not 2mpage aligned"));
 
 	/*
 	 * Transfer the first page's pv entry for this mapping to the
 	 * 2mpage's pv list.  Aside from avoiding the cost of a call
 	 * to get_pv_entry(), a transfer avoids the possibility that
 	 * get_pv_entry() calls pmap_collect() and that pmap_collect()
 	 * removes one of the mappings that is being promoted.
 	 */
 	m = PHYS_TO_VM_PAGE(pa);
 	va = trunc_2mpage(va);
 	pv = pmap_pvh_remove(&m->md, pmap, va);
 	KASSERT(pv != NULL, ("pmap_pv_promote_pde: pv not found"));
 	pvh = pa_to_pvh(pa);
 	TAILQ_INSERT_TAIL(&pvh->pv_list, pv, pv_list);
 	/* Free the remaining NPTEPG - 1 pv entries. */
 	va_last = va + NBPDR - PAGE_SIZE;
 	do {
 		m++;
 		va += PAGE_SIZE;
 		pmap_pvh_free(&m->md, pmap, va);
 	} while (va < va_last);
 }
 
 /*
  * First find and then destroy the pv entry for the specified pmap and virtual
  * address.  This operation can be performed on pv lists for either 4KB or 2MB
  * page mappings.
  */
 static void
 pmap_pvh_free(struct md_page *pvh, pmap_t pmap, vm_offset_t va)
 {
 	pv_entry_t pv;
 
 	pv = pmap_pvh_remove(pvh, pmap, va);
 	KASSERT(pv != NULL, ("pmap_pvh_free: pv not found"));
 	free_pv_entry(pmap, pv);
 }
 
 static void
 pmap_remove_entry(pmap_t pmap, vm_page_t m, vm_offset_t va)
 {
 	struct md_page *pvh;
 
 	mtx_assert(&vm_page_queue_mtx, MA_OWNED);
 	pmap_pvh_free(&m->md, pmap, va);
 	if (TAILQ_EMPTY(&m->md.pv_list)) {
 		pvh = pa_to_pvh(VM_PAGE_TO_PHYS(m));
 		if (TAILQ_EMPTY(&pvh->pv_list))
 			vm_page_flag_clear(m, PG_WRITEABLE);
 	}
 }
 
 /*
  * Create a pv entry for the page m mapped
  * at (pmap, va).
  */
 static void
 pmap_insert_entry(pmap_t pmap, vm_offset_t va, vm_page_t m)
 {
 	pv_entry_t pv;
 
 	PMAP_LOCK_ASSERT(pmap, MA_OWNED);
 	mtx_assert(&vm_page_queue_mtx, MA_OWNED);
 	pv = get_pv_entry(pmap, FALSE);
 	pv->pv_va = va;
 	TAILQ_INSERT_TAIL(&m->md.pv_list, pv, pv_list);
 }
 
 /*
  * Conditionally create a pv entry.
  */
 static boolean_t
 pmap_try_insert_pv_entry(pmap_t pmap, vm_offset_t va, vm_page_t m)
 {
 	pv_entry_t pv;
 
 	PMAP_LOCK_ASSERT(pmap, MA_OWNED);
 	mtx_assert(&vm_page_queue_mtx, MA_OWNED);
 	if (pv_entry_count < pv_entry_high_water && 
 	    (pv = get_pv_entry(pmap, TRUE)) != NULL) {
 		pv->pv_va = va;
 		TAILQ_INSERT_TAIL(&m->md.pv_list, pv, pv_list);
 		return (TRUE);
 	} else
 		return (FALSE);
 }
 
 /*
  * Create the pv entry for a 2MB page mapping.
  */
 static boolean_t
 pmap_pv_insert_pde(pmap_t pmap, vm_offset_t va, vm_paddr_t pa)
 {
 	struct md_page *pvh;
 	pv_entry_t pv;
 
 	mtx_assert(&vm_page_queue_mtx, MA_OWNED);
 	if (pv_entry_count < pv_entry_high_water && 
 	    (pv = get_pv_entry(pmap, TRUE)) != NULL) {
 		pv->pv_va = va;
 		pvh = pa_to_pvh(pa);
 		TAILQ_INSERT_TAIL(&pvh->pv_list, pv, pv_list);
 		return (TRUE);
 	} else
 		return (FALSE);
 }
 
 /*
  * Fills a page table page with mappings to consecutive physical pages.
  */
 static void
 pmap_fill_ptp(pt_entry_t *firstpte, pt_entry_t newpte)
 {
 	pt_entry_t *pte;
 
 	for (pte = firstpte; pte < firstpte + NPTEPG; pte++) {
 		*pte = newpte;
 		newpte += PAGE_SIZE;
 	}
 }
 
 /*
  * Tries to demote a 2MB page mapping.  If demotion fails, the 2MB page
  * mapping is invalidated.
  */
 static boolean_t
 pmap_demote_pde(pmap_t pmap, pd_entry_t *pde, vm_offset_t va)
 {
 	pd_entry_t newpde, oldpde;
 	pt_entry_t *firstpte, newpte;
 	vm_paddr_t mptepa;
 	vm_page_t free, mpte;
 
 	PMAP_LOCK_ASSERT(pmap, MA_OWNED);
 	oldpde = *pde;
 	KASSERT((oldpde & (PG_PS | PG_V)) == (PG_PS | PG_V),
 	    ("pmap_demote_pde: oldpde is missing PG_PS and/or PG_V"));
 	mpte = pmap_lookup_pt_page(pmap, va);
 	if (mpte != NULL)
 		pmap_remove_pt_page(pmap, mpte);
 	else {
 		KASSERT((oldpde & PG_W) == 0,
 		    ("pmap_demote_pde: page table page for a wired mapping"
 		    " is missing"));
 
 		/*
 		 * Invalidate the 2MB page mapping and return "failure" if the
 		 * mapping was never accessed or the allocation of the new
 		 * page table page fails.  If the 2MB page mapping belongs to
 		 * the direct map region of the kernel's address space, then
 		 * the page allocation request specifies the highest possible
 		 * priority (VM_ALLOC_INTERRUPT).  Otherwise, the priority is
 		 * normal.  Page table pages are preallocated for every other
 		 * part of the kernel address space, so the direct map region
 		 * is the only part of the kernel address space that must be
 		 * handled here.
 		 */
 		if ((oldpde & PG_A) == 0 || (mpte = vm_page_alloc(NULL,
 		    pmap_pde_pindex(va), (va >= DMAP_MIN_ADDRESS && va <
 		    DMAP_MAX_ADDRESS ? VM_ALLOC_INTERRUPT : VM_ALLOC_NORMAL) |
 		    VM_ALLOC_NOOBJ | VM_ALLOC_WIRED)) == NULL) {
 			free = NULL;
 			pmap_remove_pde(pmap, pde, trunc_2mpage(va), &free);
 			pmap_invalidate_page(pmap, trunc_2mpage(va));
 			pmap_free_zero_pages(free);
 			CTR2(KTR_PMAP, "pmap_demote_pde: failure for va %#lx"
 			    " in pmap %p", va, pmap);
 			return (FALSE);
 		}
 		if (va < VM_MAXUSER_ADDRESS)
 			pmap_resident_count_inc(pmap, 1);
 	}
 	mptepa = VM_PAGE_TO_PHYS(mpte);
 	firstpte = (pt_entry_t *)PHYS_TO_DMAP(mptepa);
 	newpde = mptepa | PG_M | PG_A | (oldpde & PG_U) | PG_RW | PG_V;
 	KASSERT((oldpde & PG_A) != 0,
 	    ("pmap_demote_pde: oldpde is missing PG_A"));
 	KASSERT((oldpde & (PG_M | PG_RW)) != PG_RW,
 	    ("pmap_demote_pde: oldpde is missing PG_M"));
 	newpte = oldpde & ~PG_PS;
 	if ((newpte & PG_PDE_PAT) != 0)
 		newpte ^= PG_PDE_PAT | PG_PTE_PAT;
 
 	/*
 	 * If the page table page is new, initialize it.
 	 */
 	if (mpte->wire_count == 1) {
 		mpte->wire_count = NPTEPG;
 		pmap_fill_ptp(firstpte, newpte);
 	}
 	KASSERT((*firstpte & PG_FRAME) == (newpte & PG_FRAME),
 	    ("pmap_demote_pde: firstpte and newpte map different physical"
 	    " addresses"));
 
 	/*
 	 * If the mapping has changed attributes, update the page table
 	 * entries.
 	 */
 	if ((*firstpte & PG_PTE_PROMOTE) != (newpte & PG_PTE_PROMOTE))
 		pmap_fill_ptp(firstpte, newpte);
 
 	/*
 	 * Demote the mapping.  This pmap is locked.  The old PDE has
 	 * PG_A set.  If the old PDE has PG_RW set, it also has PG_M
 	 * set.  Thus, there is no danger of a race with another
 	 * processor changing the setting of PG_A and/or PG_M between
 	 * the read above and the store below. 
 	 */
 	if (workaround_erratum383)
 		pmap_update_pde(pmap, va, pde, newpde);
 	else
 		pde_store(pde, newpde);
 
 	/*
 	 * Invalidate a stale recursive mapping of the page table page.
 	 */
 	if (va >= VM_MAXUSER_ADDRESS)
 		pmap_invalidate_page(pmap, (vm_offset_t)vtopte(va));
 
 	/*
 	 * Demote the pv entry.  This depends on the earlier demotion
 	 * of the mapping.  Specifically, the (re)creation of a per-
 	 * page pv entry might trigger the execution of pmap_collect(),
 	 * which might reclaim a newly (re)created per-page pv entry
 	 * and destroy the associated mapping.  In order to destroy
 	 * the mapping, the PDE must have already changed from mapping
 	 * the 2mpage to referencing the page table page.
 	 */
 	if ((oldpde & PG_MANAGED) != 0)
 		pmap_pv_demote_pde(pmap, va, oldpde & PG_PS_FRAME);
 
 	pmap_pde_demotions++;
 	CTR2(KTR_PMAP, "pmap_demote_pde: success for va %#lx"
 	    " in pmap %p", va, pmap);
 	return (TRUE);
 }
 
 /*
  * pmap_remove_pde: unmap a 2MB superpage mapping from the given pmap
  */
 static int
 pmap_remove_pde(pmap_t pmap, pd_entry_t *pdq, vm_offset_t sva,
     vm_page_t *free)
 {
 	struct md_page *pvh;
 	pd_entry_t oldpde;
 	vm_offset_t eva, va;
 	vm_page_t m, mpte;
 
 	PMAP_LOCK_ASSERT(pmap, MA_OWNED);
 	KASSERT((sva & PDRMASK) == 0,
 	    ("pmap_remove_pde: sva is not 2mpage aligned"));
 	oldpde = pte_load_clear(pdq);
 	if (oldpde & PG_W)
 		pmap->pm_stats.wired_count -= NBPDR / PAGE_SIZE;
 
 	/*
 	 * Machines that don't support invlpg also don't support
 	 * PG_G.
 	 */
 	if (oldpde & PG_G)
 		pmap_invalidate_page(kernel_pmap, sva);
 	pmap_resident_count_dec(pmap, NBPDR / PAGE_SIZE);
 	if (oldpde & PG_MANAGED) {
 		pvh = pa_to_pvh(oldpde & PG_PS_FRAME);
 		pmap_pvh_free(pvh, pmap, sva);
 		eva = sva + NBPDR;
 		for (va = sva, m = PHYS_TO_VM_PAGE(oldpde & PG_PS_FRAME);
 		    va < eva; va += PAGE_SIZE, m++) {
 			if ((oldpde & (PG_M | PG_RW)) == (PG_M | PG_RW))
 				vm_page_dirty(m);
 			if (oldpde & PG_A)
 				vm_page_flag_set(m, PG_REFERENCED);
 			if (TAILQ_EMPTY(&m->md.pv_list) &&
 			    TAILQ_EMPTY(&pvh->pv_list))
 				vm_page_flag_clear(m, PG_WRITEABLE);
 		}
 	}
 	if (pmap == kernel_pmap) {
 		if (!pmap_demote_pde(pmap, pdq, sva))
 			panic("pmap_remove_pde: failed demotion");
 	} else {
 		mpte = pmap_lookup_pt_page(pmap, sva);
 		if (mpte != NULL) {
 			pmap_remove_pt_page(pmap, mpte);
 			pmap_resident_count_dec(pmap, 1);
 			KASSERT(mpte->wire_count == NPTEPG,
 			    ("pmap_remove_pde: pte page wire count error"));
 			mpte->wire_count = 0;
 			pmap_add_delayed_free_list(mpte, free, FALSE);
 			atomic_subtract_int(&cnt.v_wire_count, 1);
 		}
 	}
 	return (pmap_unuse_pt(pmap, sva, *pmap_pdpe(pmap, sva), free));
 }
 
 /*
  * pmap_remove_pte: unmap a single 4KB page mapping from the given pmap
  */
 static int
 pmap_remove_pte(pmap_t pmap, pt_entry_t *ptq, vm_offset_t va, 
     pd_entry_t ptepde, vm_page_t *free)
 {
 	pt_entry_t oldpte;
 	vm_page_t m;
 
 	PMAP_LOCK_ASSERT(pmap, MA_OWNED);
 	oldpte = pte_load_clear(ptq);
 	if (oldpte & PG_W)
 		pmap->pm_stats.wired_count -= 1;
 	pmap_resident_count_dec(pmap, 1);
 	if (oldpte & PG_MANAGED) {
 		m = PHYS_TO_VM_PAGE(oldpte & PG_FRAME);
 		if ((oldpte & (PG_M | PG_RW)) == (PG_M | PG_RW))
 			vm_page_dirty(m);
 		if (oldpte & PG_A)
 			vm_page_flag_set(m, PG_REFERENCED);
 		pmap_remove_entry(pmap, m, va);
 	}
 	return (pmap_unuse_pt(pmap, va, ptepde, free));
 }
 
 /*
  * Remove a single page from a process address space
  */
 static void
 pmap_remove_page(pmap_t pmap, vm_offset_t va, pd_entry_t *pde, vm_page_t *free)
 {
 	pt_entry_t *pte;
 
 	PMAP_LOCK_ASSERT(pmap, MA_OWNED);
 	if ((*pde & PG_V) == 0)
 		return;
 	pte = pmap_pde_to_pte(pde, va);
 	if ((*pte & PG_V) == 0)
 		return;
 	pmap_remove_pte(pmap, pte, va, *pde, free);
 	pmap_invalidate_page(pmap, va);
 }
 
 /*
  *	Remove the given range of addresses from the specified map.
  *
  *	It is assumed that the start and end are properly
  *	rounded to the page size.
  */
 void
 pmap_remove(pmap_t pmap, vm_offset_t sva, vm_offset_t eva)
 {
 	vm_offset_t va, va_next;
 	pml4_entry_t *pml4e;
 	pdp_entry_t *pdpe;
 	pd_entry_t ptpaddr, *pde;
 	pt_entry_t *pte;
 	vm_page_t free = NULL;
 	int anyvalid;
 
 	/*
 	 * Perform an unsynchronized read.  This is, however, safe.
 	 */
 	if (pmap->pm_stats.resident_count == 0)
 		return;
 
 	anyvalid = 0;
 
 	vm_page_lock_queues();
 	PMAP_LOCK(pmap);
 
 	/*
 	 * Special handling for removing a single page: this is a very
 	 * common operation, and it is easy to short-circuit some
 	 * code.
 	 */
 	if (sva + PAGE_SIZE == eva) {
 		pde = pmap_pde(pmap, sva);
 		if (pde && (*pde & PG_PS) == 0) {
 			pmap_remove_page(pmap, sva, pde, &free);
 			goto out;
 		}
 	}
 
 	for (; sva < eva; sva = va_next) {
 
 		if (pmap->pm_stats.resident_count == 0)
 			break;
 
 		pml4e = pmap_pml4e(pmap, sva);
 		if ((*pml4e & PG_V) == 0) {
 			va_next = (sva + NBPML4) & ~PML4MASK;
 			if (va_next < sva)
 				va_next = eva;
 			continue;
 		}
 
 		pdpe = pmap_pml4e_to_pdpe(pml4e, sva);
 		if ((*pdpe & PG_V) == 0) {
 			va_next = (sva + NBPDP) & ~PDPMASK;
 			if (va_next < sva)
 				va_next = eva;
 			continue;
 		}
 
 		/*
 		 * Calculate index for next page table.
 		 */
 		va_next = (sva + NBPDR) & ~PDRMASK;
 		if (va_next < sva)
 			va_next = eva;
 
 		pde = pmap_pdpe_to_pde(pdpe, sva);
 		ptpaddr = *pde;
 
 		/*
 		 * Weed out invalid mappings.
 		 */
 		if (ptpaddr == 0)
 			continue;
 
 		/*
 		 * Check for large page.
 		 */
 		if ((ptpaddr & PG_PS) != 0) {
 			/*
 			 * Are we removing the entire large page?  If not,
 			 * demote the mapping and fall through.
 			 */
 			if (sva + NBPDR == va_next && eva >= va_next) {
 				/*
 				 * The TLB entry for a PG_G mapping is
 				 * invalidated by pmap_remove_pde().
 				 */
 				if ((ptpaddr & PG_G) == 0)
 					anyvalid = 1;
 				pmap_remove_pde(pmap, pde, sva, &free);
 				continue;
 			} else if (!pmap_demote_pde(pmap, pde, sva)) {
 				/* The large page mapping was destroyed. */
 				continue;
 			} else
 				ptpaddr = *pde;
 		}
 
 		/*
 		 * Limit our scan to either the end of the va represented
 		 * by the current page table page, or to the end of the
 		 * range being removed.
 		 */
 		if (va_next > eva)
 			va_next = eva;
 
 		va = va_next;
 		for (pte = pmap_pde_to_pte(pde, sva); sva != va_next; pte++,
 		    sva += PAGE_SIZE) {
 			if (*pte == 0) {
 				if (va != va_next) {
 					pmap_invalidate_range(pmap, va, sva);
 					va = va_next;
 				}
 				continue;
 			}
 			if ((*pte & PG_G) == 0)
 				anyvalid = 1;
 			else if (va == va_next)
 				va = sva;
 			if (pmap_remove_pte(pmap, pte, sva, ptpaddr, &free)) {
 				sva += PAGE_SIZE;
 				break;
 			}
 		}
 		if (va != va_next)
 			pmap_invalidate_range(pmap, va, sva);
 	}
 out:
 	if (anyvalid)
 		pmap_invalidate_all(pmap);
 	vm_page_unlock_queues();	
 	PMAP_UNLOCK(pmap);
 	pmap_free_zero_pages(free);
 }
 
 /*
  *	Routine:	pmap_remove_all
  *	Function:
  *		Removes this physical page from
  *		all physical maps in which it resides.
  *		Reflects back modify bits to the pager.
  *
  *	Notes:
  *		Original versions of this routine were very
  *		inefficient because they iteratively called
  *		pmap_remove (slow...)
  */
 
 void
 pmap_remove_all(vm_page_t m)
 {
 	struct md_page *pvh;
 	pv_entry_t pv;
 	pmap_t pmap;
 	pt_entry_t *pte, tpte;
 	pd_entry_t *pde;
 	vm_offset_t va;
 	vm_page_t free;
 
 	KASSERT((m->flags & PG_FICTITIOUS) == 0,
 	    ("pmap_remove_all: page %p is fictitious", m));
 	free = NULL;
 	vm_page_lock_queues();
 	pvh = pa_to_pvh(VM_PAGE_TO_PHYS(m));
 	while ((pv = TAILQ_FIRST(&pvh->pv_list)) != NULL) {
 		pmap = PV_PMAP(pv);
 		PMAP_LOCK(pmap);
 		va = pv->pv_va;
 		pde = pmap_pde(pmap, va);
 		(void)pmap_demote_pde(pmap, pde, va);
 		PMAP_UNLOCK(pmap);
 	}
 	while ((pv = TAILQ_FIRST(&m->md.pv_list)) != NULL) {
 		pmap = PV_PMAP(pv);
 		PMAP_LOCK(pmap);
 		pmap_resident_count_dec(pmap, 1);
 		pde = pmap_pde(pmap, pv->pv_va);
 		KASSERT((*pde & PG_PS) == 0, ("pmap_remove_all: found"
 		    " a 2mpage in page %p's pv list", m));
 		pte = pmap_pde_to_pte(pde, pv->pv_va);
 		tpte = pte_load_clear(pte);
 		if (tpte & PG_W)
 			pmap->pm_stats.wired_count--;
 		if (tpte & PG_A)
 			vm_page_flag_set(m, PG_REFERENCED);
 
 		/*
 		 * Update the vm_page_t clean and reference bits.
 		 */
 		if ((tpte & (PG_M | PG_RW)) == (PG_M | PG_RW))
 			vm_page_dirty(m);
 		pmap_unuse_pt(pmap, pv->pv_va, *pde, &free);
 		pmap_invalidate_page(pmap, pv->pv_va);
 		TAILQ_REMOVE(&m->md.pv_list, pv, pv_list);
 		free_pv_entry(pmap, pv);
 		PMAP_UNLOCK(pmap);
 	}
 	vm_page_flag_clear(m, PG_WRITEABLE);
 	vm_page_unlock_queues();
 	pmap_free_zero_pages(free);
 }
 
 /*
  * pmap_protect_pde: change the protection on a 2MB page mapping
  */
 static boolean_t
 pmap_protect_pde(pmap_t pmap, pd_entry_t *pde, vm_offset_t sva, vm_prot_t prot)
 {
 	pd_entry_t newpde, oldpde;
 	vm_offset_t eva, va;
 	vm_page_t m;
 	boolean_t anychanged;
 
 	PMAP_LOCK_ASSERT(pmap, MA_OWNED);
 	KASSERT((sva & PDRMASK) == 0,
 	    ("pmap_protect_pde: sva is not 2mpage aligned"));
 	anychanged = FALSE;
 retry:
 	oldpde = newpde = *pde;
 	if (oldpde & PG_MANAGED) {
 		eva = sva + NBPDR;
 		for (va = sva, m = PHYS_TO_VM_PAGE(oldpde & PG_PS_FRAME);
 		    va < eva; va += PAGE_SIZE, m++)
 			if ((oldpde & (PG_M | PG_RW)) == (PG_M | PG_RW))
 				vm_page_dirty(m);
 	}
 	if ((prot & VM_PROT_WRITE) == 0)
 		newpde &= ~(PG_RW | PG_M);
 	if ((prot & VM_PROT_EXECUTE) == 0)
 		newpde |= pg_nx;
 	if (newpde != oldpde) {
 		if (!atomic_cmpset_long(pde, oldpde, newpde))
 			goto retry;
 		if (oldpde & PG_G)
 			pmap_invalidate_page(pmap, sva);
 		else
 			anychanged = TRUE;
 	}
 	return (anychanged);
 }
 
 /*
  *	Set the physical protection on the
  *	specified range of this map as requested.
  */
 void
 pmap_protect(pmap_t pmap, vm_offset_t sva, vm_offset_t eva, vm_prot_t prot)
 {
 	vm_offset_t va_next;
 	pml4_entry_t *pml4e;
 	pdp_entry_t *pdpe;
 	pd_entry_t ptpaddr, *pde;
 	pt_entry_t *pte;
 	int anychanged;
 
 	if ((prot & VM_PROT_READ) == VM_PROT_NONE) {
 		pmap_remove(pmap, sva, eva);
 		return;
 	}
 
 	if ((prot & (VM_PROT_WRITE|VM_PROT_EXECUTE)) ==
 	    (VM_PROT_WRITE|VM_PROT_EXECUTE))
 		return;
 
 	anychanged = 0;
 
 	vm_page_lock_queues();
 	PMAP_LOCK(pmap);
 	for (; sva < eva; sva = va_next) {
 
 		pml4e = pmap_pml4e(pmap, sva);
 		if ((*pml4e & PG_V) == 0) {
 			va_next = (sva + NBPML4) & ~PML4MASK;
 			if (va_next < sva)
 				va_next = eva;
 			continue;
 		}
 
 		pdpe = pmap_pml4e_to_pdpe(pml4e, sva);
 		if ((*pdpe & PG_V) == 0) {
 			va_next = (sva + NBPDP) & ~PDPMASK;
 			if (va_next < sva)
 				va_next = eva;
 			continue;
 		}
 
 		va_next = (sva + NBPDR) & ~PDRMASK;
 		if (va_next < sva)
 			va_next = eva;
 
 		pde = pmap_pdpe_to_pde(pdpe, sva);
 		ptpaddr = *pde;
 
 		/*
 		 * Weed out invalid mappings.
 		 */
 		if (ptpaddr == 0)
 			continue;
 
 		/*
 		 * Check for large page.
 		 */
 		if ((ptpaddr & PG_PS) != 0) {
 			/*
 			 * Are we protecting the entire large page?  If not,
 			 * demote the mapping and fall through.
 			 */
 			if (sva + NBPDR == va_next && eva >= va_next) {
 				/*
 				 * The TLB entry for a PG_G mapping is
 				 * invalidated by pmap_protect_pde().
 				 */
 				if (pmap_protect_pde(pmap, pde, sva, prot))
 					anychanged = 1;
 				continue;
 			} else if (!pmap_demote_pde(pmap, pde, sva)) {
 				/* The large page mapping was destroyed. */
 				continue;
 			}
 		}
 
 		if (va_next > eva)
 			va_next = eva;
 
 		for (pte = pmap_pde_to_pte(pde, sva); sva != va_next; pte++,
 		    sva += PAGE_SIZE) {
 			pt_entry_t obits, pbits;
 			vm_page_t m;
 
 retry:
 			obits = pbits = *pte;
 			if ((pbits & PG_V) == 0)
 				continue;
 
 			if ((prot & VM_PROT_WRITE) == 0) {
 				if ((pbits & (PG_MANAGED | PG_M | PG_RW)) ==
 				    (PG_MANAGED | PG_M | PG_RW)) {
 					m = PHYS_TO_VM_PAGE(pbits & PG_FRAME);
 					vm_page_dirty(m);
 				}
 				pbits &= ~(PG_RW | PG_M);
 			}
 			if ((prot & VM_PROT_EXECUTE) == 0)
 				pbits |= pg_nx;
 
 			if (pbits != obits) {
 				if (!atomic_cmpset_long(pte, obits, pbits))
 					goto retry;
 				if (obits & PG_G)
 					pmap_invalidate_page(pmap, sva);
 				else
 					anychanged = 1;
 			}
 		}
 	}
 	if (anychanged)
 		pmap_invalidate_all(pmap);
 	vm_page_unlock_queues();
 	PMAP_UNLOCK(pmap);
 }
 
 /*
  * Tries to promote the 512 contiguous 4KB page mappings that are within a
  * single page table page (PTP) to a single 2MB page mapping.  For promotion
  * to occur, two conditions must be met: (1) the 4KB page mappings must map
  * aligned, contiguous physical memory and (2) the 4KB page mappings must have
  * identical characteristics. 
  */
 static void
 pmap_promote_pde(pmap_t pmap, pd_entry_t *pde, vm_offset_t va)
 {
 	pd_entry_t newpde;
 	pt_entry_t *firstpte, oldpte, pa, *pte;
 	vm_offset_t oldpteva;
 	vm_page_t mpte;
 
 	PMAP_LOCK_ASSERT(pmap, MA_OWNED);
 
 	/*
 	 * Examine the first PTE in the specified PTP.  Abort if this PTE is
 	 * invalid, unused, or does not map the first 4KB physical page
 	 * within a 2MB page.
 	 */
 	firstpte = (pt_entry_t *)PHYS_TO_DMAP(*pde & PG_FRAME);
 setpde:
 	newpde = *firstpte;
 	if ((newpde & ((PG_FRAME & PDRMASK) | PG_A | PG_V)) != (PG_A | PG_V)) {
 		pmap_pde_p_failures++;
 		CTR2(KTR_PMAP, "pmap_promote_pde: failure for va %#lx"
 		    " in pmap %p", va, pmap);
 		return;
 	}
 	if ((newpde & (PG_M | PG_RW)) == PG_RW) {
 		/*
 		 * When PG_M is already clear, PG_RW can be cleared without
 		 * a TLB invalidation.
 		 */
 		if (!atomic_cmpset_long(firstpte, newpde, newpde & ~PG_RW))
 			goto setpde;
 		newpde &= ~PG_RW;
 	}
 
 	/*
 	 * Examine each of the other PTEs in the specified PTP.  Abort if this
 	 * PTE maps an unexpected 4KB physical page or does not have identical
 	 * characteristics to the first PTE.
 	 */
 	pa = (newpde & (PG_PS_FRAME | PG_A | PG_V)) + NBPDR - PAGE_SIZE;
 	for (pte = firstpte + NPTEPG - 1; pte > firstpte; pte--) {
 setpte:
 		oldpte = *pte;
 		if ((oldpte & (PG_FRAME | PG_A | PG_V)) != pa) {
 			pmap_pde_p_failures++;
 			CTR2(KTR_PMAP, "pmap_promote_pde: failure for va %#lx"
 			    " in pmap %p", va, pmap);
 			return;
 		}
 		if ((oldpte & (PG_M | PG_RW)) == PG_RW) {
 			/*
 			 * When PG_M is already clear, PG_RW can be cleared
 			 * without a TLB invalidation.
 			 */
 			if (!atomic_cmpset_long(pte, oldpte, oldpte & ~PG_RW))
 				goto setpte;
 			oldpte &= ~PG_RW;
 			oldpteva = (oldpte & PG_FRAME & PDRMASK) |
 			    (va & ~PDRMASK);
 			CTR2(KTR_PMAP, "pmap_promote_pde: protect for va %#lx"
 			    " in pmap %p", oldpteva, pmap);
 		}
 		if ((oldpte & PG_PTE_PROMOTE) != (newpde & PG_PTE_PROMOTE)) {
 			pmap_pde_p_failures++;
 			CTR2(KTR_PMAP, "pmap_promote_pde: failure for va %#lx"
 			    " in pmap %p", va, pmap);
 			return;
 		}
 		pa -= PAGE_SIZE;
 	}
 
 	/*
 	 * Save the page table page in its current state until the PDE
 	 * mapping the superpage is demoted by pmap_demote_pde() or
 	 * destroyed by pmap_remove_pde(). 
 	 */
 	mpte = PHYS_TO_VM_PAGE(*pde & PG_FRAME);
 	KASSERT(mpte >= vm_page_array &&
 	    mpte < &vm_page_array[vm_page_array_size],
 	    ("pmap_promote_pde: page table page is out of range"));
 	KASSERT(mpte->pindex == pmap_pde_pindex(va),
 	    ("pmap_promote_pde: page table page's pindex is wrong"));
 	pmap_insert_pt_page(pmap, mpte);
 
 	/*
 	 * Promote the pv entries.
 	 */
 	if ((newpde & PG_MANAGED) != 0)
 		pmap_pv_promote_pde(pmap, va, newpde & PG_PS_FRAME);
 
 	/*
 	 * Propagate the PAT index to its proper position.
 	 */
 	if ((newpde & PG_PTE_PAT) != 0)
 		newpde ^= PG_PDE_PAT | PG_PTE_PAT;
 
 	/*
 	 * Map the superpage.
 	 */
 	if (workaround_erratum383)
 		pmap_update_pde(pmap, va, pde, PG_PS | newpde);
 	else
 		pde_store(pde, PG_PS | newpde);
 
 	pmap_pde_promotions++;
 	CTR2(KTR_PMAP, "pmap_promote_pde: success for va %#lx"
 	    " in pmap %p", va, pmap);
 }
 
 /*
  *	Insert the given physical page (p) at
  *	the specified virtual address (v) in the
  *	target physical map with the protection requested.
  *
  *	If specified, the page will be wired down, meaning
  *	that the related pte can not be reclaimed.
  *
  *	NB:  This is the only routine which MAY NOT lazy-evaluate
  *	or lose information.  That is, this routine must actually
  *	insert this page into the given map NOW.
  */
 void
 pmap_enter(pmap_t pmap, vm_offset_t va, vm_prot_t access, vm_page_t m,
     vm_prot_t prot, boolean_t wired)
 {
 	pd_entry_t *pde;
 	pt_entry_t *pte;
 	pt_entry_t newpte, origpte;
 	pv_entry_t pv;
 	vm_paddr_t opa, pa;
 	vm_page_t mpte, om;
 	boolean_t invlva;
 
 	va = trunc_page(va);
 	KASSERT(va <= VM_MAX_KERNEL_ADDRESS, ("pmap_enter: toobig"));
 	KASSERT(va < UPT_MIN_ADDRESS || va >= UPT_MAX_ADDRESS,
 	    ("pmap_enter: invalid to pmap_enter page table pages (va: 0x%lx)",
 	    va));
 	KASSERT((m->flags & (PG_FICTITIOUS | PG_UNMANAGED)) != 0 ||
 	    (m->oflags & VPO_BUSY) != 0,
 	    ("pmap_enter: page %p is not busy", m));
 
 	mpte = NULL;
 
 	vm_page_lock_queues();
 	PMAP_LOCK(pmap);
 
 	/*
 	 * In the case that a page table page is not
 	 * resident, we are creating it here.
 	 */
 	if (va < VM_MAXUSER_ADDRESS)
 		mpte = pmap_allocpte(pmap, va, M_WAITOK);
 
 	pde = pmap_pde(pmap, va);
 	if (pde != NULL && (*pde & PG_V) != 0) {
 		if ((*pde & PG_PS) != 0)
 			panic("pmap_enter: attempted pmap_enter on 2MB page");
 		pte = pmap_pde_to_pte(pde, va);
 	} else
 		panic("pmap_enter: invalid page directory va=%#lx", va);
 
 	pa = VM_PAGE_TO_PHYS(m);
 	om = NULL;
 	origpte = *pte;
 	opa = origpte & PG_FRAME;
 
 	/*
 	 * Mapping has not changed; this must be a protection or wiring change.
 	 */
 	if (origpte && (opa == pa)) {
 		/*
 		 * Wiring change, just update stats. We don't worry about
 		 * wiring PT pages as they remain resident as long as there
 		 * are valid mappings in them. Hence, if a user page is wired,
 		 * the PT page will be also.
 		 */
 		if (wired && ((origpte & PG_W) == 0))
 			pmap->pm_stats.wired_count++;
 		else if (!wired && (origpte & PG_W))
 			pmap->pm_stats.wired_count--;
 
 		/*
 		 * Remove extra pte reference
 		 */
 		if (mpte)
 			mpte->wire_count--;
 
 		if (origpte & PG_MANAGED) {
 			om = m;
 			pa |= PG_MANAGED;
 		}
 		goto validate;
 	} 
 
 	pv = NULL;
 
 	/*
 	 * Mapping has changed, invalidate old range and fall through to
 	 * handle validating new mapping.
 	 */
 	if (opa) {
 		if (origpte & PG_W)
 			pmap->pm_stats.wired_count--;
 		if (origpte & PG_MANAGED) {
 			om = PHYS_TO_VM_PAGE(opa);
 			pv = pmap_pvh_remove(&om->md, pmap, va);
 		}
 		if (mpte != NULL) {
 			mpte->wire_count--;
 			KASSERT(mpte->wire_count > 0,
 			    ("pmap_enter: missing reference to page table page,"
 			     " va: 0x%lx", va));
 		}
 	} else
 		pmap_resident_count_inc(pmap, 1);
 
 	/*
 	 * Enter on the PV list if part of our managed memory.
 	 */
 	if ((m->flags & (PG_FICTITIOUS | PG_UNMANAGED)) == 0) {
 		KASSERT(va < kmi.clean_sva || va >= kmi.clean_eva,
 		    ("pmap_enter: managed mapping within the clean submap"));
 		if (pv == NULL)
 			pv = get_pv_entry(pmap, FALSE);
 		pv->pv_va = va;
 		TAILQ_INSERT_TAIL(&m->md.pv_list, pv, pv_list);
 		pa |= PG_MANAGED;
 	} else if (pv != NULL)
 		free_pv_entry(pmap, pv);
 
 	/*
 	 * Increment counters
 	 */
 	if (wired)
 		pmap->pm_stats.wired_count++;
 
 validate:
 	/*
 	 * Now validate mapping with desired protection/wiring.
 	 */
 	newpte = (pt_entry_t)(pa | pmap_cache_bits(m->md.pat_mode, 0) | PG_V);
 	if ((prot & VM_PROT_WRITE) != 0) {
 		newpte |= PG_RW;
 		if ((newpte & PG_MANAGED) != 0)
 			vm_page_flag_set(m, PG_WRITEABLE);
 	}
 	if ((prot & VM_PROT_EXECUTE) == 0)
 		newpte |= pg_nx;
 	if (wired)
 		newpte |= PG_W;
 	if (va < VM_MAXUSER_ADDRESS)
 		newpte |= PG_U;
 	if (pmap == kernel_pmap)
 		newpte |= PG_G;
 
 	/*
 	 * If the mapping or permission bits are different, we need
 	 * to update the pte.
 	 */
 	if ((origpte & ~(PG_M|PG_A)) != newpte) {
 		newpte |= PG_A;
 		if ((access & VM_PROT_WRITE) != 0)
 			newpte |= PG_M;
 		if (origpte & PG_V) {
 			invlva = FALSE;
 			origpte = pte_load_store(pte, newpte);
 			if (origpte & PG_A) {
 				if (origpte & PG_MANAGED)
 					vm_page_flag_set(om, PG_REFERENCED);
 				if (opa != VM_PAGE_TO_PHYS(m) || ((origpte &
 				    PG_NX) == 0 && (newpte & PG_NX)))
 					invlva = TRUE;
 			}
 			if ((origpte & (PG_M | PG_RW)) == (PG_M | PG_RW)) {
 				if ((origpte & PG_MANAGED) != 0)
 					vm_page_dirty(om);
 				if ((newpte & PG_RW) == 0)
 					invlva = TRUE;
 			}
 			if ((origpte & PG_MANAGED) != 0 &&
 			    TAILQ_EMPTY(&om->md.pv_list) &&
 			    TAILQ_EMPTY(&pa_to_pvh(opa)->pv_list))
 				vm_page_flag_clear(om, PG_WRITEABLE);
 			if (invlva)
 				pmap_invalidate_page(pmap, va);
 		} else
 			pte_store(pte, newpte);
 	}
 
 	/*
 	 * If both the page table page and the reservation are fully
 	 * populated, then attempt promotion.
 	 */
 	if ((mpte == NULL || mpte->wire_count == NPTEPG) &&
 	    pg_ps_enabled && vm_reserv_level_iffullpop(m) == 0)
 		pmap_promote_pde(pmap, pde, va);
 
 	vm_page_unlock_queues();
 	PMAP_UNLOCK(pmap);
 }
 
 /*
  * Tries to create a 2MB page mapping.  Returns TRUE if successful and FALSE
  * otherwise.  Fails if (1) a page table page cannot be allocated without
  * blocking, (2) a mapping already exists at the specified virtual address, or
  * (3) a pv entry cannot be allocated without reclaiming another pv entry. 
  */
 static boolean_t
 pmap_enter_pde(pmap_t pmap, vm_offset_t va, vm_page_t m, vm_prot_t prot)
 {
 	pd_entry_t *pde, newpde;
 	vm_page_t free, mpde;
 
 	mtx_assert(&vm_page_queue_mtx, MA_OWNED);
 	PMAP_LOCK_ASSERT(pmap, MA_OWNED);
 	if ((mpde = pmap_allocpde(pmap, va, M_NOWAIT)) == NULL) {
 		CTR2(KTR_PMAP, "pmap_enter_pde: failure for va %#lx"
 		    " in pmap %p", va, pmap);
 		return (FALSE);
 	}
 	pde = (pd_entry_t *)PHYS_TO_DMAP(VM_PAGE_TO_PHYS(mpde));
 	pde = &pde[pmap_pde_index(va)];
 	if ((*pde & PG_V) != 0) {
 		KASSERT(mpde->wire_count > 1,
 		    ("pmap_enter_pde: mpde's wire count is too low"));
 		mpde->wire_count--;
 		CTR2(KTR_PMAP, "pmap_enter_pde: failure for va %#lx"
 		    " in pmap %p", va, pmap);
 		return (FALSE);
 	}
 	newpde = VM_PAGE_TO_PHYS(m) | pmap_cache_bits(m->md.pat_mode, 1) |
 	    PG_PS | PG_V;
 	if ((m->flags & (PG_FICTITIOUS | PG_UNMANAGED)) == 0) {
 		newpde |= PG_MANAGED;
 
 		/*
 		 * Abort this mapping if its PV entry could not be created.
 		 */
 		if (!pmap_pv_insert_pde(pmap, va, VM_PAGE_TO_PHYS(m))) {
 			free = NULL;
 			if (pmap_unwire_pte_hold(pmap, va, mpde, &free)) {
 				pmap_invalidate_page(pmap, va);
 				pmap_free_zero_pages(free);
 			}
 			CTR2(KTR_PMAP, "pmap_enter_pde: failure for va %#lx"
 			    " in pmap %p", va, pmap);
 			return (FALSE);
 		}
 	}
 	if ((prot & VM_PROT_EXECUTE) == 0)
 		newpde |= pg_nx;
 	if (va < VM_MAXUSER_ADDRESS)
 		newpde |= PG_U;
 
 	/*
 	 * Increment counters.
 	 */
 	pmap_resident_count_inc(pmap, NBPDR / PAGE_SIZE);
 
 	/*
 	 * Map the superpage.
 	 */
 	pde_store(pde, newpde);
 
 	pmap_pde_mappings++;
 	CTR2(KTR_PMAP, "pmap_enter_pde: success for va %#lx"
 	    " in pmap %p", va, pmap);
 	return (TRUE);
 }
 
 /*
  * Maps a sequence of resident pages belonging to the same object.
  * The sequence begins with the given page m_start.  This page is
  * mapped at the given virtual address start.  Each subsequent page is
  * mapped at a virtual address that is offset from start by the same
  * amount as the page is offset from m_start within the object.  The
  * last page in the sequence is the page with the largest offset from
  * m_start that can be mapped at a virtual address less than the given
  * virtual address end.  Not every virtual page between start and end
  * is mapped; only those for which a resident page exists with the
  * corresponding offset from m_start are mapped.
  */
 void
 pmap_enter_object(pmap_t pmap, vm_offset_t start, vm_offset_t end,
     vm_page_t m_start, vm_prot_t prot)
 {
 	vm_offset_t va;
 	vm_page_t m, mpte;
 	vm_pindex_t diff, psize;
 
 	VM_OBJECT_LOCK_ASSERT(m_start->object, MA_OWNED);
 	psize = atop(end - start);
 	mpte = NULL;
 	m = m_start;
 	vm_page_lock_queues();
 	PMAP_LOCK(pmap);
 	while (m != NULL && (diff = m->pindex - m_start->pindex) < psize) {
 		va = start + ptoa(diff);
 		if ((va & PDRMASK) == 0 && va + NBPDR <= end &&
 		    (VM_PAGE_TO_PHYS(m) & PDRMASK) == 0 &&
 		    pg_ps_enabled && vm_reserv_level_iffullpop(m) == 0 &&
 		    pmap_enter_pde(pmap, va, m, prot))
 			m = &m[NBPDR / PAGE_SIZE - 1];
 		else
 			mpte = pmap_enter_quick_locked(pmap, va, m, prot,
 			    mpte);
 		m = TAILQ_NEXT(m, listq);
 	}
 	vm_page_unlock_queues();
 	PMAP_UNLOCK(pmap);
 }
 
 /*
  * This code makes some *MAJOR* assumptions:
  * 1. Current pmap & pmap exists.
  * 2. Not wired.
  * 3. Read access.
  * 4. No page table pages.
  * but is *MUCH* faster than pmap_enter...
  */
 
 void
 pmap_enter_quick(pmap_t pmap, vm_offset_t va, vm_page_t m, vm_prot_t prot)
 {
 
 	vm_page_lock_queues();
 	PMAP_LOCK(pmap);
 	(void)pmap_enter_quick_locked(pmap, va, m, prot, NULL);
 	vm_page_unlock_queues();
 	PMAP_UNLOCK(pmap);
 }
 
 static vm_page_t
 pmap_enter_quick_locked(pmap_t pmap, vm_offset_t va, vm_page_t m,
     vm_prot_t prot, vm_page_t mpte)
 {
 	vm_page_t free;
 	pt_entry_t *pte;
 	vm_paddr_t pa;
 
 	KASSERT(va < kmi.clean_sva || va >= kmi.clean_eva ||
 	    (m->flags & (PG_FICTITIOUS | PG_UNMANAGED)) != 0,
 	    ("pmap_enter_quick_locked: managed mapping within the clean submap"));
 	mtx_assert(&vm_page_queue_mtx, MA_OWNED);
 	PMAP_LOCK_ASSERT(pmap, MA_OWNED);
 
 	/*
 	 * In the case that a page table page is not
 	 * resident, we are creating it here.
 	 */
 	if (va < VM_MAXUSER_ADDRESS) {
 		vm_pindex_t ptepindex;
 		pd_entry_t *ptepa;
 
 		/*
 		 * Calculate pagetable page index
 		 */
 		ptepindex = pmap_pde_pindex(va);
 		if (mpte && (mpte->pindex == ptepindex)) {
 			mpte->wire_count++;
 		} else {
 			/*
 			 * Get the page directory entry
 			 */
 			ptepa = pmap_pde(pmap, va);
 
 			/*
 			 * If the page table page is already mapped, just increment
 			 * its wire count.
 			 */
 			if (ptepa && (*ptepa & PG_V) != 0) {
 				if (*ptepa & PG_PS)
 					return (NULL);
 				mpte = PHYS_TO_VM_PAGE(*ptepa & PG_FRAME);
 				mpte->wire_count++;
 			} else {
 				mpte = _pmap_allocpte(pmap, ptepindex,
 				    M_NOWAIT);
 				if (mpte == NULL)
 					return (mpte);
 			}
 		}
 		pte = (pt_entry_t *)PHYS_TO_DMAP(VM_PAGE_TO_PHYS(mpte));
 		pte = &pte[pmap_pte_index(va)];
 	} else {
 		mpte = NULL;
 		pte = vtopte(va);
 	}
 	if (*pte) {
 		if (mpte != NULL) {
 			mpte->wire_count--;
 			mpte = NULL;
 		}
 		return (mpte);
 	}
 
 	/*
 	 * Enter on the PV list if part of our managed memory.
 	 */
 	if ((m->flags & (PG_FICTITIOUS | PG_UNMANAGED)) == 0 &&
 	    !pmap_try_insert_pv_entry(pmap, va, m)) {
 		if (mpte != NULL) {
 			free = NULL;
 			if (pmap_unwire_pte_hold(pmap, va, mpte, &free)) {
 				pmap_invalidate_page(pmap, va);
 				pmap_free_zero_pages(free);
 			}
 			mpte = NULL;
 		}
 		return (mpte);
 	}
 
 	/*
 	 * Increment counters
 	 */
 	pmap_resident_count_inc(pmap, 1);
 
 	pa = VM_PAGE_TO_PHYS(m) | pmap_cache_bits(m->md.pat_mode, 0);
 	if ((prot & VM_PROT_EXECUTE) == 0)
 		pa |= pg_nx;
 
 	/*
 	 * Now validate mapping with RO protection
 	 */
 	if (m->flags & (PG_FICTITIOUS|PG_UNMANAGED))
 		pte_store(pte, pa | PG_V | PG_U);
 	else
 		pte_store(pte, pa | PG_V | PG_U | PG_MANAGED);
 	return (mpte);
 }
 
 /*
  * Make a temporary mapping for a physical address.  This is only intended
  * to be used for panic dumps.
  */
 void *
 pmap_kenter_temporary(vm_paddr_t pa, int i)
 {
 	vm_offset_t va;
 
 	va = (vm_offset_t)crashdumpmap + (i * PAGE_SIZE);
 	pmap_kenter(va, pa);
 	invlpg(va);
 	return ((void *)crashdumpmap);
 }
 
 /*
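  * This code maps large physical mmap regions into the
  * processor address space using 2MB pages.  It is purely an
  * optimization: if the region is not suitably aligned, physically
  * contiguous, and uniform in memory attribute, nothing is mapped.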
  */
 void
 pmap_object_init_pt(pmap_t pmap, vm_offset_t addr, vm_object_t object,
     vm_pindex_t pindex, vm_size_t size)
 {
 	pd_entry_t *pde;
 	vm_paddr_t pa, ptepa;
 	vm_page_t p, pdpg;
 	int pat_mode;
 
 	VM_OBJECT_LOCK_ASSERT(object, MA_OWNED);
 	KASSERT(object->type == OBJT_DEVICE || object->type == OBJT_SG,
 	    ("pmap_object_init_pt: non-device object"));
 	if ((addr & (NBPDR - 1)) == 0 && (size & (NBPDR - 1)) == 0) {
 		if (!vm_object_populate(object, pindex, pindex + atop(size)))
 			return;
 		p = vm_page_lookup(object, pindex);
 		KASSERT(p->valid == VM_PAGE_BITS_ALL,
 		    ("pmap_object_init_pt: invalid page %p", p));
 		pat_mode = p->md.pat_mode;
 
 		/*
 		 * Abort the mapping if the first page is not physically
 		 * aligned to a 2MB page boundary.
 		 */
 		ptepa = VM_PAGE_TO_PHYS(p);
 		if (ptepa & (NBPDR - 1))
 			return;
 
 		/*
 		 * Skip the first page.  Abort the mapping if the rest of
 		 * the pages are not physically contiguous or have differing
 		 * memory attributes.
 		 */
 		p = TAILQ_NEXT(p, listq);
 		for (pa = ptepa + PAGE_SIZE; pa < ptepa + size;
 		    pa += PAGE_SIZE) {
 			KASSERT(p->valid == VM_PAGE_BITS_ALL,
 			    ("pmap_object_init_pt: invalid page %p", p));
 			if (pa != VM_PAGE_TO_PHYS(p) ||
 			    pat_mode != p->md.pat_mode)
 				return;
 			p = TAILQ_NEXT(p, listq);
 		}
 
 		/*
 		 * Map using 2MB pages.  Since "ptepa" is 2M aligned and
 		 * "size" is a multiple of 2M, adding the PAT setting to "pa"
 		 * will not affect the termination of this loop.
 		 */
 		PMAP_LOCK(pmap);
 		for (pa = ptepa | pmap_cache_bits(pat_mode, 1); pa < ptepa +
 		    size; pa += NBPDR) {
 			pdpg = pmap_allocpde(pmap, addr, M_NOWAIT);
 			if (pdpg == NULL) {
 				/*
 				 * The creation of mappings below is only an
 				 * optimization.  If a page directory page
 				 * cannot be allocated without blocking,
 				 * continue on to the next mapping rather than
 				 * blocking.
 				 */
 				addr += NBPDR;
 				continue;
 			}
 			pde = (pd_entry_t *)PHYS_TO_DMAP(VM_PAGE_TO_PHYS(pdpg));
 			pde = &pde[pmap_pde_index(addr)];
 			if ((*pde & PG_V) == 0) {
 				pde_store(pde, pa | PG_PS | PG_M | PG_A |
 				    PG_U | PG_RW | PG_V);
 				pmap_resident_count_inc(pmap, NBPDR / PAGE_SIZE);
 				pmap_pde_mappings++;
 			} else {
 				/* Continue on if the PDE is already valid. */
 				pdpg->wire_count--;
 				KASSERT(pdpg->wire_count > 0,
 				    ("pmap_object_init_pt: missing reference "
 				    "to page directory page, va: 0x%lx", addr));
 			}
 			addr += NBPDR;
 		}
 		PMAP_UNLOCK(pmap);
 	}
 }
 
 /*
  *	Routine:	pmap_change_wiring
  *	Function:	Change the wiring attribute for a map/virtual-address
  *			pair.
  *	In/out conditions:
  *			The mapping must already exist in the pmap.
  */
 void
 pmap_change_wiring(pmap_t pmap, vm_offset_t va, boolean_t wired)
 {
 	pd_entry_t *pde;
 	pt_entry_t *pte;
 	boolean_t are_queues_locked;
 
 	are_queues_locked = FALSE;
 
 	/*
 	 * Wiring is not a hardware characteristic so there is no need to
 	 * invalidate TLB.
 	 */
 retry:
 	PMAP_LOCK(pmap);
 	pde = pmap_pde(pmap, va);
 	if ((*pde & PG_PS) != 0) {
 		if (!wired != ((*pde & PG_W) == 0)) {
 			if (!are_queues_locked) {
 				are_queues_locked = TRUE;
 				if (!mtx_trylock(&vm_page_queue_mtx)) {
 					PMAP_UNLOCK(pmap);
 					vm_page_lock_queues();
 					goto retry;
 				}
 			}
 			if (!pmap_demote_pde(pmap, pde, va))
 				panic("pmap_change_wiring: demotion failed");
 		} else
 			goto out;
 	}
 	pte = pmap_pde_to_pte(pde, va);
 	if (wired && (*pte & PG_W) == 0) {
 		pmap->pm_stats.wired_count++;
 		atomic_set_long(pte, PG_W);
 	} else if (!wired && (*pte & PG_W) != 0) {
 		pmap->pm_stats.wired_count--;
 		atomic_clear_long(pte, PG_W);
 	}
 out:
 	if (are_queues_locked)
 		vm_page_unlock_queues();
 	PMAP_UNLOCK(pmap);
 }
 
 /*
  *	Copy the range specified by src_addr/len
  *	from the source map to the range dst_addr/len
  *	in the destination map.
  *
  *	This routine is only advisory and need not do anything.
  */
 
 void
 pmap_copy(pmap_t dst_pmap, pmap_t src_pmap, vm_offset_t dst_addr, vm_size_t len,
     vm_offset_t src_addr)
 {
 	vm_page_t   free;
 	vm_offset_t addr;
 	vm_offset_t end_addr = src_addr + len;
 	vm_offset_t va_next;
 
 	if (dst_addr != src_addr)
 		return;
 
 	vm_page_lock_queues();
 	if (dst_pmap < src_pmap) {
 		PMAP_LOCK(dst_pmap);
 		PMAP_LOCK(src_pmap);
 	} else {
 		PMAP_LOCK(src_pmap);
 		PMAP_LOCK(dst_pmap);
 	}
 	for (addr = src_addr; addr < end_addr; addr = va_next) {
 		pt_entry_t *src_pte, *dst_pte;
 		vm_page_t dstmpde, dstmpte, srcmpte;
 		pml4_entry_t *pml4e;
 		pdp_entry_t *pdpe;
 		pd_entry_t srcptepaddr, *pde;
 
 		KASSERT(addr < UPT_MIN_ADDRESS,
 		    ("pmap_copy: invalid to pmap_copy page tables"));
 
 		pml4e = pmap_pml4e(src_pmap, addr);
 		if ((*pml4e & PG_V) == 0) {
 			va_next = (addr + NBPML4) & ~PML4MASK;
 			if (va_next < addr)
 				va_next = end_addr;
 			continue;
 		}
 
 		pdpe = pmap_pml4e_to_pdpe(pml4e, addr);
 		if ((*pdpe & PG_V) == 0) {
 			va_next = (addr + NBPDP) & ~PDPMASK;
 			if (va_next < addr)
 				va_next = end_addr;
 			continue;
 		}
 
 		va_next = (addr + NBPDR) & ~PDRMASK;
 		if (va_next < addr)
 			va_next = end_addr;
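 
 		/*
 		 * Worked example of the rounding above (a sketch, assuming
 		 * NBPDR is 2MB, i.e. 0x200000, and PDRMASK is NBPDR - 1):
 		 *
 		 *	addr    = 0x201000
 		 *	va_next = (0x201000 + 0x200000) & ~0x1fffff = 0x400000
 		 *
 		 * If addr lies in the last 2MB of the address space, the
 		 * addition wraps around and va_next < addr, so va_next is
 		 * clamped to end_addr to terminate the scan.
 		 */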
 
 		pde = pmap_pdpe_to_pde(pdpe, addr);
 		srcptepaddr = *pde;
 		if (srcptepaddr == 0)
 			continue;
 			
 		if (srcptepaddr & PG_PS) {
 			dstmpde = pmap_allocpde(dst_pmap, addr, M_NOWAIT);
 			if (dstmpde == NULL)
 				break;
 			pde = (pd_entry_t *)
 			    PHYS_TO_DMAP(VM_PAGE_TO_PHYS(dstmpde));
 			pde = &pde[pmap_pde_index(addr)];
 			if (*pde == 0 && ((srcptepaddr & PG_MANAGED) == 0 ||
 			    pmap_pv_insert_pde(dst_pmap, addr, srcptepaddr &
 			    PG_PS_FRAME))) {
 				*pde = srcptepaddr & ~PG_W;
 				pmap_resident_count_inc(dst_pmap, NBPDR / PAGE_SIZE);
 			} else
 				dstmpde->wire_count--;
 			continue;
 		}
 
 		srcptepaddr &= PG_FRAME;
 		srcmpte = PHYS_TO_VM_PAGE(srcptepaddr);
 		KASSERT(srcmpte->wire_count > 0,
 		    ("pmap_copy: source page table page is unused"));
 
 		if (va_next > end_addr)
 			va_next = end_addr;
 
 		src_pte = (pt_entry_t *)PHYS_TO_DMAP(srcptepaddr);
 		src_pte = &src_pte[pmap_pte_index(addr)];
 		dstmpte = NULL;
 		while (addr < va_next) {
 			pt_entry_t ptetemp;
 			ptetemp = *src_pte;
 			/*
 			 * We only virtually copy managed pages.
 			 */
 			if ((ptetemp & PG_MANAGED) != 0) {
 				if (dstmpte != NULL &&
 				    dstmpte->pindex == pmap_pde_pindex(addr))
 					dstmpte->wire_count++;
 				else if ((dstmpte = pmap_allocpte(dst_pmap,
 				    addr, M_NOWAIT)) == NULL)
 					goto out;
 				dst_pte = (pt_entry_t *)
 				    PHYS_TO_DMAP(VM_PAGE_TO_PHYS(dstmpte));
 				dst_pte = &dst_pte[pmap_pte_index(addr)];
 				if (*dst_pte == 0 &&
 				    pmap_try_insert_pv_entry(dst_pmap, addr,
 				    PHYS_TO_VM_PAGE(ptetemp & PG_FRAME))) {
 					/*
 					 * Clear the wired, modified, and
 					 * accessed (referenced) bits
 					 * during the copy.
 					 */
 					*dst_pte = ptetemp & ~(PG_W | PG_M |
 					    PG_A);
 					pmap_resident_count_inc(dst_pmap, 1);
 				} else {
 					free = NULL;
 					if (pmap_unwire_pte_hold(dst_pmap,
 					    addr, dstmpte, &free)) {
 						pmap_invalidate_page(dst_pmap,
 						    addr);
 						pmap_free_zero_pages(free);
 					}
 					goto out;
 				}
 				if (dstmpte->wire_count >= srcmpte->wire_count)
 					break;
 			}
 			addr += PAGE_SIZE;
 			src_pte++;
 		}
 	}
 out:
 	vm_page_unlock_queues();
 	PMAP_UNLOCK(src_pmap);
 	PMAP_UNLOCK(dst_pmap);
 }
 
 /*
  *	pmap_zero_page zeros the specified hardware page through its
  *	direct map address using pagezero.
  */
 void
 pmap_zero_page(vm_page_t m)
 {
 	vm_offset_t va = PHYS_TO_DMAP(VM_PAGE_TO_PHYS(m));
 
 	pagezero((void *)va);
 }
 
 /*
  *	pmap_zero_page_area zeros the specified area of a hardware page
  *	through the page's direct map address, using pagezero when the
  *	area covers the whole page and bzero otherwise.
  *
  *	off and size may not cover an area beyond a single hardware page.
  */
 void
 pmap_zero_page_area(vm_page_t m, int off, int size)
 {
 	vm_offset_t va = PHYS_TO_DMAP(VM_PAGE_TO_PHYS(m));
 
 	if (off == 0 && size == PAGE_SIZE)
 		pagezero((void *)va);
 	else
 		bzero((char *)va + off, size);
 }
 
 /*
  *	pmap_zero_page_idle zeros the specified hardware page through its
  *	direct map address using pagezero.  This is intended to be called
  *	from the vm_pagezero process only and outside of Giant.
  */
 void
 pmap_zero_page_idle(vm_page_t m)
 {
 	vm_offset_t va = PHYS_TO_DMAP(VM_PAGE_TO_PHYS(m));
 
 	pagezero((void *)va);
 }
 
 /*
  *	pmap_copy_page copies the specified (machine independent)
  *	page through the direct map addresses of the source and
  *	destination pages using pagecopy.
  */
 void
 pmap_copy_page(vm_page_t msrc, vm_page_t mdst)
 {
 	vm_offset_t src = PHYS_TO_DMAP(VM_PAGE_TO_PHYS(msrc));
 	vm_offset_t dst = PHYS_TO_DMAP(VM_PAGE_TO_PHYS(mdst));
 
 	pagecopy((void *)src, (void *)dst);
 }
 
 /*
  * Returns true if the pmap's pv is one of the first
  * 16 pvs linked to from this page.  This count may
  * be changed upwards or downwards in the future; it
  * is only necessary that true be returned for a small
  * subset of pmaps for proper page aging.
  */
 boolean_t
 pmap_page_exists_quick(pmap_t pmap, vm_page_t m)
 {
 	struct md_page *pvh;
 	pv_entry_t pv;
 	int loops = 0;
 	boolean_t rv;
 
 	KASSERT((m->flags & (PG_FICTITIOUS | PG_UNMANAGED)) == 0,
 	    ("pmap_page_exists_quick: page %p is not managed", m));
 	rv = FALSE;
 	vm_page_lock_queues();
 	TAILQ_FOREACH(pv, &m->md.pv_list, pv_list) {
 		if (PV_PMAP(pv) == pmap) {
 			rv = TRUE;
 			break;
 		}
 		loops++;
 		if (loops >= 16)
 			break;
 	}
 	if (!rv && loops < 16) {
 		pvh = pa_to_pvh(VM_PAGE_TO_PHYS(m));
 		TAILQ_FOREACH(pv, &pvh->pv_list, pv_list) {
 			if (PV_PMAP(pv) == pmap) {
 				rv = TRUE;
 				break;
 			}
 			loops++;
 			if (loops >= 16)
 				break;
 		}
 	}
 	vm_page_unlock_queues();
 	return (rv);
 }
 
 /*
  *	pmap_page_wired_mappings:
  *
  *	Return the number of managed mappings to the given physical page
  *	that are wired.
  */
 int
 pmap_page_wired_mappings(vm_page_t m)
 {
 	int count;
 
 	count = 0;
 	if ((m->flags & PG_FICTITIOUS) != 0)
 		return (count);
 	vm_page_lock_queues();
 	count = pmap_pvh_wired_mappings(&m->md, count);
 	count = pmap_pvh_wired_mappings(pa_to_pvh(VM_PAGE_TO_PHYS(m)), count);
 	vm_page_unlock_queues();
 	return (count);
 }
 
 /*
  *	pmap_pvh_wired_mappings:
  *
  *	Return the updated number "count" of managed mappings that are wired.
  */
 static int
 pmap_pvh_wired_mappings(struct md_page *pvh, int count)
 {
 	pmap_t pmap;
 	pt_entry_t *pte;
 	pv_entry_t pv;
 
 	mtx_assert(&vm_page_queue_mtx, MA_OWNED);
 	TAILQ_FOREACH(pv, &pvh->pv_list, pv_list) {
 		pmap = PV_PMAP(pv);
 		PMAP_LOCK(pmap);
 		pte = pmap_pte(pmap, pv->pv_va);
 		if ((*pte & PG_W) != 0)
 			count++;
 		PMAP_UNLOCK(pmap);
 	}
 	return (count);
 }
 
 /*
  * Returns TRUE if the given page is mapped individually or as part of
  * a 2mpage.  Otherwise, returns FALSE.
  */
 boolean_t
 pmap_page_is_mapped(vm_page_t m)
 {
 	boolean_t rv;
 
 	if ((m->flags & (PG_FICTITIOUS | PG_UNMANAGED)) != 0)
 		return (FALSE);
 	vm_page_lock_queues();
 	rv = !TAILQ_EMPTY(&m->md.pv_list) ||
 	    !TAILQ_EMPTY(&pa_to_pvh(VM_PAGE_TO_PHYS(m))->pv_list);
 	vm_page_unlock_queues();
 	return (rv);
 }
 
 /*
  * Remove all non-wired pages from the specified address space.
  * This aids process exit speeds.  The code is special-cased for
  * the current process only: when called with any other pmap it
  * prints a warning and returns.  This is much faster than
  * pmap_remove in the case of running down an entire address space.
  */
 void
 pmap_remove_pages(pmap_t pmap)
 {
 	pd_entry_t ptepde;
 	pt_entry_t *pte, tpte;
 	vm_page_t free = NULL;
 	vm_page_t m, mpte, mt;
 	pv_entry_t pv;
 	struct md_page *pvh;
 	struct pv_chunk *pc, *npc;
 	int field, idx;
 	int64_t bit;
 	uint64_t inuse, bitmask;
 	int allfree;
 
 	if (pmap != PCPU_GET(curpmap)) {
 		printf("warning: pmap_remove_pages called with non-current pmap\n");
 		return;
 	}
 	vm_page_lock_queues();
 	PMAP_LOCK(pmap);
 	TAILQ_FOREACH_SAFE(pc, &pmap->pm_pvchunk, pc_list, npc) {
 		allfree = 1;
 		for (field = 0; field < _NPCM; field++) {
 			inuse = (~(pc->pc_map[field])) & pc_freemask[field];
 			while (inuse != 0) {
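 				/*
 				 * Illustration of the bit scan below (a
 				 * sketch with made-up values): with
 				 * field == 1 and inuse == 0x28, bsfq
 				 * returns 3, so bitmask == 0x8 and
 				 * idx == 1 * 64 + 3 == 67; clearing that
 				 * bit leaves inuse == 0x20, and the next
 				 * pass selects bit 5 (idx == 69).
 				 */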
 				bit = bsfq(inuse);
 				bitmask = 1UL << bit;
 				idx = field * 64 + bit;
 				pv = &pc->pc_pventry[idx];
 				inuse &= ~bitmask;
 
 				pte = pmap_pdpe(pmap, pv->pv_va);
 				ptepde = *pte;
 				pte = pmap_pdpe_to_pde(pte, pv->pv_va);
 				tpte = *pte;
 				if ((tpte & (PG_PS | PG_V)) == PG_V) {
 					ptepde = tpte;
 					pte = (pt_entry_t *)PHYS_TO_DMAP(tpte &
 					    PG_FRAME);
 					pte = &pte[pmap_pte_index(pv->pv_va)];
 					tpte = *pte & ~PG_PTE_PAT;
 				}
 				if ((tpte & PG_V) == 0)
 					panic("bad pte");
 
 				/*
 				 * We cannot remove wired pages from a process'
 				 * mapping at this time.
 				 */
 				if (tpte & PG_W) {
 					allfree = 0;
 					continue;
 				}
 
 				m = PHYS_TO_VM_PAGE(tpte & PG_FRAME);
 				KASSERT(m->phys_addr == (tpte & PG_FRAME),
 				    ("vm_page_t %p phys_addr mismatch %016jx %016jx",
 				    m, (uintmax_t)m->phys_addr,
 				    (uintmax_t)tpte));
 
 				KASSERT(m < &vm_page_array[vm_page_array_size],
 					("pmap_remove_pages: bad tpte %#jx",
 					(uintmax_t)tpte));
 
 				pte_clear(pte);
 
 				/*
 				 * Update the vm_page_t clean/reference bits.
 				 */
 				if ((tpte & (PG_M | PG_RW)) == (PG_M | PG_RW)) {
 					if ((tpte & PG_PS) != 0) {
 						for (mt = m; mt < &m[NBPDR / PAGE_SIZE]; mt++)
 							vm_page_dirty(mt);
 					} else
 						vm_page_dirty(m);
 				}
 
 				/* Mark free */
 				PV_STAT(pv_entry_frees++);
 				PV_STAT(pv_entry_spare++);
 				pv_entry_count--;
 				pc->pc_map[field] |= bitmask;
 				if ((tpte & PG_PS) != 0) {
 					pmap_resident_count_dec(pmap, NBPDR / PAGE_SIZE);
 					pvh = pa_to_pvh(tpte & PG_PS_FRAME);
 					TAILQ_REMOVE(&pvh->pv_list, pv, pv_list);
 					if (TAILQ_EMPTY(&pvh->pv_list)) {
 						for (mt = m; mt < &m[NBPDR / PAGE_SIZE]; mt++)
 							if (TAILQ_EMPTY(&mt->md.pv_list))
 								vm_page_flag_clear(mt, PG_WRITEABLE);
 					}
 					mpte = pmap_lookup_pt_page(pmap, pv->pv_va);
 					if (mpte != NULL) {
 						pmap_remove_pt_page(pmap, mpte);
 						pmap_resident_count_dec(pmap, 1);
 						KASSERT(mpte->wire_count == NPTEPG,
 						    ("pmap_remove_pages: pte page wire count error"));
 						mpte->wire_count = 0;
 						pmap_add_delayed_free_list(mpte, &free, FALSE);
 						atomic_subtract_int(&cnt.v_wire_count, 1);
 					}
 				} else {
 					pmap_resident_count_dec(pmap, 1);
 					TAILQ_REMOVE(&m->md.pv_list, pv, pv_list);
 					if (TAILQ_EMPTY(&m->md.pv_list)) {
 						pvh = pa_to_pvh(VM_PAGE_TO_PHYS(m));
 						if (TAILQ_EMPTY(&pvh->pv_list))
 							vm_page_flag_clear(m, PG_WRITEABLE);
 					}
 				}
 				pmap_unuse_pt(pmap, pv->pv_va, ptepde, &free);
 			}
 		}
 		if (allfree) {
 			PV_STAT(pv_entry_spare -= _NPCPV);
 			PV_STAT(pc_chunk_count--);
 			PV_STAT(pc_chunk_frees++);
 			TAILQ_REMOVE(&pmap->pm_pvchunk, pc, pc_list);
 			m = PHYS_TO_VM_PAGE(DMAP_TO_PHYS((vm_offset_t)pc));
 			dump_drop_page(m->phys_addr);
 			vm_page_unwire(m, 0);
 			vm_page_free(m);
 		}
 	}
 	pmap_invalidate_all(pmap);
 	vm_page_unlock_queues();
 	PMAP_UNLOCK(pmap);
 	pmap_free_zero_pages(free);
 }
 
 /*
  *	pmap_is_modified:
  *
  *	Return whether or not the specified physical page was modified
  *	in any physical maps.
  */
 boolean_t
 pmap_is_modified(vm_page_t m)
 {
 	boolean_t rv;
 
 	KASSERT((m->flags & (PG_FICTITIOUS | PG_UNMANAGED)) == 0,
 	    ("pmap_is_modified: page %p is not managed", m));
 
 	/*
 	 * If the page is not VPO_BUSY, then PG_WRITEABLE cannot be
 	 * concurrently set while the object is locked.  Thus, if PG_WRITEABLE
 	 * is clear, no PTEs can have PG_M set.
 	 */
 	VM_OBJECT_LOCK_ASSERT(m->object, MA_OWNED);
 	if ((m->oflags & VPO_BUSY) == 0 &&
 	    (m->flags & PG_WRITEABLE) == 0)
 		return (FALSE);
 	vm_page_lock_queues();
 	rv = pmap_is_modified_pvh(&m->md) ||
 	    pmap_is_modified_pvh(pa_to_pvh(VM_PAGE_TO_PHYS(m)));
 	vm_page_unlock_queues();
 	return (rv);
 }
 
 /*
  * Returns TRUE if any of the given mappings were used to modify
  * physical memory.  Otherwise, returns FALSE.  Both page and 2mpage
  * mappings are supported.
  */
 static boolean_t
 pmap_is_modified_pvh(struct md_page *pvh)
 {
 	pv_entry_t pv;
 	pt_entry_t *pte;
 	pmap_t pmap;
 	boolean_t rv;
 
 	mtx_assert(&vm_page_queue_mtx, MA_OWNED);
 	rv = FALSE;
 	TAILQ_FOREACH(pv, &pvh->pv_list, pv_list) {
 		pmap = PV_PMAP(pv);
 		PMAP_LOCK(pmap);
 		pte = pmap_pte(pmap, pv->pv_va);
 		rv = (*pte & (PG_M | PG_RW)) == (PG_M | PG_RW);
 		PMAP_UNLOCK(pmap);
 		if (rv)
 			break;
 	}
 	return (rv);
 }
 
 /*
  *	pmap_is_prefaultable:
  *
  *	Return whether or not the specified virtual address is eligible
  *	for prefault.
  */
 boolean_t
 pmap_is_prefaultable(pmap_t pmap, vm_offset_t addr)
 {
 	pd_entry_t *pde;
 	pt_entry_t *pte;
 	boolean_t rv;
 
 	rv = FALSE;
 	PMAP_LOCK(pmap);
 	pde = pmap_pde(pmap, addr);
 	if (pde != NULL && (*pde & (PG_PS | PG_V)) == PG_V) {
 		pte = pmap_pde_to_pte(pde, addr);
 		rv = (*pte & PG_V) == 0;
 	}
 	PMAP_UNLOCK(pmap);
 	return (rv);
 }
 
 /*
  *	pmap_is_referenced:
  *
  *	Return whether or not the specified physical page was referenced
  *	in any physical maps.
  */
 boolean_t
 pmap_is_referenced(vm_page_t m)
 {
 	boolean_t rv;
 
 	KASSERT((m->flags & (PG_FICTITIOUS | PG_UNMANAGED)) == 0,
 	    ("pmap_is_referenced: page %p is not managed", m));
 	vm_page_lock_queues();
 	rv = pmap_is_referenced_pvh(&m->md) ||
 	    pmap_is_referenced_pvh(pa_to_pvh(VM_PAGE_TO_PHYS(m)));
 	vm_page_unlock_queues();
 	return (rv);
 }
 
 /*
  * Returns TRUE if any of the given mappings were referenced and FALSE
  * otherwise.  Both page and 2mpage mappings are supported.
  */
 static boolean_t
 pmap_is_referenced_pvh(struct md_page *pvh)
 {
 	pv_entry_t pv;
 	pt_entry_t *pte;
 	pmap_t pmap;
 	boolean_t rv;
 
 	mtx_assert(&vm_page_queue_mtx, MA_OWNED);
 	rv = FALSE;
 	TAILQ_FOREACH(pv, &pvh->pv_list, pv_list) {
 		pmap = PV_PMAP(pv);
 		PMAP_LOCK(pmap);
 		pte = pmap_pte(pmap, pv->pv_va);
 		rv = (*pte & (PG_A | PG_V)) == (PG_A | PG_V);
 		PMAP_UNLOCK(pmap);
 		if (rv)
 			break;
 	}
 	return (rv);
 }
 
 /*
  * Clear the write and modified bits in each of the given page's mappings.
  */
 void
 pmap_remove_write(vm_page_t m)
 {
 	struct md_page *pvh;
 	pmap_t pmap;
 	pv_entry_t next_pv, pv;
 	pd_entry_t *pde;
 	pt_entry_t oldpte, *pte;
 	vm_offset_t va;
 
 	KASSERT((m->flags & (PG_FICTITIOUS | PG_UNMANAGED)) == 0,
 	    ("pmap_remove_write: page %p is not managed", m));
 
 	/*
 	 * If the page is not VPO_BUSY, then PG_WRITEABLE cannot be set by
 	 * another thread while the object is locked.  Thus, if PG_WRITEABLE
 	 * is clear, no page table entries need updating.
 	 */
 	VM_OBJECT_LOCK_ASSERT(m->object, MA_OWNED);
 	if ((m->oflags & VPO_BUSY) == 0 &&
 	    (m->flags & PG_WRITEABLE) == 0)
 		return;
 	vm_page_lock_queues();
 	pvh = pa_to_pvh(VM_PAGE_TO_PHYS(m));
 	TAILQ_FOREACH_SAFE(pv, &pvh->pv_list, pv_list, next_pv) {
 		pmap = PV_PMAP(pv);
 		PMAP_LOCK(pmap);
 		va = pv->pv_va;
 		pde = pmap_pde(pmap, va);
 		if ((*pde & PG_RW) != 0)
 			(void)pmap_demote_pde(pmap, pde, va);
 		PMAP_UNLOCK(pmap);
 	}
 	TAILQ_FOREACH(pv, &m->md.pv_list, pv_list) {
 		pmap = PV_PMAP(pv);
 		PMAP_LOCK(pmap);
 		pde = pmap_pde(pmap, pv->pv_va);
 		KASSERT((*pde & PG_PS) == 0, ("pmap_remove_write: found"
 		    " a 2mpage in page %p's pv list", m));
 		pte = pmap_pde_to_pte(pde, pv->pv_va);
 retry:
 		oldpte = *pte;
 		if (oldpte & PG_RW) {
 			if (!atomic_cmpset_long(pte, oldpte, oldpte &
 			    ~(PG_RW | PG_M)))
 				goto retry;
 			if ((oldpte & PG_M) != 0)
 				vm_page_dirty(m);
 			pmap_invalidate_page(pmap, pv->pv_va);
 		}
 		PMAP_UNLOCK(pmap);
 	}
 	vm_page_flag_clear(m, PG_WRITEABLE);
 	vm_page_unlock_queues();
 }
 
 /*
  *	pmap_ts_referenced:
  *
  *	Return a count of reference bits for a page, clearing those bits.
  *	It is not necessary for every reference bit to be cleared, but it
  *	is necessary that 0 only be returned when there are truly no
  *	reference bits set.
  *
  *	XXX: The exact number of bits to check and clear is a matter that
  *	should be tested and standardized at some point in the future for
  *	optimal aging of shared pages.
  */
 int
 pmap_ts_referenced(vm_page_t m)
 {
 	struct md_page *pvh;
 	pv_entry_t pv, pvf, pvn;
 	pmap_t pmap;
 	pd_entry_t oldpde, *pde;
 	pt_entry_t *pte;
 	vm_offset_t va;
 	int rtval = 0;
 
 	KASSERT((m->flags & (PG_FICTITIOUS | PG_UNMANAGED)) == 0,
 	    ("pmap_ts_referenced: page %p is not managed", m));
 	pvh = pa_to_pvh(VM_PAGE_TO_PHYS(m));
 	vm_page_lock_queues();
 	TAILQ_FOREACH_SAFE(pv, &pvh->pv_list, pv_list, pvn) {
 		pmap = PV_PMAP(pv);
 		PMAP_LOCK(pmap);
 		va = pv->pv_va;
 		pde = pmap_pde(pmap, va);
 		oldpde = *pde;
 		if ((oldpde & PG_A) != 0) {
 			if (pmap_demote_pde(pmap, pde, va)) {
 				if ((oldpde & PG_W) == 0) {
 					/*
 					 * Remove the mapping to a single page
 					 * so that a subsequent access may
 					 * repromote.  Since the underlying
 					 * page table page is fully populated,
 					 * this removal never frees a page
 					 * table page.
 					 */
 					va += VM_PAGE_TO_PHYS(m) - (oldpde &
 					    PG_PS_FRAME);
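 					/*
 					 * For instance (made-up values, and
 					 * assuming "va" started at the base
 					 * of the 2MB mapping): if the 2MB
 					 * frame begins at physical address
 					 * 0x40000000 and "m" is at
 					 * 0x40003000, "va" advances by
 					 * 0x3000 to the 4KB mapping of "m".
 					 */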
 					pmap_remove_page(pmap, va, pde, NULL);
 					rtval++;
 					if (rtval > 4) {
 						PMAP_UNLOCK(pmap);
 						goto out;
 					}
 				}
 			}
 		}
 		PMAP_UNLOCK(pmap);
 	}
 	if ((pv = TAILQ_FIRST(&m->md.pv_list)) != NULL) {
 		pvf = pv;
 		do {
 			pvn = TAILQ_NEXT(pv, pv_list);
 			TAILQ_REMOVE(&m->md.pv_list, pv, pv_list);
 			TAILQ_INSERT_TAIL(&m->md.pv_list, pv, pv_list);
 			pmap = PV_PMAP(pv);
 			PMAP_LOCK(pmap);
 			pde = pmap_pde(pmap, pv->pv_va);
 			KASSERT((*pde & PG_PS) == 0, ("pmap_ts_referenced:"
 			    " found a 2mpage in page %p's pv list", m));
 			pte = pmap_pde_to_pte(pde, pv->pv_va);
 			if ((*pte & PG_A) != 0) {
 				atomic_clear_long(pte, PG_A);
 				pmap_invalidate_page(pmap, pv->pv_va);
 				rtval++;
 				if (rtval > 4)
 					pvn = NULL;
 			}
 			PMAP_UNLOCK(pmap);
 		} while ((pv = pvn) != NULL && pv != pvf);
 	}
 out:
 	vm_page_unlock_queues();
 	return (rtval);
 }
 
 /*
  *	Clear the modify bits on the specified physical page.
  */
 void
 pmap_clear_modify(vm_page_t m)
 {
 	struct md_page *pvh;
 	pmap_t pmap;
 	pv_entry_t next_pv, pv;
 	pd_entry_t oldpde, *pde;
 	pt_entry_t oldpte, *pte;
 	vm_offset_t va;
 
 	KASSERT((m->flags & (PG_FICTITIOUS | PG_UNMANAGED)) == 0,
 	    ("pmap_clear_modify: page %p is not managed", m));
 	VM_OBJECT_LOCK_ASSERT(m->object, MA_OWNED);
 	KASSERT((m->oflags & VPO_BUSY) == 0,
 	    ("pmap_clear_modify: page %p is busy", m));
 
 	/*
 	 * If the page is not PG_WRITEABLE, then no PTEs can have PG_M set.
 	 * If the object containing the page is locked and the page is not
 	 * VPO_BUSY, then PG_WRITEABLE cannot be concurrently set.
 	 */
 	if ((m->flags & PG_WRITEABLE) == 0)
 		return;
 	vm_page_lock_queues();
 	pvh = pa_to_pvh(VM_PAGE_TO_PHYS(m));
 	TAILQ_FOREACH_SAFE(pv, &pvh->pv_list, pv_list, next_pv) {
 		pmap = PV_PMAP(pv);
 		PMAP_LOCK(pmap);
 		va = pv->pv_va;
 		pde = pmap_pde(pmap, va);
 		oldpde = *pde;
 		if ((oldpde & PG_RW) != 0) {
 			if (pmap_demote_pde(pmap, pde, va)) {
 				if ((oldpde & PG_W) == 0) {
 					/*
 					 * Write protect the mapping to a
 					 * single page so that a subsequent
 					 * write access may repromote.
 					 */
 					va += VM_PAGE_TO_PHYS(m) - (oldpde &
 					    PG_PS_FRAME);
 					pte = pmap_pde_to_pte(pde, va);
 					oldpte = *pte;
 					if ((oldpte & PG_V) != 0) {
 						while (!atomic_cmpset_long(pte,
 						    oldpte,
 						    oldpte & ~(PG_M | PG_RW)))
 							oldpte = *pte;
 						vm_page_dirty(m);
 						pmap_invalidate_page(pmap, va);
 					}
 				}
 			}
 		}
 		PMAP_UNLOCK(pmap);
 	}
 	TAILQ_FOREACH(pv, &m->md.pv_list, pv_list) {
 		pmap = PV_PMAP(pv);
 		PMAP_LOCK(pmap);
 		pde = pmap_pde(pmap, pv->pv_va);
 		KASSERT((*pde & PG_PS) == 0, ("pmap_clear_modify: found"
 		    " a 2mpage in page %p's pv list", m));
 		pte = pmap_pde_to_pte(pde, pv->pv_va);
 		if ((*pte & (PG_M | PG_RW)) == (PG_M | PG_RW)) {
 			atomic_clear_long(pte, PG_M);
 			pmap_invalidate_page(pmap, pv->pv_va);
 		}
 		PMAP_UNLOCK(pmap);
 	}
 	vm_page_unlock_queues();
 }
 
 /*
  *	pmap_clear_reference:
  *
  *	Clear the reference bit on the specified physical page.
  */
 void
 pmap_clear_reference(vm_page_t m)
 {
 	struct md_page *pvh;
 	pmap_t pmap;
 	pv_entry_t next_pv, pv;
 	pd_entry_t oldpde, *pde;
 	pt_entry_t *pte;
 	vm_offset_t va;
 
 	KASSERT((m->flags & (PG_FICTITIOUS | PG_UNMANAGED)) == 0,
 	    ("pmap_clear_reference: page %p is not managed", m));
 	vm_page_lock_queues();
 	pvh = pa_to_pvh(VM_PAGE_TO_PHYS(m));
 	TAILQ_FOREACH_SAFE(pv, &pvh->pv_list, pv_list, next_pv) {
 		pmap = PV_PMAP(pv);
 		PMAP_LOCK(pmap);
 		va = pv->pv_va;
 		pde = pmap_pde(pmap, va);
 		oldpde = *pde;
 		if ((oldpde & PG_A) != 0) {
 			if (pmap_demote_pde(pmap, pde, va)) {
 				/*
 				 * Remove the mapping to a single page so
 				 * that a subsequent access may repromote.
 				 * Since the underlying page table page is
 				 * fully populated, this removal never frees
 				 * a page table page.
 				 */
 				va += VM_PAGE_TO_PHYS(m) - (oldpde &
 				    PG_PS_FRAME);
 				pmap_remove_page(pmap, va, pde, NULL);
 			}
 		}
 		PMAP_UNLOCK(pmap);
 	}
 	TAILQ_FOREACH(pv, &m->md.pv_list, pv_list) {
 		pmap = PV_PMAP(pv);
 		PMAP_LOCK(pmap);
 		pde = pmap_pde(pmap, pv->pv_va);
 		KASSERT((*pde & PG_PS) == 0, ("pmap_clear_reference: found"
 		    " a 2mpage in page %p's pv list", m));
 		pte = pmap_pde_to_pte(pde, pv->pv_va);
 		if (*pte & PG_A) {
 			atomic_clear_long(pte, PG_A);
 			pmap_invalidate_page(pmap, pv->pv_va);
 		}
 		PMAP_UNLOCK(pmap);
 	}
 	vm_page_unlock_queues();
 }
 
 /*
  * Miscellaneous support routines follow
  */
 
 /* Adjust the cache mode for a 4KB page mapped via a PTE. */
 static __inline void
 pmap_pte_attr(pt_entry_t *pte, int cache_bits)
 {
 	u_int opte, npte;
 
 	/*
 	 * The cache mode bits are all in the low 32-bits of the
 	 * PTE, so we can just spin on updating the low 32-bits.
 	 */
 	do {
 		opte = *(u_int *)pte;
 		npte = opte & ~PG_PTE_CACHE;
 		npte |= cache_bits;
 	} while (npte != opte && !atomic_cmpset_int((u_int *)pte, opte, npte));
 }
 
 /* Adjust the cache mode for a 2MB page mapped via a PDE. */
 static __inline void
 pmap_pde_attr(pd_entry_t *pde, int cache_bits)
 {
 	u_int opde, npde;
 
 	/*
 	 * The cache mode bits are all in the low 32-bits of the
 	 * PDE, so we can just spin on updating the low 32-bits.
 	 */
 	do {
 		opde = *(u_int *)pde;
 		npde = opde & ~PG_PDE_CACHE;
 		npde |= cache_bits;
 	} while (npde != opde && !atomic_cmpset_int((u_int *)pde, opde, npde));
 }
 
 /*
  * Map a set of physical memory pages into the kernel virtual
  * address space. Return a pointer to where it is mapped. This
  * routine is intended to be used for mapping device memory,
  * NOT real memory.
  */
 void *
 pmap_mapdev_attr(vm_paddr_t pa, vm_size_t size, int mode)
 {
 	vm_offset_t va, offset;
 	vm_size_t tmpsize;
 
 	/*
 	 * If the specified range of physical addresses fits within the direct
 	 * map window, use the direct map.
 	 */
 	if (pa < dmaplimit && pa + size < dmaplimit) {
 		va = PHYS_TO_DMAP(pa);
 		if (!pmap_change_attr(va, size, mode))
 			return ((void *)va);
 	}
 	offset = pa & PAGE_MASK;
 	size = roundup(offset + size, PAGE_SIZE);
 	va = kmem_alloc_nofault(kernel_map, size);
 	if (!va)
 		panic("pmap_mapdev: Couldn't alloc kernel virtual memory");
 	pa = trunc_page(pa);
 	for (tmpsize = 0; tmpsize < size; tmpsize += PAGE_SIZE)
 		pmap_kenter_attr(va + tmpsize, pa + tmpsize, mode);
 	pmap_invalidate_range(kernel_pmap, va, va + tmpsize);
 	pmap_invalidate_cache_range(va, va + tmpsize);
 	return ((void *)(va + offset));
 }
 
 void *
 pmap_mapdev(vm_paddr_t pa, vm_size_t size)
 {
 
 	return (pmap_mapdev_attr(pa, size, PAT_UNCACHEABLE));
 }
 
 void *
 pmap_mapbios(vm_paddr_t pa, vm_size_t size)
 {
 
 	return (pmap_mapdev_attr(pa, size, PAT_WRITE_BACK));
 }
 
 void
 pmap_unmapdev(vm_offset_t va, vm_size_t size)
 {
 	vm_offset_t base, offset, tmpva;
 
 	/* If we gave a direct map region in pmap_mapdev, do nothing */
 	if (va >= DMAP_MIN_ADDRESS && va < DMAP_MAX_ADDRESS)
 		return;
 	base = trunc_page(va);
 	offset = va & PAGE_MASK;
 	size = roundup(offset + size, PAGE_SIZE);
 	for (tmpva = base; tmpva < (base + size); tmpva += PAGE_SIZE)
 		pmap_kremove(tmpva);
 	pmap_invalidate_range(kernel_pmap, va, tmpva);
 	kmem_free(kernel_map, base, size);
 }
 
 /*
  * Tries to demote a 1GB page mapping.
  */
 static boolean_t
 pmap_demote_pdpe(pmap_t pmap, pdp_entry_t *pdpe, vm_offset_t va)
 {
 	pdp_entry_t newpdpe, oldpdpe;
 	pd_entry_t *firstpde, newpde, *pde;
 	vm_paddr_t mpdepa;
 	vm_page_t mpde;
 
 	PMAP_LOCK_ASSERT(pmap, MA_OWNED);
 	oldpdpe = *pdpe;
 	KASSERT((oldpdpe & (PG_PS | PG_V)) == (PG_PS | PG_V),
 	    ("pmap_demote_pdpe: oldpdpe is missing PG_PS and/or PG_V"));
 	if ((mpde = vm_page_alloc(NULL, va >> PDPSHIFT, VM_ALLOC_INTERRUPT |
 	    VM_ALLOC_NOOBJ | VM_ALLOC_WIRED)) == NULL) {
 		CTR2(KTR_PMAP, "pmap_demote_pdpe: failure for va %#lx"
 		    " in pmap %p", va, pmap);
 		return (FALSE);
 	}
 	mpdepa = VM_PAGE_TO_PHYS(mpde);
 	firstpde = (pd_entry_t *)PHYS_TO_DMAP(mpdepa);
 	newpdpe = mpdepa | PG_M | PG_A | (oldpdpe & PG_U) | PG_RW | PG_V;
 	KASSERT((oldpdpe & PG_A) != 0,
 	    ("pmap_demote_pdpe: oldpdpe is missing PG_A"));
 	KASSERT((oldpdpe & (PG_M | PG_RW)) != PG_RW,
 	    ("pmap_demote_pdpe: oldpdpe is missing PG_M"));
 	newpde = oldpdpe;
 
 	/*
 	 * Initialize the page directory page.
 	 */
 	for (pde = firstpde; pde < firstpde + NPDEPG; pde++) {
 		*pde = newpde;
 		newpde += NBPDR;
 	}
 
 	/*
 	 * Demote the mapping.
 	 */
 	*pdpe = newpdpe;
 
 	/*
 	 * Invalidate a stale recursive mapping of the page directory page.
 	 */
 	pmap_invalidate_page(pmap, (vm_offset_t)vtopde(va));
 
 	pmap_pdpe_demotions++;
 	CTR2(KTR_PMAP, "pmap_demote_pdpe: success for va %#lx"
 	    " in pmap %p", va, pmap);
 	return (TRUE);
 }
 
 /*
  * Sets the memory attribute for the specified page.
  */
 void
 pmap_page_set_memattr(vm_page_t m, vm_memattr_t ma)
 {
 
 	m->md.pat_mode = ma;
 
 	/*
 	 * If "m" is a normal page, update its direct mapping.  This update
 	 * can be relied upon to perform any cache operations that are
 	 * required for data coherence.
 	 */
 	if ((m->flags & PG_FICTITIOUS) == 0 &&
 	    pmap_change_attr(PHYS_TO_DMAP(VM_PAGE_TO_PHYS(m)), PAGE_SIZE,
 	    m->md.pat_mode))
 		panic("memory attribute change on the direct map failed");
 }
 
 /*
  * Changes the specified virtual address range's memory type to that given by
  * the parameter "mode".  The specified virtual address range must be
  * completely contained within either the direct map or the kernel map.  If
  * the virtual address range is contained within the kernel map, then the
  * memory type for each of the corresponding ranges of the direct map is also
  * changed.  (The corresponding ranges of the direct map are those ranges that
  * map the same physical pages as the specified virtual address range.)  These
  * changes to the direct map are necessary because Intel describes the
  * behavior of their processors as "undefined" if two or more mappings to the
  * same physical page have different memory types.
  *
  * Returns zero if the change completed successfully, and either EINVAL or
  * ENOMEM if the change failed.  Specifically, EINVAL is returned if some part
  * of the virtual address range was not mapped, and ENOMEM is returned if
  * there was insufficient memory available to complete the change.  In the
  * latter case, the memory type may have been changed on some part of the
  * virtual address range or the direct map.
  */
 int
 pmap_change_attr(vm_offset_t va, vm_size_t size, int mode)
 {
 	int error;
 
 	PMAP_LOCK(kernel_pmap);
 	error = pmap_change_attr_locked(va, size, mode);
 	PMAP_UNLOCK(kernel_pmap);
 	return (error);
 }
 
 static int
 pmap_change_attr_locked(vm_offset_t va, vm_size_t size, int mode)
 {
 	vm_offset_t base, offset, tmpva;
 	vm_paddr_t pa_start, pa_end;
 	pdp_entry_t *pdpe;
 	pd_entry_t *pde;
 	pt_entry_t *pte;
 	int cache_bits_pte, cache_bits_pde, error;
 	boolean_t changed;
 
 	PMAP_LOCK_ASSERT(kernel_pmap, MA_OWNED);
 	base = trunc_page(va);
 	offset = va & PAGE_MASK;
 	size = roundup(offset + size, PAGE_SIZE);
 
 	/*
 	 * Only supported on kernel virtual addresses, including the direct
 	 * map but excluding the recursive map.
 	 */
 	if (base < DMAP_MIN_ADDRESS)
 		return (EINVAL);
 
 	cache_bits_pde = pmap_cache_bits(mode, 1);
 	cache_bits_pte = pmap_cache_bits(mode, 0);
 	changed = FALSE;
 
 	/*
 	 * Pages that aren't mapped aren't supported.  Also break down 1GB pages
 	 * into 2MB pages and 2MB pages into 4KB pages if required.
 	 */
 	for (tmpva = base; tmpva < base + size; ) {
 		pdpe = pmap_pdpe(kernel_pmap, tmpva);
 		if (*pdpe == 0)
 			return (EINVAL);
 		if (*pdpe & PG_PS) {
 			/*
 			 * If the current 1GB page already has the required
 			 * memory type, then we need not demote this page. Just
 			 * increment tmpva to the next 1GB page frame.
 			 */
 			if ((*pdpe & PG_PDE_CACHE) == cache_bits_pde) {
 				tmpva = trunc_1gpage(tmpva) + NBPDP;
 				continue;
 			}
 
 			/*
 			 * If the current offset aligns with a 1GB page frame
 			 * and there is at least 1GB left within the range, then
 			 * we need not break down this page into 2MB pages.
 			 */
 			if ((tmpva & PDPMASK) == 0 &&
 			    tmpva + PDPMASK < base + size) {
 				tmpva += NBPDP;
 				continue;
 			}
 			if (!pmap_demote_pdpe(kernel_pmap, pdpe, tmpva))
 				return (ENOMEM);
 		}
 		pde = pmap_pdpe_to_pde(pdpe, tmpva);
 		if (*pde == 0)
 			return (EINVAL);
 		if (*pde & PG_PS) {
 			/*
 			 * If the current 2MB page already has the required
 			 * memory type, then we need not demote this page. Just
 			 * increment tmpva to the next 2MB page frame.
 			 */
 			if ((*pde & PG_PDE_CACHE) == cache_bits_pde) {
 				tmpva = trunc_2mpage(tmpva) + NBPDR;
 				continue;
 			}
 
 			/*
 			 * If the current offset aligns with a 2MB page frame
 			 * and there is at least 2MB left within the range, then
 			 * we need not break down this page into 4KB pages.
 			 */
 			if ((tmpva & PDRMASK) == 0 &&
 			    tmpva + PDRMASK < base + size) {
 				tmpva += NBPDR;
 				continue;
 			}
 			if (!pmap_demote_pde(kernel_pmap, pde, tmpva))
 				return (ENOMEM);
 		}
 		pte = pmap_pde_to_pte(pde, tmpva);
 		if (*pte == 0)
 			return (EINVAL);
 		tmpva += PAGE_SIZE;
 	}
 	error = 0;
 
 	/*
 	 * Ok, all the pages exist, so run through them updating their
 	 * cache mode if required.
 	 */
 	pa_start = pa_end = 0;
 	for (tmpva = base; tmpva < base + size; ) {
 		pdpe = pmap_pdpe(kernel_pmap, tmpva);
 		if (*pdpe & PG_PS) {
 			if ((*pdpe & PG_PDE_CACHE) != cache_bits_pde) {
 				pmap_pde_attr(pdpe, cache_bits_pde);
 				changed = TRUE;
 			}
 			if (tmpva >= VM_MIN_KERNEL_ADDRESS) {
 				if (pa_start == pa_end) {
 					/* Start physical address run. */
 					pa_start = *pdpe & PG_PS_FRAME;
 					pa_end = pa_start + NBPDP;
 				} else if (pa_end == (*pdpe & PG_PS_FRAME))
 					pa_end += NBPDP;
 				else {
 					/* Run ended, update direct map. */
 					error = pmap_change_attr_locked(
 					    PHYS_TO_DMAP(pa_start),
 					    pa_end - pa_start, mode);
 					if (error != 0)
 						break;
 					/* Start physical address run. */
 					pa_start = *pdpe & PG_PS_FRAME;
 					pa_end = pa_start + NBPDP;
 				}
 			}
 			tmpva = trunc_1gpage(tmpva) + NBPDP;
 			continue;
 		}
 		pde = pmap_pdpe_to_pde(pdpe, tmpva);
 		if (*pde & PG_PS) {
 			if ((*pde & PG_PDE_CACHE) != cache_bits_pde) {
 				pmap_pde_attr(pde, cache_bits_pde);
 				changed = TRUE;
 			}
 			if (tmpva >= VM_MIN_KERNEL_ADDRESS) {
 				if (pa_start == pa_end) {
 					/* Start physical address run. */
 					pa_start = *pde & PG_PS_FRAME;
 					pa_end = pa_start + NBPDR;
 				} else if (pa_end == (*pde & PG_PS_FRAME))
 					pa_end += NBPDR;
 				else {
 					/* Run ended, update direct map. */
 					error = pmap_change_attr_locked(
 					    PHYS_TO_DMAP(pa_start),
 					    pa_end - pa_start, mode);
 					if (error != 0)
 						break;
 					/* Start physical address run. */
 					pa_start = *pde & PG_PS_FRAME;
 					pa_end = pa_start + NBPDR;
 				}
 			}
 			tmpva = trunc_2mpage(tmpva) + NBPDR;
 		} else {
 			pte = pmap_pde_to_pte(pde, tmpva);
 			if ((*pte & PG_PTE_CACHE) != cache_bits_pte) {
 				pmap_pte_attr(pte, cache_bits_pte);
 				changed = TRUE;
 			}
 			if (tmpva >= VM_MIN_KERNEL_ADDRESS) {
 				if (pa_start == pa_end) {
 					/* Start physical address run. */
 					pa_start = *pte & PG_FRAME;
 					pa_end = pa_start + PAGE_SIZE;
 				} else if (pa_end == (*pte & PG_FRAME))
 					pa_end += PAGE_SIZE;
 				else {
 					/* Run ended, update direct map. */
 					error = pmap_change_attr_locked(
 					    PHYS_TO_DMAP(pa_start),
 					    pa_end - pa_start, mode);
 					if (error != 0)
 						break;
 					/* Start physical address run. */
 					pa_start = *pte & PG_FRAME;
 					pa_end = pa_start + PAGE_SIZE;
 				}
 			}
 			tmpva += PAGE_SIZE;
 		}
 	}
 	if (error == 0 && pa_start != pa_end)
 		error = pmap_change_attr_locked(PHYS_TO_DMAP(pa_start),
 		    pa_end - pa_start, mode);
 
 	/*
 	 * Flush CPU caches if required to make sure any data isn't cached that
 	 * shouldn't be, etc.
 	 */
 	if (changed) {
 		pmap_invalidate_range(kernel_pmap, base, tmpva);
 		pmap_invalidate_cache_range(base, tmpva);
 	}
 	return (error);
 }
 
 /*
  * Demotes any mapping within the direct map region that covers more than the
  * specified range of physical addresses.  This range's size must be a power
  * of two and its starting address must be a multiple of its size.  Since the
  * demotion does not change any attributes of the mapping, a TLB invalidation
  * is not mandatory.  The caller may, however, request a TLB invalidation.
  */
 void
 pmap_demote_DMAP(vm_paddr_t base, vm_size_t len, boolean_t invalidate)
 {
 	pdp_entry_t *pdpe;
 	pd_entry_t *pde;
 	vm_offset_t va;
 	boolean_t changed;
 
 	if (len == 0)
 		return;
 	KASSERT(powerof2(len), ("pmap_demote_DMAP: len is not a power of 2"));
 	KASSERT((base & (len - 1)) == 0,
 	    ("pmap_demote_DMAP: base is not a multiple of len"));
 	if (len < NBPDP && base < dmaplimit) {
 		va = PHYS_TO_DMAP(base);
 		changed = FALSE;
 		PMAP_LOCK(kernel_pmap);
 		pdpe = pmap_pdpe(kernel_pmap, va);
 		if ((*pdpe & PG_V) == 0)
 			panic("pmap_demote_DMAP: invalid PDPE");
 		if ((*pdpe & PG_PS) != 0) {
 			if (!pmap_demote_pdpe(kernel_pmap, pdpe, va))
 				panic("pmap_demote_DMAP: PDPE failed");
 			changed = TRUE;
 		}
 		if (len < NBPDR) {
 			pde = pmap_pdpe_to_pde(pdpe, va);
 			if ((*pde & PG_V) == 0)
 				panic("pmap_demote_DMAP: invalid PDE");
 			if ((*pde & PG_PS) != 0) {
 				if (!pmap_demote_pde(kernel_pmap, pde, va))
 					panic("pmap_demote_DMAP: PDE failed");
 				changed = TRUE;
 			}
 		}
 		if (changed && invalidate)
 			pmap_invalidate_page(kernel_pmap, va);
 		PMAP_UNLOCK(kernel_pmap);
 	}
 }
 
 /*
  * perform the pmap work for mincore
  */
 int
 pmap_mincore(pmap_t pmap, vm_offset_t addr, vm_paddr_t *locked_pa)
 {
 	pd_entry_t *pdep;
 	pt_entry_t pte;
 	vm_paddr_t pa;
 	int val;
 
 	PMAP_LOCK(pmap);
 retry:
 	pdep = pmap_pde(pmap, addr);
 	if (pdep != NULL && (*pdep & PG_V)) {
 		if (*pdep & PG_PS) {
 			pte = *pdep;
 			/* Compute the physical address of the 4KB page. */
 			pa = ((*pdep & PG_PS_FRAME) | (addr & PDRMASK)) &
 			    PG_FRAME;
 			val = MINCORE_SUPER;
 		} else {
 			pte = *pmap_pde_to_pte(pdep, addr);
 			pa = pte & PG_FRAME;
 			val = 0;
 		}
 	} else {
 		pte = 0;
 		pa = 0;
 		val = 0;
 	}
 	if ((pte & PG_V) != 0) {
 		val |= MINCORE_INCORE;
 		if ((pte & (PG_M | PG_RW)) == (PG_M | PG_RW))
 			val |= MINCORE_MODIFIED | MINCORE_MODIFIED_OTHER;
 		if ((pte & PG_A) != 0)
 			val |= MINCORE_REFERENCED | MINCORE_REFERENCED_OTHER;
 	}
 	if ((val & (MINCORE_MODIFIED_OTHER | MINCORE_REFERENCED_OTHER)) !=
 	    (MINCORE_MODIFIED_OTHER | MINCORE_REFERENCED_OTHER) &&
 	    (pte & (PG_MANAGED | PG_V)) == (PG_MANAGED | PG_V)) {
 		/* Ensure that "PHYS_TO_VM_PAGE(pa)->object" doesn't change. */
 		if (vm_page_pa_tryrelock(pmap, pa, locked_pa))
 			goto retry;
 	} else
 		PA_UNLOCK_COND(*locked_pa);
 	PMAP_UNLOCK(pmap);
 	return (val);
 }
 
 void
 pmap_activate(struct thread *td)
 {
 	pmap_t	pmap, oldpmap;
 	u_int64_t  cr3;
 
 	critical_enter();
 	pmap = vmspace_pmap(td->td_proc->p_vmspace);
 	oldpmap = PCPU_GET(curpmap);
 #ifdef SMP
-	atomic_clear_int(&oldpmap->pm_active, PCPU_GET(cpumask));
-	atomic_set_int(&pmap->pm_active, PCPU_GET(cpumask));
+	CPU_NAND_ATOMIC(&oldpmap->pm_active, PCPU_PTR(cpumask));
+	CPU_OR_ATOMIC(&pmap->pm_active, PCPU_PTR(cpumask));
 #else
-	oldpmap->pm_active &= ~PCPU_GET(cpumask);
-	pmap->pm_active |= PCPU_GET(cpumask);
+	CPU_NAND(&oldpmap->pm_active, PCPU_PTR(cpumask));
+	CPU_OR(&pmap->pm_active, PCPU_PTR(cpumask));
 #endif
 	cr3 = DMAP_TO_PHYS((vm_offset_t)pmap->pm_pml4);
 	td->td_pcb->pcb_cr3 = cr3;
 	load_cr3(cr3);
 	PCPU_SET(curpmap, pmap);
 	critical_exit();
 }
 
 void
 pmap_sync_icache(pmap_t pm, vm_offset_t va, vm_size_t sz)
 {
 }
 
 /*
  *	Increase the starting virtual address of the given mapping if a
  *	different alignment might result in more superpage mappings.
  */
 void
 pmap_align_superpage(vm_object_t object, vm_ooffset_t offset,
     vm_offset_t *addr, vm_size_t size)
 {
 	vm_offset_t superpage_offset;
 
 	if (size < NBPDR)
 		return;
 	if (object != NULL && (object->flags & OBJ_COLORED) != 0)
 		offset += ptoa(object->pg_color);
 	superpage_offset = offset & PDRMASK;
 	if (size - ((NBPDR - superpage_offset) & PDRMASK) < NBPDR ||
 	    (*addr & PDRMASK) == superpage_offset)
 		return;
 	if ((*addr & PDRMASK) < superpage_offset)
 		*addr = (*addr & ~PDRMASK) + superpage_offset;
 	else
 		*addr = ((*addr + PDRMASK) & ~PDRMASK) + superpage_offset;
 }
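Note on the pmap_activate() hunk above: the integer bit operations on
pm_active are replaced by set operations on cpuset_t.  The following is a
minimal, portable sketch of that bookkeeping, assuming that CPU_NAND(d, s)
clears in d every bit that is set in s and that CPU_OR(d, s) sets those bits.
The demo_* names and DEMO_MAXCPU are hypothetical stand-ins for illustration
only; they are not the kernel's <sys/cpuset.h> definitions.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical stand-in for cpuset_t: a fixed-size bitmap that can describe
// more CPUs than a single unsigned int could.
#define DEMO_MAXCPU	128
#define DEMO_BITS	64
#define DEMO_WORDS	(DEMO_MAXCPU / DEMO_BITS)

typedef struct {
	uint64_t bits[DEMO_WORDS];
} demo_cpuset_t;

// Assumed semantics of CPU_OR(d, s): d |= s, word by word.
static void
demo_cpu_or(demo_cpuset_t *d, const demo_cpuset_t *s)
{
	for (size_t i = 0; i < DEMO_WORDS; i++)
		d->bits[i] |= s->bits[i];
}

// Assumed semantics of CPU_NAND(d, s): d &= ~s, word by word.
static void
demo_cpu_nand(demo_cpuset_t *d, const demo_cpuset_t *s)
{
	for (size_t i = 0; i < DEMO_WORDS; i++)
		d->bits[i] &= ~s->bits[i];
}

// Reports whether the given CPU is a member of the set.
static bool
demo_cpu_isset(unsigned int cpu, const demo_cpuset_t *p)
{
	return (((p->bits[cpu / DEMO_BITS] >> (cpu % DEMO_BITS)) & 1) != 0);
}

// Mirrors the non-SMP branch of pmap_activate(): the current CPU leaves the
// old address space's active set and joins the new one.
static void
demo_switch_active(demo_cpuset_t *old_active, demo_cpuset_t *new_active,
    const demo_cpuset_t *self)
{
	demo_cpu_nand(old_active, self);
	demo_cpu_or(new_active, self);
}
```

The sketch is not atomic.  The SMP branch of the hunk uses CPU_NAND_ATOMIC and
CPU_OR_ATOMIC instead, presumably because other CPUs may update pm_active
concurrently.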
Index: head/sys/amd64/amd64/vm_machdep.c
===================================================================
--- head/sys/amd64/amd64/vm_machdep.c	(revision 222812)
+++ head/sys/amd64/amd64/vm_machdep.c	(revision 222813)
@@ -1,682 +1,691 @@
 /*-
  * Copyright (c) 1982, 1986 The Regents of the University of California.
  * Copyright (c) 1989, 1990 William Jolitz
  * Copyright (c) 1994 John Dyson
  * All rights reserved.
  *
  * This code is derived from software contributed to Berkeley by
  * the Systems Programming Group of the University of Utah Computer
  * Science Department, and William Jolitz.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
  * 1. Redistributions of source code must retain the above copyright
  *    notice, this list of conditions and the following disclaimer.
  * 2. Redistributions in binary form must reproduce the above copyright
  *    notice, this list of conditions and the following disclaimer in the
  *    documentation and/or other materials provided with the distribution.
  * 3. All advertising materials mentioning features or use of this software
  *    must display the following acknowledgement:
  *	This product includes software developed by the University of
  *	California, Berkeley and its contributors.
  * 4. Neither the name of the University nor the names of its contributors
  *    may be used to endorse or promote products derived from this software
  *    without specific prior written permission.
  *
  * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
  * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
  * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
  * ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
  * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
  * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
  * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
  * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
  * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
  * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
  * SUCH DAMAGE.
  *
  *	from: @(#)vm_machdep.c	7.3 (Berkeley) 5/13/91
  *	Utah $Hdr: vm_machdep.c 1.16.1.1 89/06/23$
  */
 
 #include <sys/cdefs.h>
 __FBSDID("$FreeBSD$");
 
 #include "opt_isa.h"
 #include "opt_cpu.h"
 #include "opt_compat.h"
 
 #include <sys/param.h>
 #include <sys/systm.h>
 #include <sys/bio.h>
 #include <sys/buf.h>
 #include <sys/kernel.h>
 #include <sys/ktr.h>
 #include <sys/lock.h>
 #include <sys/malloc.h>
 #include <sys/mbuf.h>
 #include <sys/mutex.h>
 #include <sys/pioctl.h>
 #include <sys/proc.h>
+#include <sys/sched.h>
 #include <sys/sf_buf.h>
 #include <sys/smp.h>
 #include <sys/sysctl.h>
 #include <sys/sysent.h>
 #include <sys/unistd.h>
 #include <sys/vnode.h>
 #include <sys/vmmeter.h>
 
 #include <machine/cpu.h>
 #include <machine/md_var.h>
 #include <machine/pcb.h>
+#include <machine/smp.h>
 #include <machine/specialreg.h>
 #include <machine/tss.h>
 
 #include <vm/vm.h>
 #include <vm/vm_extern.h>
 #include <vm/vm_kern.h>
 #include <vm/vm_page.h>
 #include <vm/vm_map.h>
 #include <vm/vm_param.h>
 
 #include <x86/isa/isa.h>
 
 static void	cpu_reset_real(void);
 #ifdef SMP
 static void	cpu_reset_proxy(void);
 static u_int	cpu_reset_proxyid;
 static volatile u_int	cpu_reset_proxy_active;
 #endif
 
 /*
  * Finish a fork operation, with process p2 nearly set up.
  * Copy and update the pcb, set up the stack so that the child
  * is ready to run and return to user mode.
  */
 void
 cpu_fork(td1, p2, td2, flags)
 	register struct thread *td1;
 	register struct proc *p2;
 	struct thread *td2;
 	int flags;
 {
 	register struct proc *p1;
 	struct pcb *pcb2;
 	struct mdproc *mdp1, *mdp2;
 	struct proc_ldt *pldt;
 	pmap_t pmap2;
 
 	p1 = td1->td_proc;
 	if ((flags & RFPROC) == 0) {
 		if ((flags & RFMEM) == 0) {
 			/* unshare user LDT */
 			mdp1 = &p1->p_md;
 			mtx_lock(&dt_lock);
 			if ((pldt = mdp1->md_ldt) != NULL &&
 			    pldt->ldt_refcnt > 1 &&
 			    user_ldt_alloc(p1, 1) == NULL)
 				panic("could not copy LDT");
 			mtx_unlock(&dt_lock);
 		}
 		return;
 	}
 
 	/* Ensure that td1's pcb is up to date. */
 	fpuexit(td1);
 
 	/* Point the pcb to the top of the stack */
 	pcb2 = (struct pcb *)(td2->td_kstack +
 	    td2->td_kstack_pages * PAGE_SIZE) - 1;
 	td2->td_pcb = pcb2;
 
 	/* Copy td1's pcb */
 	bcopy(td1->td_pcb, pcb2, sizeof(*pcb2));
 
 	/* Properly initialize pcb_save */
 	pcb2->pcb_save = &pcb2->pcb_user_save;
 
 	/* Point mdproc and then copy over td1's contents */
 	mdp2 = &p2->p_md;
 	bcopy(&p1->p_md, mdp2, sizeof(*mdp2));
 
 	/*
 	 * Create a new fresh stack for the new process.
 	 * Copy the trap frame for the return to user mode as if from a
 	 * syscall.  This copies most of the user mode register values.
 	 */
 	td2->td_frame = (struct trapframe *)td2->td_pcb - 1;
 	bcopy(td1->td_frame, td2->td_frame, sizeof(struct trapframe));
 
 	td2->td_frame->tf_rax = 0;		/* Child returns zero */
 	td2->td_frame->tf_rflags &= ~PSL_C;	/* success */
 	td2->td_frame->tf_rdx = 1;
 
 	/*
 	 * If the parent process has the trap bit set (i.e. a debugger had
 	 * single stepped the process to the system call), we need to clear
 	 * the trap flag from the new frame unless the debugger had set PF_FORK
 	 * on the parent.  Otherwise, the child will receive a (likely
 	 * unexpected) SIGTRAP when it executes the first instruction after
 	 * returning to userland.
 	 */
 	if ((p1->p_pfsflags & PF_FORK) == 0)
 		td2->td_frame->tf_rflags &= ~PSL_T;
 
 	/*
 	 * Set registers for trampoline to user mode.  Leave space for the
 	 * return address on stack.  These are the kernel mode register values.
 	 */
 	pmap2 = vmspace_pmap(p2->p_vmspace);
 	pcb2->pcb_cr3 = DMAP_TO_PHYS((vm_offset_t)pmap2->pm_pml4);
 	pcb2->pcb_r12 = (register_t)fork_return;	/* fork_trampoline argument */
 	pcb2->pcb_rbp = 0;
 	pcb2->pcb_rsp = (register_t)td2->td_frame - sizeof(void *);
 	pcb2->pcb_rbx = (register_t)td2;		/* fork_trampoline argument */
 	pcb2->pcb_rip = (register_t)fork_trampoline;
 	/*-
 	 * pcb2->pcb_dr*:	cloned above.
 	 * pcb2->pcb_savefpu:	cloned above.
 	 * pcb2->pcb_flags:	cloned above.
 	 * pcb2->pcb_onfault:	cloned above (always NULL here?).
 	 * pcb2->pcb_[fg]sbase:	cloned above
 	 */
 
 	/* Setup to release spin count in fork_exit(). */
 	td2->td_md.md_spinlock_count = 1;
 	td2->td_md.md_saved_flags = PSL_KERNEL | PSL_I;
 
 	/* As on i386, do not copy the I/O permission bitmap. */
 	pcb2->pcb_tssp = NULL;
 
 	/* New segment registers. */
 	set_pcb_flags(pcb2, PCB_FULL_IRET);
 
 	/* Copy the LDT, if necessary. */
 	mdp1 = &td1->td_proc->p_md;
 	mdp2 = &p2->p_md;
 	mtx_lock(&dt_lock);
 	if (mdp1->md_ldt != NULL) {
 		if (flags & RFMEM) {
 			mdp1->md_ldt->ldt_refcnt++;
 			mdp2->md_ldt = mdp1->md_ldt;
 			bcopy(&mdp1->md_ldt_sd, &mdp2->md_ldt_sd, sizeof(struct
 			    system_segment_descriptor));
 		} else {
 			mdp2->md_ldt = NULL;
 			mdp2->md_ldt = user_ldt_alloc(p2, 0);
 			if (mdp2->md_ldt == NULL)
 				panic("could not copy LDT");
 			amd64_set_ldt_data(td2, 0, max_ldt_segment,
 			    (struct user_segment_descriptor *)
 			    mdp1->md_ldt->ldt_base);
 		}
 	} else
 		mdp2->md_ldt = NULL;
 	mtx_unlock(&dt_lock);
 
 	/*
 	 * Now, cpu_switch() can schedule the new process.
 	 * pcb_rsp is loaded pointing to the cpu_switch() stack frame
 	 * containing the return address when exiting cpu_switch.
 	 * This will normally be to fork_trampoline(), which will have
 	 * %rbx loaded with the new thread's pointer.  fork_trampoline()
 	 * will set up a stack to call fork_return(p, frame); to complete
 	 * the return to user-mode.
 	 */
 }
 
 /*
  * Intercept the return address from a freshly forked process that has NOT
  * been scheduled yet.
  *
  * This is needed to make kernel threads stay in kernel mode.
  */
 void
 cpu_set_fork_handler(td, func, arg)
 	struct thread *td;
 	void (*func)(void *);
 	void *arg;
 {
 	/*
 	 * Note that the trap frame follows the args, so the function
 	 * is really called like this:  func(arg, frame);
 	 */
 	td->td_pcb->pcb_r12 = (long) func;	/* function */
 	td->td_pcb->pcb_rbx = (long) arg;	/* first arg */
 }
 
 void
 cpu_exit(struct thread *td)
 {
 
 	/*
 	 * If this process has a custom LDT, release it.
 	 */
 	mtx_lock(&dt_lock);
 	if (td->td_proc->p_md.md_ldt != 0)
 		user_ldt_free(td);
 	else
 		mtx_unlock(&dt_lock);
 }
 
 void
 cpu_thread_exit(struct thread *td)
 {
 	struct pcb *pcb;
 
 	critical_enter();
 	if (td == PCPU_GET(fpcurthread))
 		fpudrop();
 	critical_exit();
 
 	pcb = td->td_pcb;
 
 	/* Disable any hardware breakpoints. */
 	if (pcb->pcb_flags & PCB_DBREGS) {
 		reset_dbregs();
 		clear_pcb_flags(pcb, PCB_DBREGS);
 	}
 }
 
 void
 cpu_thread_clean(struct thread *td)
 {
 	struct pcb *pcb;
 
 	pcb = td->td_pcb;
 
 	/*
 	 * Clean TSS/iomap
 	 */
 	if (pcb->pcb_tssp != NULL) {
 		kmem_free(kernel_map, (vm_offset_t)pcb->pcb_tssp,
 		    ctob(IOPAGES + 1));
 		pcb->pcb_tssp = NULL;
 	}
 }
 
 void
 cpu_thread_swapin(struct thread *td)
 {
 }
 
 void
 cpu_thread_swapout(struct thread *td)
 {
 }
 
 void
 cpu_thread_alloc(struct thread *td)
 {
 
 	td->td_pcb = (struct pcb *)(td->td_kstack +
 	    td->td_kstack_pages * PAGE_SIZE) - 1;
 	td->td_frame = (struct trapframe *)td->td_pcb - 1;
 	td->td_pcb->pcb_save = &td->td_pcb->pcb_user_save;
 }
 
 void
 cpu_thread_free(struct thread *td)
 {
 
 	cpu_thread_clean(td);
 }
 
 void
 cpu_set_syscall_retval(struct thread *td, int error)
 {
 
 	switch (error) {
 	case 0:
 		td->td_frame->tf_rax = td->td_retval[0];
 		td->td_frame->tf_rdx = td->td_retval[1];
 		td->td_frame->tf_rflags &= ~PSL_C;
 		break;
 
 	case ERESTART:
 		/*
 		 * Reconstruct pc, we know that 'syscall' is 2 bytes,
 		 * lcall $X,y is 7 bytes, int 0x80 is 2 bytes.
 		 * We saved this in tf_err.
 		 * %r10 (which was holding the value of %rcx) is restored
 		 * for the next iteration.
 		 * %r10 restore is only required for freebsd/amd64 processes,
 		 * but shall be innocent for any ia32 ABI.
 		 */
 		td->td_frame->tf_rip -= td->td_frame->tf_err;
 		td->td_frame->tf_r10 = td->td_frame->tf_rcx;
 		break;
 
 	case EJUSTRETURN:
 		break;
 
 	default:
 		if (td->td_proc->p_sysent->sv_errsize) {
 			if (error >= td->td_proc->p_sysent->sv_errsize)
 				error = -1;	/* XXX */
 			else
 				error = td->td_proc->p_sysent->sv_errtbl[error];
 		}
 		td->td_frame->tf_rax = error;
 		td->td_frame->tf_rflags |= PSL_C;
 		break;
 	}
 }
 
 /*
  * Initialize machine state (pcb and trap frame) for a new thread about to
  * upcall.  Put enough state in the new thread's PCB to get it to go back to
  * userret(), where we can intercept it again to set the return (upcall)
  * address and stack, along with those from upcalls that are from other sources
  * such as those generated in thread_userret() itself.
  */
 void
 cpu_set_upcall(struct thread *td, struct thread *td0)
 {
 	struct pcb *pcb2;
 
 	/* Point the pcb to the top of the stack. */
 	pcb2 = td->td_pcb;
 
 	/*
 	 * Copy the upcall pcb.  This loads kernel regs.
 	 * Those not loaded individually below get their default
 	 * values here.
 	 */
 	bcopy(td0->td_pcb, pcb2, sizeof(*pcb2));
 	clear_pcb_flags(pcb2, PCB_FPUINITDONE | PCB_USERFPUINITDONE);
 	pcb2->pcb_save = &pcb2->pcb_user_save;
 	set_pcb_flags(pcb2, PCB_FULL_IRET);
 
 	/*
 	 * Create a new fresh stack for the new thread.
 	 */
 	bcopy(td0->td_frame, td->td_frame, sizeof(struct trapframe));
 
 	/* If the current thread has the trap bit set (i.e. a debugger had
 	 * single stepped the process to the system call), we need to clear
 	 * the trap flag from the new frame. Otherwise, the new thread will
 	 * receive a (likely unexpected) SIGTRAP when it executes the first
 	 * instruction after returning to userland.
 	 */
 	td->td_frame->tf_rflags &= ~PSL_T;
 
 	/*
 	 * Set registers for trampoline to user mode.  Leave space for the
 	 * return address on stack.  These are the kernel mode register values.
 	 */
 	pcb2->pcb_r12 = (register_t)fork_return;	    /* trampoline arg */
 	pcb2->pcb_rbp = 0;
 	pcb2->pcb_rsp = (register_t)td->td_frame - sizeof(void *);	/* trampoline arg */
 	pcb2->pcb_rbx = (register_t)td;			    /* trampoline arg */
 	pcb2->pcb_rip = (register_t)fork_trampoline;
 	/*
 	 * If we didn't copy the pcb, we'd need to do the following registers:
 	 * pcb2->pcb_cr3:	cloned above.
 	 * pcb2->pcb_dr*:	cloned above.
 	 * pcb2->pcb_savefpu:	cloned above.
 	 * pcb2->pcb_onfault:	cloned above (always NULL here?).
 	 * pcb2->pcb_[fg]sbase: cloned above
 	 */
 
 	/* Setup to release spin count in fork_exit(). */
 	td->td_md.md_spinlock_count = 1;
 	td->td_md.md_saved_flags = PSL_KERNEL | PSL_I;
 }
 
 /*
  * Set that machine state for performing an upcall that has to
  * be done in thread_userret() so that those upcalls generated
  * in thread_userret() itself can be done as well.
  */
 void
 cpu_set_upcall_kse(struct thread *td, void (*entry)(void *), void *arg,
 	stack_t *stack)
 {
 
 	/* 
 	 * Do any extra cleaning that needs to be done.
 	 * The thread may have optional components
 	 * that are not present in a fresh thread.
 	 * This may be a recycled thread so make it look
 	 * as though it's newly allocated.
 	 */
 	cpu_thread_clean(td);
 
 #ifdef COMPAT_FREEBSD32
 	if (SV_PROC_FLAG(td->td_proc, SV_ILP32)) {
 		/*
 	 	 * Set the trap frame to point at the beginning of the uts
 		 * function.
 		 */
 		td->td_frame->tf_rbp = 0;
 		td->td_frame->tf_rsp =
 		   (((uintptr_t)stack->ss_sp + stack->ss_size - 4) & ~0x0f) - 4;
 		td->td_frame->tf_rip = (uintptr_t)entry;
 
 		/*
 		 * Pass the address of the mailbox for this kse to the uts
 		 * function as a parameter on the stack.
 		 */
 		suword32((void *)(td->td_frame->tf_rsp + sizeof(int32_t)),
 		    (uint32_t)(uintptr_t)arg);
 
 		return;
 	}
 #endif
 
 	/*
 	 * Set the trap frame to point at the beginning of the uts
 	 * function.
 	 */
 	td->td_frame->tf_rbp = 0;
 	td->td_frame->tf_rsp =
 	    ((register_t)stack->ss_sp + stack->ss_size) & ~0x0f;
 	td->td_frame->tf_rsp -= 8;
 	td->td_frame->tf_rip = (register_t)entry;
 	td->td_frame->tf_ds = _udatasel;
 	td->td_frame->tf_es = _udatasel;
 	td->td_frame->tf_fs = _ufssel;
 	td->td_frame->tf_gs = _ugssel;
 	td->td_frame->tf_flags = TF_HASSEGS;
 
 	/*
 	 * Pass the address of the mailbox for this kse to the uts
 	 * function as a parameter on the stack.
 	 */
 	td->td_frame->tf_rdi = (register_t)arg;
 }
 
 int
 cpu_set_user_tls(struct thread *td, void *tls_base)
 {
 	struct pcb *pcb;
 
 	if ((u_int64_t)tls_base >= VM_MAXUSER_ADDRESS)
 		return (EINVAL);
 
 	pcb = td->td_pcb;
 #ifdef COMPAT_FREEBSD32
 	if (SV_PROC_FLAG(td->td_proc, SV_ILP32)) {
 		pcb->pcb_gsbase = (register_t)tls_base;
 		return (0);
 	}
 #endif
 	pcb->pcb_fsbase = (register_t)tls_base;
 	set_pcb_flags(pcb, PCB_FULL_IRET);
 	return (0);
 }
 
 #ifdef SMP
 static void
 cpu_reset_proxy()
 {
+	cpuset_t tcrp;
 
 	cpu_reset_proxy_active = 1;
 	while (cpu_reset_proxy_active == 1)
 		;	/* Wait for other cpu to see that we've started */
-	stop_cpus((1<<cpu_reset_proxyid));
+	CPU_SETOF(cpu_reset_proxyid, &tcrp);
+	stop_cpus(tcrp);
 	printf("cpu_reset_proxy: Stopped CPU %d\n", cpu_reset_proxyid);
 	DELAY(1000000);
 	cpu_reset_real();
 }
 #endif
 
 void
 cpu_reset()
 {
 #ifdef SMP
-	cpumask_t map;
+	cpuset_t map;
 	u_int cnt;
 
 	if (smp_active) {
-		map = PCPU_GET(other_cpus) & ~stopped_cpus;
-		if (map != 0) {
+		sched_pin();
+		map = PCPU_GET(other_cpus);
+		CPU_NAND(&map, &stopped_cpus);
+		if (!CPU_EMPTY(&map)) {
 			printf("cpu_reset: Stopping other CPUs\n");
 			stop_cpus(map);
 		}
 
 		if (PCPU_GET(cpuid) != 0) {
 			cpu_reset_proxyid = PCPU_GET(cpuid);
+			sched_unpin();
 			cpustop_restartfunc = cpu_reset_proxy;
 			cpu_reset_proxy_active = 0;
 			printf("cpu_reset: Restarting BSP\n");
 
 			/* Restart CPU #0. */
-			atomic_store_rel_int(&started_cpus, 1 << 0);
+			CPU_SETOF(0, &started_cpus);
+			wmb();
 
 			cnt = 0;
 			while (cpu_reset_proxy_active == 0 && cnt < 10000000)
 				cnt++;	/* Wait for BSP to announce restart */
 			if (cpu_reset_proxy_active == 0)
 				printf("cpu_reset: Failed to restart BSP\n");
 			enable_intr();
 			cpu_reset_proxy_active = 2;
 
 			while (1);
 			/* NOTREACHED */
-		}
+		} else
+			sched_unpin();
 
 		DELAY(1000000);
 	}
 #endif
 	cpu_reset_real();
 	/* NOTREACHED */
 }
 
 static void
 cpu_reset_real()
 {
 	struct region_descriptor null_idt;
 	int b;
 
 	disable_intr();
 
 	/*
 	 * Attempt to do a CPU reset via the keyboard controller,
 	 * do not turn off GateA20, as any machine that fails
 	 * to do the reset here would then end up in no man's land.
 	 */
 	outb(IO_KBD + 4, 0xFE);
 	DELAY(500000);	/* wait 0.5 sec to see if that did it */
 
 	/*
 	 * Attempt to force a reset via the Reset Control register at
 	 * I/O port 0xcf9.  Bit 2 forces a system reset when it
 	 * transitions from 0 to 1.  Bit 1 selects the type of reset
 	 * to attempt: 0 selects a "soft" reset, and 1 selects a
 	 * "hard" reset.  We try a "hard" reset.  The first write sets
 	 * bit 1 to select a "hard" reset and clears bit 2.  The
 	 * second write forces a 0 -> 1 transition in bit 2 to trigger
 	 * a reset.
 	 */
 	outb(0xcf9, 0x2);
 	outb(0xcf9, 0x6);
 	DELAY(500000);  /* wait 0.5 sec to see if that did it */
 
 	/*
 	 * Attempt to force a reset via the Fast A20 and Init register
 	 * at I/O port 0x92.  Bit 1 serves as an alternate A20 gate.
 	 * Bit 0 asserts INIT# when set to 1.  We are careful to only
 	 * preserve bit 1 while setting bit 0.  We also must clear bit
 	 * 0 before setting it if it isn't already clear.
 	 */
 	b = inb(0x92);
 	if (b != 0xff) {
 		if ((b & 0x1) != 0)
 			outb(0x92, b & 0xfe);
 		outb(0x92, b | 0x1);
 		DELAY(500000);  /* wait 0.5 sec to see if that did it */
 	}
 
 	printf("No known reset method worked, attempting CPU shutdown\n");
 	DELAY(1000000);	/* wait 1 sec for printf to complete */
 
 	/* Wipe the IDT. */
 	null_idt.rd_limit = 0;
 	null_idt.rd_base = 0;
 	lidt(&null_idt);
 
 	/* "good night, sweet prince .... <THUNK!>" */
 	breakpoint();
 
 	/* NOTREACHED */
 	while(1);
 }
 
 /*
  * Allocate an sf_buf for the given vm_page.  On this machine, however, there
  * is no sf_buf object.  Instead, an opaque pointer to the given vm_page is
  * returned.
  */
 struct sf_buf *
 sf_buf_alloc(struct vm_page *m, int pri)
 {
 
 	return ((struct sf_buf *)m);
 }
 
 /*
  * Free the sf_buf.  In fact, do nothing because there are no resources
  * associated with the sf_buf.
  */
 void
 sf_buf_free(struct sf_buf *sf)
 {
 }
 
 /*
  * Software interrupt handler for queued VM system processing.
  */   
 void  
 swi_vm(void *dummy) 
 {     
 	if (busdma_swi_pending != 0)
 		busdma_swi();
 }
 
 /*
  * Tell whether this address is in some physical memory region.
  * Currently used by the kernel coredump code in order to avoid
  * dumping the ``ISA memory hole'' which could cause indefinite hangs,
  * or other unpredictable behaviour.
  */
 
 int
 is_physical_memory(vm_paddr_t addr)
 {
 
 #ifdef DEV_ISA
 	/* The ISA ``memory hole''. */
 	if (addr >= 0xa0000 && addr < 0x100000)
 		return 0;
 #endif
 
 	/*
 	 * stuff other tests for known memory-mapped devices (PCI?)
 	 * here
 	 */
 
 	return 1;
 }
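Note on the cpu_reset() and cpu_reset_proxy() hunks above: expressions such as
(1 << cpu_reset_proxyid) and "map != 0" only work while a CPU mask fits in one
integer, and the _types.h hunk below shows that __cpumask_t was an unsigned
int on amd64.  The following is a rough sketch of the replacement idiom,
assuming that CPU_SETOF(n, p) clears the set and then sets only bit n, that
CPU_EMPTY(p) is true when no bit is set, and that CPU_NAND(d, s) computes
d &= ~s.  The sketch_* names and SKETCH_MAXCPU are hypothetical and exist only
for this illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

// Hypothetical stand-in for cpuset_t.
#define SKETCH_MAXCPU	128
#define SKETCH_BITS	64
#define SKETCH_WORDS	(SKETCH_MAXCPU / SKETCH_BITS)

typedef struct {
	uint64_t bits[SKETCH_WORDS];
} sketch_cpuset_t;

// Assumed semantics of CPU_SETOF(n, p): empty the set, then add CPU n only.
static void
sketch_cpu_setof(unsigned int cpu, sketch_cpuset_t *p)
{
	memset(p, 0, sizeof(*p));
	p->bits[cpu / SKETCH_BITS] = UINT64_C(1) << (cpu % SKETCH_BITS);
}

// Assumed semantics of CPU_EMPTY(p): true when no CPU is in the set.
static bool
sketch_cpu_empty(const sketch_cpuset_t *p)
{
	for (size_t i = 0; i < SKETCH_WORDS; i++)
		if (p->bits[i] != 0)
			return (false);
	return (true);
}

// Assumed semantics of CPU_NAND(d, s): d &= ~s, word by word.
static void
sketch_cpu_nand(sketch_cpuset_t *d, const sketch_cpuset_t *s)
{
	for (size_t i = 0; i < SKETCH_WORDS; i++)
		d->bits[i] &= ~s->bits[i];
}

// Mirrors the cpu_reset() hunk: start from the other CPUs, drop those that
// are already stopped, and report whether anything is left to stop.
static bool
sketch_cpus_to_stop(const sketch_cpuset_t *other,
    const sketch_cpuset_t *stopped, sketch_cpuset_t *map)
{
	*map = *other;
	sketch_cpu_nand(map, stopped);
	return (!sketch_cpu_empty(map));
}
```

Because the set is a structure rather than a scalar, it is copied by
assignment and tested with a helper, which matches the shape of the new
cpu_reset() code (copy other_cpus, CPU_NAND, then CPU_EMPTY).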
Index: head/sys/amd64/include/_types.h
===================================================================
--- head/sys/amd64/include/_types.h	(revision 222812)
+++ head/sys/amd64/include/_types.h	(revision 222813)
@@ -1,117 +1,116 @@
 /*-
  * Copyright (c) 2002 Mike Barcroft <mike@FreeBSD.org>
  * Copyright (c) 1990, 1993
  *	The Regents of the University of California.  All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
  * 1. Redistributions of source code must retain the above copyright
  *    notice, this list of conditions and the following disclaimer.
  * 2. Redistributions in binary form must reproduce the above copyright
  *    notice, this list of conditions and the following disclaimer in the
  *    documentation and/or other materials provided with the distribution.
  * 3. All advertising materials mentioning features or use of this software
  *    must display the following acknowledgement:
  *	This product includes software developed by the University of
  *	California, Berkeley and its contributors.
  * 4. Neither the name of the University nor the names of its contributors
  *    may be used to endorse or promote products derived from this software
  *    without specific prior written permission.
  *
  * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
  * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
  * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
  * ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
  * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
  * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
  * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
  * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
  * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
  * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
  * SUCH DAMAGE.
  *
  *	From: @(#)ansi.h	8.2 (Berkeley) 1/4/94
  *	From: @(#)types.h	8.3 (Berkeley) 1/5/94
  * $FreeBSD$
  */
 
 #ifndef _MACHINE__TYPES_H_
 #define	_MACHINE__TYPES_H_
 
 #ifndef _SYS_CDEFS_H_
 #error this file needs sys/cdefs.h as a prerequisite
 #endif
 
 #define __NO_STRICT_ALIGNMENT
 
 /*
  * Basic types upon which most other types are built.
  */
 typedef	__signed char		__int8_t;
 typedef	unsigned char		__uint8_t;
 typedef	short			__int16_t;
 typedef	unsigned short		__uint16_t;
 typedef	int			__int32_t;
 typedef	unsigned int		__uint32_t;
 typedef	long			__int64_t;
 typedef	unsigned long		__uint64_t;
 
 /*
  * Standard type definitions.
  */
 typedef	__int32_t	__clock_t;		/* clock()... */
-typedef	unsigned int	__cpumask_t;
 typedef	__int64_t	__critical_t;
 typedef	double		__double_t;
 typedef	float		__float_t;
 typedef	__int64_t	__intfptr_t;
 typedef	__int64_t	__intmax_t;
 typedef	__int64_t	__intptr_t;
 typedef	__int32_t	__int_fast8_t;
 typedef	__int32_t	__int_fast16_t;
 typedef	__int32_t	__int_fast32_t;
 typedef	__int64_t	__int_fast64_t;
 typedef	__int8_t	__int_least8_t;
 typedef	__int16_t	__int_least16_t;
 typedef	__int32_t	__int_least32_t;
 typedef	__int64_t	__int_least64_t;
 typedef	__int64_t	__ptrdiff_t;		/* ptr1 - ptr2 */
 typedef	__int64_t	__register_t;
 typedef	__int64_t	__segsz_t;		/* segment size (in pages) */
 typedef	__uint64_t	__size_t;		/* sizeof() */
 typedef	__int64_t	__ssize_t;		/* byte count or error */
 typedef	__int64_t	__time_t;		/* time()... */
 typedef	__uint64_t	__uintfptr_t;
 typedef	__uint64_t	__uintmax_t;
 typedef	__uint64_t	__uintptr_t;
 typedef	__uint32_t	__uint_fast8_t;
 typedef	__uint32_t	__uint_fast16_t;
 typedef	__uint32_t	__uint_fast32_t;
 typedef	__uint64_t	__uint_fast64_t;
 typedef	__uint8_t	__uint_least8_t;
 typedef	__uint16_t	__uint_least16_t;
 typedef	__uint32_t	__uint_least32_t;
 typedef	__uint64_t	__uint_least64_t;
 typedef	__uint64_t	__u_register_t;
 typedef	__uint64_t	__vm_offset_t;
 typedef	__int64_t	__vm_ooffset_t;
 typedef	__uint64_t	__vm_paddr_t;
 typedef	__uint64_t	__vm_pindex_t;
 typedef	__uint64_t	__vm_size_t;
 
 /*
  * Unusual type definitions.
  */
 #ifdef __GNUCLIKE_BUILTIN_VARARGS
 typedef	__builtin_va_list	__va_list;	/* internally known to gcc */
 #elif defined(lint)
 typedef	char *			__va_list;	/* pretend */
 #endif
 #if defined(__GNUC_VA_LIST_COMPATIBILITY) && !defined(__GNUC_VA_LIST) \
     && !defined(__NO_GNUC_VA_LIST)
 #define __GNUC_VA_LIST
 typedef __va_list		__gnuc_va_list;	/* compatibility w/GNU headers*/
 #endif
 
 #endif /* !_MACHINE__TYPES_H_ */
Index: head/sys/amd64/include/pmap.h
===================================================================
--- head/sys/amd64/include/pmap.h	(revision 222812)
+++ head/sys/amd64/include/pmap.h	(revision 222813)
@@ -1,338 +1,339 @@
 /*-
  * Copyright (c) 2003 Peter Wemm.
  * Copyright (c) 1991 Regents of the University of California.
  * All rights reserved.
  *
  * This code is derived from software contributed to Berkeley by
  * the Systems Programming Group of the University of Utah Computer
  * Science Department and William Jolitz of UUNET Technologies Inc.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
  * 1. Redistributions of source code must retain the above copyright
  *    notice, this list of conditions and the following disclaimer.
  * 2. Redistributions in binary form must reproduce the above copyright
  *    notice, this list of conditions and the following disclaimer in the
  *    documentation and/or other materials provided with the distribution.
  * 4. Neither the name of the University nor the names of its contributors
  *    may be used to endorse or promote products derived from this software
  *    without specific prior written permission.
  *
  * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
  * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
  * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
  * ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
  * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
  * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
  * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
  * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
  * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
  * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
  * SUCH DAMAGE.
  *
  * Derived from hp300 version by Mike Hibler, this version by William
  * Jolitz uses a recursive map [a pde points to the page directory] to
  * map the page tables using the pagetables themselves. This is done to
  * reduce the impact on kernel virtual memory for lots of sparse address
  * space, and to reduce the cost of memory to each process.
  *
  *	from: hp300: @(#)pmap.h	7.2 (Berkeley) 12/16/90
  *	from: @(#)pmap.h	7.4 (Berkeley) 5/12/91
  * $FreeBSD$
  */
 
 #ifndef _MACHINE_PMAP_H_
 #define	_MACHINE_PMAP_H_
 
 /*
  * Page-directory and page-table entries follow this format, with a few
  * of the fields not present here and there, depending on a lot of things.
  */
 				/* ---- Intel Nomenclature ---- */
 #define	PG_V		0x001	/* P	Valid			*/
 #define PG_RW		0x002	/* R/W	Read/Write		*/
 #define PG_U		0x004	/* U/S  User/Supervisor		*/
 #define	PG_NC_PWT	0x008	/* PWT	Write through		*/
 #define	PG_NC_PCD	0x010	/* PCD	Cache disable		*/
 #define PG_A		0x020	/* A	Accessed		*/
 #define	PG_M		0x040	/* D	Dirty			*/
 #define	PG_PS		0x080	/* PS	Page size (0=4k,1=2M)	*/
 #define	PG_PTE_PAT	0x080	/* PAT	PAT index		*/
 #define	PG_G		0x100	/* G	Global			*/
 #define	PG_AVAIL1	0x200	/*    /	Available for system	*/
 #define	PG_AVAIL2	0x400	/*   <	programmers use		*/
 #define	PG_AVAIL3	0x800	/*    \				*/
 #define	PG_PDE_PAT	0x1000	/* PAT	PAT index		*/
 #define	PG_NX		(1ul<<63) /* No-execute */
 
 
 /* Our various interpretations of the above */
 #define PG_W		PG_AVAIL1	/* "Wired" pseudoflag */
 #define	PG_MANAGED	PG_AVAIL2
 #define	PG_FRAME	(0x000ffffffffff000ul)
 #define	PG_PS_FRAME	(0x000fffffffe00000ul)
 #define	PG_PROT		(PG_RW|PG_U)	/* all protection bits . */
 #define PG_N		(PG_NC_PWT|PG_NC_PCD)	/* Non-cacheable */
 
 /* Page level cache control fields used to determine the PAT type */
 #define PG_PDE_CACHE	(PG_PDE_PAT | PG_NC_PWT | PG_NC_PCD)
 #define PG_PTE_CACHE	(PG_PTE_PAT | PG_NC_PWT | PG_NC_PCD)
 
 /*
  * Promotion to a 2MB (PDE) page mapping requires that the corresponding 4KB
  * (PTE) page mappings have identical settings for the following fields:
  */
 #define	PG_PTE_PROMOTE	(PG_NX | PG_MANAGED | PG_W | PG_G | PG_PTE_PAT | \
 	    PG_M | PG_A | PG_NC_PCD | PG_NC_PWT | PG_U | PG_RW | PG_V)
 
 /*
  * Page Protection Exception bits
  */
 
 #define PGEX_P		0x01	/* Protection violation vs. not present */
 #define PGEX_W		0x02	/* during a Write cycle */
 #define PGEX_U		0x04	/* access from User mode (UPL) */
 #define PGEX_RSV	0x08	/* reserved PTE field is non-zero */
 #define PGEX_I		0x10	/* during an instruction fetch */
 
 /*
  * Pte related macros.  This is complicated by having to deal with
  * the sign extension of the 48th bit.
  */
 #define KVADDR(l4, l3, l2, l1) ( \
 	((unsigned long)-1 << 47) | \
 	((unsigned long)(l4) << PML4SHIFT) | \
 	((unsigned long)(l3) << PDPSHIFT) | \
 	((unsigned long)(l2) << PDRSHIFT) | \
 	((unsigned long)(l1) << PAGE_SHIFT))
 
 #define UVADDR(l4, l3, l2, l1) ( \
 	((unsigned long)(l4) << PML4SHIFT) | \
 	((unsigned long)(l3) << PDPSHIFT) | \
 	((unsigned long)(l2) << PDRSHIFT) | \
 	((unsigned long)(l1) << PAGE_SHIFT))
 
 /* Initial number of kernel page tables. */
 #ifndef NKPT
 #define	NKPT		32
 #endif
 
 #define NKPML4E		1		/* number of kernel PML4 slots */
 #define NKPDPE		howmany(NKPT, NPDEPG)/* number of kernel PDP slots */
 
 #define	NUPML4E		(NPML4EPG/2)	/* number of userland PML4 pages */
 #define	NUPDPE		(NUPML4E*NPDPEPG)/* number of userland PDP pages */
 #define	NUPDE		(NUPDPE*NPDEPG)	/* number of userland PD entries */
 
 /*
  * NDMPML4E is the number of PML4 entries that are used to implement the
  * direct map.  It must be a power of two.
  */
 #define	NDMPML4E	2
 
 /*
  * The *PDI values control the layout of virtual memory.  The starting address
  * of the direct map, which is controlled by DMPML4I, must be a multiple of
  * its size.  (See the PHYS_TO_DMAP() and DMAP_TO_PHYS() macros.)
  */
 #define	PML4PML4I	(NPML4EPG/2)	/* Index of recursive pml4 mapping */
 
 #define	KPML4I		(NPML4EPG-1)	/* Top 512GB for KVM */
 #define	DMPML4I		rounddown(KPML4I - NDMPML4E, NDMPML4E) /* Below KVM */
 
 #define	KPDPI		(NPDPEPG-2)	/* kernbase at -2GB */
 
 /*
  * XXX doesn't really belong here I guess...
  */
 #define ISA_HOLE_START    0xa0000
 #define ISA_HOLE_LENGTH (0x100000-ISA_HOLE_START)
 
 #ifndef LOCORE
 
 #include <sys/queue.h>
+#include <sys/_cpuset.h>
 #include <sys/_lock.h>
 #include <sys/_mutex.h>
 
 typedef u_int64_t pd_entry_t;
 typedef u_int64_t pt_entry_t;
 typedef u_int64_t pdp_entry_t;
 typedef u_int64_t pml4_entry_t;
 
 #define	PML4ESHIFT	(3)
 #define	PDPESHIFT	(3)
 #define	PTESHIFT	(3)
 #define	PDESHIFT	(3)
 
 /*
  * Address of current address space page table maps and directories.
  */
 #ifdef _KERNEL
 #define	addr_PTmap	(KVADDR(PML4PML4I, 0, 0, 0))
 #define	addr_PDmap	(KVADDR(PML4PML4I, PML4PML4I, 0, 0))
 #define	addr_PDPmap	(KVADDR(PML4PML4I, PML4PML4I, PML4PML4I, 0))
 #define	addr_PML4map	(KVADDR(PML4PML4I, PML4PML4I, PML4PML4I, PML4PML4I))
 #define	addr_PML4pml4e	(addr_PML4map + (PML4PML4I * sizeof(pml4_entry_t)))
 #define	PTmap		((pt_entry_t *)(addr_PTmap))
 #define	PDmap		((pd_entry_t *)(addr_PDmap))
 #define	PDPmap		((pd_entry_t *)(addr_PDPmap))
 #define	PML4map		((pd_entry_t *)(addr_PML4map))
 #define	PML4pml4e	((pd_entry_t *)(addr_PML4pml4e))
 
 extern u_int64_t KPDPphys;	/* physical address of kernel level 3 */
 extern u_int64_t KPML4phys;	/* physical address of kernel level 4 */
 
 /*
  * virtual address to page table entry and
  * to physical address.
  * Note: these work recursively, thus vtopte of a pte will give
  * the corresponding pde that in turn maps it.
  */
 pt_entry_t *vtopte(vm_offset_t);
 #define	vtophys(va)	pmap_kextract(((vm_offset_t) (va)))
 
 static __inline pt_entry_t
 pte_load(pt_entry_t *ptep)
 {
 	pt_entry_t r;
 
 	r = *ptep;
 	return (r);
 }
 
 static __inline pt_entry_t
 pte_load_store(pt_entry_t *ptep, pt_entry_t pte)
 {
 	pt_entry_t r;
 
 	__asm __volatile(
 	    "xchgq %0,%1"
 	    : "=m" (*ptep),
 	      "=r" (r)
 	    : "1" (pte),
 	      "m" (*ptep));
 	return (r);
 }
 
 #define	pte_load_clear(pte)	atomic_readandclear_long(pte)
 
 static __inline void
 pte_store(pt_entry_t *ptep, pt_entry_t pte)
 {
 
 	*ptep = pte;
 }
 
 #define	pte_clear(ptep)		pte_store((ptep), (pt_entry_t)0ULL)
 
 #define	pde_store(pdep, pde)	pte_store((pdep), (pde))
 
 extern pt_entry_t pg_nx;
 
 #endif /* _KERNEL */
 
 /*
  * Pmap stuff
  */
 struct	pv_entry;
 struct	pv_chunk;
 
 struct md_page {
 	TAILQ_HEAD(,pv_entry)	pv_list;
 	int			pat_mode;
 };
 
 /*
  * The kernel virtual address (KVA) of the level 4 page table page is always
  * within the direct map (DMAP) region.
  */
 struct pmap {
 	struct mtx		pm_mtx;
 	pml4_entry_t		*pm_pml4;	/* KVA of level 4 page table */
 	TAILQ_HEAD(,pv_chunk)	pm_pvchunk;	/* list of mappings in pmap */
-	cpumask_t		pm_active;	/* active on cpus */
+	cpuset_t		pm_active;	/* active on cpus */
 	/* spare u_int here due to padding */
 	struct pmap_statistics	pm_stats;	/* pmap statistics */
 	vm_page_t		pm_root;	/* spare page table pages */
 };
 
 typedef struct pmap	*pmap_t;
 
 #ifdef _KERNEL
 extern struct pmap	kernel_pmap_store;
 #define kernel_pmap	(&kernel_pmap_store)
 
 #define	PMAP_LOCK(pmap)		mtx_lock(&(pmap)->pm_mtx)
 #define	PMAP_LOCK_ASSERT(pmap, type) \
 				mtx_assert(&(pmap)->pm_mtx, (type))
 #define	PMAP_LOCK_DESTROY(pmap)	mtx_destroy(&(pmap)->pm_mtx)
 #define	PMAP_LOCK_INIT(pmap)	mtx_init(&(pmap)->pm_mtx, "pmap", \
 				    NULL, MTX_DEF | MTX_DUPOK)
 #define	PMAP_LOCKED(pmap)	mtx_owned(&(pmap)->pm_mtx)
 #define	PMAP_MTX(pmap)		(&(pmap)->pm_mtx)
 #define	PMAP_TRYLOCK(pmap)	mtx_trylock(&(pmap)->pm_mtx)
 #define	PMAP_UNLOCK(pmap)	mtx_unlock(&(pmap)->pm_mtx)
 #endif
 
 /*
  * For each vm_page_t, there is a list of all currently valid virtual
  * mappings of that page.  An entry is a pv_entry_t, the list is pv_list.
  */
 typedef struct pv_entry {
 	vm_offset_t	pv_va;		/* virtual address for mapping */
 	TAILQ_ENTRY(pv_entry)	pv_list;
 } *pv_entry_t;
 
 /*
  * pv_entries are allocated in chunks per-process.  This avoids the
  * need to track per-pmap assignments.
  */
 #define	_NPCM	3
 #define	_NPCPV	168
 struct pv_chunk {
 	pmap_t			pc_pmap;
 	TAILQ_ENTRY(pv_chunk)	pc_list;
 	uint64_t		pc_map[_NPCM];	/* bitmap; 1 = free */
 	uint64_t		pc_spare[2];
 	struct pv_entry		pc_pventry[_NPCPV];
 };
 
 #ifdef	_KERNEL
 
 extern caddr_t	CADDR1;
 extern pt_entry_t *CMAP1;
 extern vm_paddr_t phys_avail[];
 extern vm_paddr_t dump_avail[];
 extern vm_offset_t virtual_avail;
 extern vm_offset_t virtual_end;
 
 #define	pmap_page_get_memattr(m)	((vm_memattr_t)(m)->md.pat_mode)
 #define	pmap_unmapbios(va, sz)	pmap_unmapdev((va), (sz))
 
 void	pmap_bootstrap(vm_paddr_t *);
 int	pmap_change_attr(vm_offset_t, vm_size_t, int);
 void	pmap_demote_DMAP(vm_paddr_t base, vm_size_t len, boolean_t invalidate);
 void	pmap_init_pat(void);
 void	pmap_kenter(vm_offset_t va, vm_paddr_t pa);
 void	*pmap_kenter_temporary(vm_paddr_t pa, int i);
 vm_paddr_t pmap_kextract(vm_offset_t);
 void	pmap_kremove(vm_offset_t);
 void	*pmap_mapbios(vm_paddr_t, vm_size_t);
 void	*pmap_mapdev(vm_paddr_t, vm_size_t);
 void	*pmap_mapdev_attr(vm_paddr_t, vm_size_t, int);
 boolean_t pmap_page_is_mapped(vm_page_t m);
 void	pmap_page_set_memattr(vm_page_t m, vm_memattr_t ma);
 void	pmap_unmapdev(vm_offset_t, vm_size_t);
 void	pmap_invalidate_page(pmap_t, vm_offset_t);
 void	pmap_invalidate_range(pmap_t, vm_offset_t, vm_offset_t);
 void	pmap_invalidate_all(pmap_t);
 void	pmap_invalidate_cache(void);
 void	pmap_invalidate_cache_pages(vm_page_t *pages, int count);
 void	pmap_invalidate_cache_range(vm_offset_t sva, vm_offset_t eva);
 
 #endif /* _KERNEL */
 
 #endif /* !LOCORE */
 
 #endif /* !_MACHINE_PMAP_H_ */
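
Note on the recursive-map macros in pmap.h above: the addr_PTmap,
addr_PDmap, addr_PDPmap and addr_PML4map values all come from feeding
PML4PML4I into successive levels of KVADDR().  The following is a minimal
sketch of that arithmetic, not code from the tree.  The shift values and
NPML4EPG are assumptions here (they are defined in <machine/param.h>,
which this diff does not show), and every ex_/EX_ name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed amd64 values; the real ones live in <machine/param.h>. */
#define	EX_PAGE_SHIFT	12
#define	EX_PDRSHIFT	21
#define	EX_PDPSHIFT	30
#define	EX_PML4SHIFT	39
#define	EX_NPML4EPG	512
#define	EX_PML4PML4I	(EX_NPML4EPG / 2)	/* recursive slot, as in pmap.h */

/* Same arithmetic as KVADDR(): sign-extend bit 47, then OR in the indices. */
uint64_t
ex_kvaddr(uint64_t l4, uint64_t l3, uint64_t l2, uint64_t l1)
{

	return ((~(uint64_t)0 << 47) |
	    (l4 << EX_PML4SHIFT) |
	    (l3 << EX_PDPSHIFT) |
	    (l2 << EX_PDRSHIFT) |
	    (l1 << EX_PAGE_SHIFT));
}

/* Using the recursive slot at one more level each time climbs one table. */
uint64_t
ex_addr_ptmap(void)
{

	return (ex_kvaddr(EX_PML4PML4I, 0, 0, 0));
}

uint64_t
ex_addr_pdmap(void)
{

	return (ex_kvaddr(EX_PML4PML4I, EX_PML4PML4I, 0, 0));
}

uint64_t
ex_addr_pdpmap(void)
{

	return (ex_kvaddr(EX_PML4PML4I, EX_PML4PML4I, EX_PML4PML4I, 0));
}

uint64_t
ex_addr_pml4map(void)
{

	return (ex_kvaddr(EX_PML4PML4I, EX_PML4PML4I, EX_PML4PML4I,
	    EX_PML4PML4I));
}

/* Address of the recursive entry itself; entries are 8 bytes wide. */
uint64_t
ex_addr_pml4pml4e(void)
{

	return (ex_addr_pml4map() + EX_PML4PML4I * sizeof(uint64_t));
}
```

Under these assumptions each additional use of the recursive index moves
the window one level up the page-table hierarchy, which is why vtopte()
applied to a PTE address yields the PDE that maps it.
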
Index: head/sys/amd64/include/smp.h
===================================================================
--- head/sys/amd64/include/smp.h	(revision 222812)
+++ head/sys/amd64/include/smp.h	(revision 222813)
@@ -1,82 +1,82 @@
 /*-
  * ----------------------------------------------------------------------------
  * "THE BEER-WARE LICENSE" (Revision 42):
  * <phk@FreeBSD.org> wrote this file.  As long as you retain this notice you
  * can do whatever you want with this stuff. If we meet some day, and you think
  * this stuff is worth it, you can buy me a beer in return.   Poul-Henning Kamp
  * ----------------------------------------------------------------------------
  *
  * $FreeBSD$
  *
  */
 
 #ifndef _MACHINE_SMP_H_
 #define _MACHINE_SMP_H_
 
 #ifdef _KERNEL
 
 #ifdef SMP
 
 #ifndef LOCORE
 
 #include <sys/bus.h>
 #include <machine/frame.h>
 #include <machine/intr_machdep.h>
 #include <machine/apicvar.h>
 #include <machine/pcb.h>
 
 /* global symbols in mpboot.S */
 extern char			mptramp_start[];
 extern char			mptramp_end[];
 extern u_int32_t		mptramp_pagetables;
 
 /* global data in mp_machdep.c */
 extern int			mp_naps;
 extern int			boot_cpu_id;
 extern struct pcb		stoppcbs[];
 extern int			cpu_apic_ids[];
 #ifdef COUNT_IPIS
 extern u_long *ipi_invltlb_counts[MAXCPU];
 extern u_long *ipi_invlrng_counts[MAXCPU];
 extern u_long *ipi_invlpg_counts[MAXCPU];
 extern u_long *ipi_invlcache_counts[MAXCPU];
 extern u_long *ipi_rendezvous_counts[MAXCPU];
 #endif
 
 /* IPI handlers */
 inthand_t
 	IDTVEC(invltlb),	/* TLB shootdowns - global */
 	IDTVEC(invlpg),		/* TLB shootdowns - 1 page */
 	IDTVEC(invlrng),	/* TLB shootdowns - page range */
 	IDTVEC(invlcache),	/* Write back and invalidate cache */
 	IDTVEC(ipi_intr_bitmap_handler), /* Bitmap based IPIs */ 
 	IDTVEC(cpustop),	/* CPU stops & waits to be restarted */
 	IDTVEC(cpususpend),	/* CPU suspends & waits to be resumed */
 	IDTVEC(rendezvous);	/* handle CPU rendezvous */
 
 /* functions in mp_machdep.c */
 void	cpu_add(u_int apic_id, char boot_cpu);
 void	cpustop_handler(void);
 void	cpususpend_handler(void);
 void	init_secondary(void);
 void	ipi_all_but_self(u_int ipi);
 void 	ipi_bitmap_handler(struct trapframe frame);
 void	ipi_cpu(int cpu, u_int ipi);
 int	ipi_nmi_handler(void);
-void	ipi_selected(cpumask_t cpus, u_int ipi);
+void	ipi_selected(cpuset_t cpus, u_int ipi);
 u_int	mp_bootaddress(u_int);
 int	mp_grab_cpu_hlt(void);
 void	smp_cache_flush(void);
 void	smp_invlpg(vm_offset_t addr);
-void	smp_masked_invlpg(cpumask_t mask, vm_offset_t addr);
+void	smp_masked_invlpg(cpuset_t mask, vm_offset_t addr);
 void	smp_invlpg_range(vm_offset_t startva, vm_offset_t endva);
-void	smp_masked_invlpg_range(cpumask_t mask, vm_offset_t startva,
+void	smp_masked_invlpg_range(cpuset_t mask, vm_offset_t startva,
 	    vm_offset_t endva);
 void	smp_invltlb(void);
-void	smp_masked_invltlb(cpumask_t mask);
+void	smp_masked_invltlb(cpuset_t mask);
 
 #endif /* !LOCORE */
 #endif /* SMP */
 
 #endif /* _KERNEL */
 #endif /* _MACHINE_SMP_H_ */
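
Note on the prototype changes in smp.h above: the retired __cpumask_t was
a plain unsigned int, so a mask could describe at most as many CPUs as an
int has bits.  A set type built from an array of words has no such limit.
The following is a rough sketch of the idea only.  It is not the kernel's
cpuset_t; the ex_ names and the EX_MAXCPU value are hypothetical, and the
helpers merely mirror the style of the CPU_ZERO(), CPU_SET() and
CPU_ISSET() macros from <sys/cpuset.h>.

```c
#include <limits.h>
#include <stddef.h>
#include <string.h>

#define	EX_MAXCPU	128	/* hypothetical limit, chosen to span words */
#define	EX_NBITS	(sizeof(unsigned long) * CHAR_BIT)
#define	EX_NWORDS	((EX_MAXCPU + EX_NBITS - 1) / EX_NBITS)

/* A CPU set is an array of words rather than a single integer. */
typedef struct {
	unsigned long	bits[EX_NWORDS];
} ex_cpuset_t;

/* Clear every CPU in the set. */
void
ex_cpu_zero(ex_cpuset_t *set)
{

	memset(set, 0, sizeof(*set));
}

/* Add one CPU: pick the word, then the bit inside that word. */
void
ex_cpu_set(size_t cpu, ex_cpuset_t *set)
{

	set->bits[cpu / EX_NBITS] |= 1UL << (cpu % EX_NBITS);
}

/* Return non-zero if the CPU is a member of the set. */
int
ex_cpu_isset(size_t cpu, const ex_cpuset_t *set)
{

	return ((set->bits[cpu / EX_NBITS] & (1UL << (cpu % EX_NBITS))) != 0);
}

/* Return non-zero if no CPU is a member of the set. */
int
ex_cpu_empty(const ex_cpuset_t *set)
{
	size_t i;

	for (i = 0; i < EX_NWORDS; i++)
		if (set->bits[i] != 0)
			return (0);
	return (1);
}
```

Because the set is no longer a scalar, callers can no longer build or
test it with plain integer operators such as |, & or == 0; they go
through helpers like these instead, which is the kind of change a caller
of ipi_selected() or smp_masked_invltlb() would have to make.
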
Index: head/sys/amd64/include/xen
===================================================================
--- head/sys/amd64/include/xen	(revision 222812)
+++ head/sys/amd64/include/xen	(revision 222813)

Property changes on: head/sys/amd64/include/xen
___________________________________________________________________
Modified: svn:mergeinfo
## -0,0 +0,1 ##
   Merged /projects/largeSMP/sys/amd64/include/xen:r221273-222812
Index: head/sys/arm/arm/pmap.c
===================================================================
--- head/sys/arm/arm/pmap.c	(revision 222812)
+++ head/sys/arm/arm/pmap.c	(revision 222813)
@@ -1,4926 +1,4926 @@
 /* From: $NetBSD: pmap.c,v 1.148 2004/04/03 04:35:48 bsh Exp $ */
 /*-
  * Copyright 2004 Olivier Houchard.
  * Copyright 2003 Wasabi Systems, Inc.
  * All rights reserved.
  *
  * Written by Steve C. Woodford for Wasabi Systems, Inc.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
  * 1. Redistributions of source code must retain the above copyright
  *    notice, this list of conditions and the following disclaimer.
  * 2. Redistributions in binary form must reproduce the above copyright
  *    notice, this list of conditions and the following disclaimer in the
  *    documentation and/or other materials provided with the distribution.
  * 3. All advertising materials mentioning features or use of this software
  *    must display the following acknowledgement:
  *      This product includes software developed for the NetBSD Project by
  *      Wasabi Systems, Inc.
  * 4. The name of Wasabi Systems, Inc. may not be used to endorse
  *    or promote products derived from this software without specific prior
  *    written permission.
  *
  * THIS SOFTWARE IS PROVIDED BY WASABI SYSTEMS, INC. ``AS IS'' AND
  * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
  * TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
  * PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL WASABI SYSTEMS, INC
  * BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
  * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
  * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
  * INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
  * CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
  * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
  * POSSIBILITY OF SUCH DAMAGE.
  */
 
 /*-
  * Copyright (c) 2002-2003 Wasabi Systems, Inc.
  * Copyright (c) 2001 Richard Earnshaw
  * Copyright (c) 2001-2002 Christopher Gilbert
  * All rights reserved.
  *
  * 1. Redistributions of source code must retain the above copyright
  *    notice, this list of conditions and the following disclaimer.
  * 2. Redistributions in binary form must reproduce the above copyright
  *    notice, this list of conditions and the following disclaimer in the
  *    documentation and/or other materials provided with the distribution.
  * 3. The name of the company nor the name of the author may be used to
  *    endorse or promote products derived from this software without specific
  *    prior written permission.
  *
  * THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR IMPLIED
  * WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
  * MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
  * IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT,
  * INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
  * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
  * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
  * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
  * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
  * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
  * SUCH DAMAGE.
  */
 /*-
  * Copyright (c) 1999 The NetBSD Foundation, Inc.
  * All rights reserved.
  *
  * This code is derived from software contributed to The NetBSD Foundation
  * by Charles M. Hannum.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
  * 1. Redistributions of source code must retain the above copyright
  *    notice, this list of conditions and the following disclaimer.
  * 2. Redistributions in binary form must reproduce the above copyright
  *    notice, this list of conditions and the following disclaimer in the
  *    documentation and/or other materials provided with the distribution.
  *
  * THIS SOFTWARE IS PROVIDED BY THE NETBSD FOUNDATION, INC. AND CONTRIBUTORS
  * ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
  * TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
  * PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE FOUNDATION OR CONTRIBUTORS
  * BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
  * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
  * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
  * INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
  * CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
  * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
  * POSSIBILITY OF SUCH DAMAGE.
  */
 
 /*-
  * Copyright (c) 1994-1998 Mark Brinicombe.
  * Copyright (c) 1994 Brini.
  * All rights reserved.
  *
  * This code is derived from software written for Brini by Mark Brinicombe
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
  * 1. Redistributions of source code must retain the above copyright
  *    notice, this list of conditions and the following disclaimer.
  * 2. Redistributions in binary form must reproduce the above copyright
  *    notice, this list of conditions and the following disclaimer in the
  *    documentation and/or other materials provided with the distribution.
  * 3. All advertising materials mentioning features or use of this software
  *    must display the following acknowledgement:
  *      This product includes software developed by Mark Brinicombe.
  * 4. The name of the author may not be used to endorse or promote products
  *    derived from this software without specific prior written permission.
  *
  * THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR
  * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
  * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
  * IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT,
  * INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT
  * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
  * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
  * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
  * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF
  *
  * RiscBSD kernel project
  *
  * pmap.c
  *
  * Machine dependent vm stuff
  *
  * Created      : 20/09/94
  */
 
 /*
  * Special compilation symbols
  * PMAP_DEBUG           - Build in pmap_debug_level code
  */
 /* Include header files */
 
 #include "opt_vm.h"
 
 #include <sys/cdefs.h>
 __FBSDID("$FreeBSD$");
 #include <sys/param.h>
 #include <sys/systm.h>
 #include <sys/kernel.h>
 #include <sys/ktr.h>
 #include <sys/proc.h>
 #include <sys/malloc.h>
 #include <sys/msgbuf.h>
 #include <sys/vmmeter.h>
 #include <sys/mman.h>
 #include <sys/smp.h>
 #include <sys/sched.h>
 
 #include <vm/vm.h>
 #include <vm/uma.h>
 #include <vm/pmap.h>
 #include <vm/vm_kern.h>
 #include <vm/vm_object.h>
 #include <vm/vm_map.h>
 #include <vm/vm_page.h>
 #include <vm/vm_pageout.h>
 #include <vm/vm_extern.h>
 #include <sys/lock.h>
 #include <sys/mutex.h>
 #include <machine/md_var.h>
 #include <machine/vmparam.h>
 #include <machine/cpu.h>
 #include <machine/cpufunc.h>
 #include <machine/pcb.h>
 
 #ifdef PMAP_DEBUG
 #define PDEBUG(_lev_,_stat_) \
         if (pmap_debug_level >= (_lev_)) \
                 ((_stat_))
 #define dprintf printf
 
 int pmap_debug_level = 0;
 #define PMAP_INLINE 
 #else   /* PMAP_DEBUG */
 #define PDEBUG(_lev_,_stat_) /* Nothing */
 #define dprintf(x, arg...)
 #define PMAP_INLINE __inline
 #endif  /* PMAP_DEBUG */
 
 extern struct pv_addr systempage;
 /*
  * Internal function prototypes
  */
 static void pmap_free_pv_entry (pv_entry_t);
 static pv_entry_t pmap_get_pv_entry(void);
 
 static void		pmap_enter_locked(pmap_t, vm_offset_t, vm_page_t,
     vm_prot_t, boolean_t, int);
 static void		pmap_fix_cache(struct vm_page *, pmap_t, vm_offset_t);
 static void		pmap_alloc_l1(pmap_t);
 static void		pmap_free_l1(pmap_t);
 
 static int		pmap_clearbit(struct vm_page *, u_int);
 
 static struct l2_bucket *pmap_get_l2_bucket(pmap_t, vm_offset_t);
 static struct l2_bucket *pmap_alloc_l2_bucket(pmap_t, vm_offset_t);
 static void		pmap_free_l2_bucket(pmap_t, struct l2_bucket *, u_int);
 static vm_offset_t	kernel_pt_lookup(vm_paddr_t);
 
 static MALLOC_DEFINE(M_VMPMAP, "pmap", "PMAP L1");
 
 vm_offset_t virtual_avail;	/* VA of first avail page (after kernel bss) */
 vm_offset_t virtual_end;	/* VA of last avail page (end of kernel AS) */
 vm_offset_t pmap_curmaxkvaddr;
 vm_paddr_t kernel_l1pa;
 
 extern void *end;
 vm_offset_t kernel_vm_end = 0;
 
 struct pmap kernel_pmap_store;
 
 static pt_entry_t *csrc_pte, *cdst_pte;
 static vm_offset_t csrcp, cdstp;
 static struct mtx cmtx;
 
 static void		pmap_init_l1(struct l1_ttable *, pd_entry_t *);
 /*
  * These routines are called when the CPU type is identified to set up
  * the PTE prototypes, cache modes, etc.
  *
  * The variables are always here, just in case LKMs need to reference
  * them (though, they shouldn't).
  */
 
 pt_entry_t	pte_l1_s_cache_mode;
 pt_entry_t	pte_l1_s_cache_mode_pt;
 pt_entry_t	pte_l1_s_cache_mask;
 
 pt_entry_t	pte_l2_l_cache_mode;
 pt_entry_t	pte_l2_l_cache_mode_pt;
 pt_entry_t	pte_l2_l_cache_mask;
 
 pt_entry_t	pte_l2_s_cache_mode;
 pt_entry_t	pte_l2_s_cache_mode_pt;
 pt_entry_t	pte_l2_s_cache_mask;
 
 pt_entry_t	pte_l2_s_prot_u;
 pt_entry_t	pte_l2_s_prot_w;
 pt_entry_t	pte_l2_s_prot_mask;
 
 pt_entry_t	pte_l1_s_proto;
 pt_entry_t	pte_l1_c_proto;
 pt_entry_t	pte_l2_s_proto;
 
 void		(*pmap_copy_page_func)(vm_paddr_t, vm_paddr_t);
 void		(*pmap_zero_page_func)(vm_paddr_t, int, int);
 /*
  * Which pmap is currently 'live' in the cache
  *
  * XXXSCW: Fix for SMP ...
  */
 union pmap_cache_state *pmap_cache_state;
 
 struct msgbuf *msgbufp = 0;
 
 /*
  * Crashdump maps.
  */
 static caddr_t crashdumpmap;
 
 extern void bcopy_page(vm_offset_t, vm_offset_t);
 extern void bzero_page(vm_offset_t);
 
 extern vm_offset_t alloc_firstaddr;
 
 char *_tmppt;
 
 /*
  * Metadata for L1 translation tables.
  */
 struct l1_ttable {
 	/* Entry on the L1 Table list */
 	SLIST_ENTRY(l1_ttable) l1_link;
 
 	/* Entry on the L1 Least Recently Used list */
 	TAILQ_ENTRY(l1_ttable) l1_lru;
 
 	/* Track how many domains are allocated from this L1 */
 	volatile u_int l1_domain_use_count;
 
 	/*
 	 * A free-list of domain numbers for this L1.
 	 * We avoid using ffs() and a bitmap to track domains since ffs()
 	 * is slow on ARM.
 	 */
 	u_int8_t l1_domain_first;
 	u_int8_t l1_domain_free[PMAP_DOMAINS];
 
 	/* Physical address of this L1 page table */
 	vm_paddr_t l1_physaddr;
 
 	/* KVA of this L1 page table */
 	pd_entry_t *l1_kva;
 };
 
 /*
  * Convert a virtual address into its L1 table index. That is, the
  * index used to locate the L2 descriptor table pointer in an L1 table.
  * This is basically used to index l1->l1_kva[].
  *
  * Each L2 descriptor table represents 1MB of VA space.
  */
 #define	L1_IDX(va)		(((vm_offset_t)(va)) >> L1_S_SHIFT)
 
 /*
  * L1 Page Tables are tracked using a Least Recently Used list.
  *  - New L1s are allocated from the HEAD.
  *  - Freed L1s are added to the TAIL.
  *  - Recently accessed L1s (where an 'access' is some change to one of
  *    the userland pmaps which owns this L1) are moved to the TAIL.
  */
 static TAILQ_HEAD(, l1_ttable) l1_lru_list;
 /*
  * A list of all L1 tables
  */
 static SLIST_HEAD(, l1_ttable) l1_list;
 static struct mtx l1_lru_lock;
 
 /*
  * The l2_dtable tracks L2_BUCKET_SIZE worth of L1 slots.
  *
  * This is normally 16MB worth of L2 page descriptors for any given pmap.
  * Reference counts are maintained for L2 descriptors so they can be
  * freed when empty.
  */
 struct l2_dtable {
 	/* The number of L2 page descriptors allocated to this l2_dtable */
 	u_int l2_occupancy;
 
 	/* List of L2 page descriptors */
 	struct l2_bucket {
 		pt_entry_t *l2b_kva;	/* KVA of L2 Descriptor Table */
 		vm_paddr_t l2b_phys;	/* Physical address of same */
 		u_short l2b_l1idx;	/* This L2 table's L1 index */
 		u_short l2b_occupancy;	/* How many active descriptors */
 	} l2_bucket[L2_BUCKET_SIZE];
 };
 
 /* pmap_kenter_internal flags */
 #define KENTER_CACHE	0x1
 #define KENTER_USER	0x2
 
 /*
  * Given an L1 table index, calculate the corresponding l2_dtable index
  * and bucket index within the l2_dtable.
  */
 #define	L2_IDX(l1idx)		(((l1idx) >> L2_BUCKET_LOG2) & \
 				 (L2_SIZE - 1))
 #define	L2_BUCKET(l1idx)	((l1idx) & (L2_BUCKET_SIZE - 1))
 
 /*
  * Given a virtual address, this macro returns the
  * virtual address required to drop into the next L2 bucket.
  */
 #define	L2_NEXT_BUCKET(va)	(((va) & L1_S_FRAME) + L1_S_SIZE)
 
 /*
  * L2 allocation.
  */
 #define	pmap_alloc_l2_dtable()		\
 		(void*)uma_zalloc(l2table_zone, M_NOWAIT|M_USE_RESERVE)
 #define	pmap_free_l2_dtable(l2)		\
 		uma_zfree(l2table_zone, l2)
 
 /*
  * We try to map the page tables write-through, if possible.  However, not
  * all CPUs have a write-through cache mode, so on those we have to sync
  * the cache when we frob page tables.
  *
  * We try to evaluate this at compile time, if possible.  However, it's
  * not always possible to do that, hence this run-time var.
  */
 int	pmap_needs_pte_sync;
 
 /*
  * Macro to determine if a mapping might be resident in the
  * instruction cache and/or TLB
  */
 #define	PV_BEEN_EXECD(f)  (((f) & (PVF_REF | PVF_EXEC)) == (PVF_REF | PVF_EXEC))
 
 /*
  * Macro to determine if a mapping might be resident in the
  * data cache and/or TLB
  */
 #define	PV_BEEN_REFD(f)   (((f) & PVF_REF) != 0)
 
 #ifndef PMAP_SHPGPERPROC
 #define PMAP_SHPGPERPROC 200
 #endif
 
 #define pmap_is_current(pm)	((pm) == pmap_kernel() || \
             curproc->p_vmspace->vm_map.pmap == (pm))
 static uma_zone_t pvzone = NULL;
 uma_zone_t l2zone;
 static uma_zone_t l2table_zone;
 static vm_offset_t pmap_kernel_l2dtable_kva;
 static vm_offset_t pmap_kernel_l2ptp_kva;
 static vm_paddr_t pmap_kernel_l2ptp_phys;
 static struct vm_object pvzone_obj;
 static int pv_entry_count=0, pv_entry_max=0, pv_entry_high_water=0;
 
 /*
  * This list exists for the benefit of pmap_map_chunk().  It keeps track
  * of the kernel L2 tables during bootstrap, so that pmap_map_chunk() can
  * find them as necessary.
  *
  * Note that the data on this list MUST remain valid after initarm() returns,
  * as pmap_bootstrap() uses it to construct L2 table metadata.
  */
 SLIST_HEAD(, pv_addr) kernel_pt_list = SLIST_HEAD_INITIALIZER(kernel_pt_list);
 
 static void
 pmap_init_l1(struct l1_ttable *l1, pd_entry_t *l1pt)
 {
 	int i;
 
 	l1->l1_kva = l1pt;
 	l1->l1_domain_use_count = 0;
 	l1->l1_domain_first = 0;
 
 	for (i = 0; i < PMAP_DOMAINS; i++)
 		l1->l1_domain_free[i] = i + 1;
 
 	/*
 	 * Copy the kernel's L1 entries to each new L1.
 	 */
 	if (l1pt != pmap_kernel()->pm_l1->l1_kva)
 		memcpy(l1pt, pmap_kernel()->pm_l1->l1_kva, L1_TABLE_SIZE);
 
 	if ((l1->l1_physaddr = pmap_extract(pmap_kernel(), (vm_offset_t)l1pt)) == 0)
 		panic("pmap_init_l1: can't get PA of L1 at %p", l1pt);
 	SLIST_INSERT_HEAD(&l1_list, l1, l1_link);
 	TAILQ_INSERT_TAIL(&l1_lru_list, l1, l1_lru);
 }
 
 static vm_offset_t
 kernel_pt_lookup(vm_paddr_t pa)
 {
 	struct pv_addr *pv;
 
 	SLIST_FOREACH(pv, &kernel_pt_list, pv_list) {
 		if (pv->pv_pa == pa)
 			return (pv->pv_va);
 	}
 	return (0);
 }
 
 #if (ARM_MMU_GENERIC + ARM_MMU_SA1) != 0
 void
 pmap_pte_init_generic(void)
 {
 
 	pte_l1_s_cache_mode = L1_S_B|L1_S_C;
 	pte_l1_s_cache_mask = L1_S_CACHE_MASK_generic;
 
 	pte_l2_l_cache_mode = L2_B|L2_C;
 	pte_l2_l_cache_mask = L2_L_CACHE_MASK_generic;
 
 	pte_l2_s_cache_mode = L2_B|L2_C;
 	pte_l2_s_cache_mask = L2_S_CACHE_MASK_generic;
 
 	/*
 	 * If we have a write-through cache, set B and C.  If
 	 * we have a write-back cache, then we assume setting
 	 * only C will make those pages write-through.
 	 */
 	if (cpufuncs.cf_dcache_wb_range == (void *) cpufunc_nullop) {
 		pte_l1_s_cache_mode_pt = L1_S_B|L1_S_C;
 		pte_l2_l_cache_mode_pt = L2_B|L2_C;
 		pte_l2_s_cache_mode_pt = L2_B|L2_C;
 	} else {
 		pte_l1_s_cache_mode_pt = L1_S_C;
 		pte_l2_l_cache_mode_pt = L2_C;
 		pte_l2_s_cache_mode_pt = L2_C;
 	}
 
 	pte_l2_s_prot_u = L2_S_PROT_U_generic;
 	pte_l2_s_prot_w = L2_S_PROT_W_generic;
 	pte_l2_s_prot_mask = L2_S_PROT_MASK_generic;
 
 	pte_l1_s_proto = L1_S_PROTO_generic;
 	pte_l1_c_proto = L1_C_PROTO_generic;
 	pte_l2_s_proto = L2_S_PROTO_generic;
 
 	pmap_copy_page_func = pmap_copy_page_generic;
 	pmap_zero_page_func = pmap_zero_page_generic;
 }
 
 #if defined(CPU_ARM8)
 void
 pmap_pte_init_arm8(void)
 {
 
 	/*
 	 * ARM8 is compatible with generic, but we need to use
 	 * the page tables uncached.
 	 */
 	pmap_pte_init_generic();
 
 	pte_l1_s_cache_mode_pt = 0;
 	pte_l2_l_cache_mode_pt = 0;
 	pte_l2_s_cache_mode_pt = 0;
 }
 #endif /* CPU_ARM8 */
 
 #if defined(CPU_ARM9) && defined(ARM9_CACHE_WRITE_THROUGH)
 void
 pmap_pte_init_arm9(void)
 {
 
 	/*
 	 * ARM9 is compatible with generic, but we want to use
 	 * write-through caching for now.
 	 */
 	pmap_pte_init_generic();
 
 	pte_l1_s_cache_mode = L1_S_C;
 	pte_l2_l_cache_mode = L2_C;
 	pte_l2_s_cache_mode = L2_C;
 
 	pte_l1_s_cache_mode_pt = L1_S_C;
 	pte_l2_l_cache_mode_pt = L2_C;
 	pte_l2_s_cache_mode_pt = L2_C;
 }
 #endif /* CPU_ARM9 */
 #endif /* (ARM_MMU_GENERIC + ARM_MMU_SA1) != 0 */
 
 #if defined(CPU_ARM10)
 void
 pmap_pte_init_arm10(void)
 {
 
 	/*
 	 * ARM10 is compatible with generic, but we want to use
 	 * write-through caching for the page tables for now.
 	 */
 	pmap_pte_init_generic();
 
 	pte_l1_s_cache_mode = L1_S_B | L1_S_C;
 	pte_l2_l_cache_mode = L2_B | L2_C;
 	pte_l2_s_cache_mode = L2_B | L2_C;
 
 	pte_l1_s_cache_mode_pt = L1_S_C;
 	pte_l2_l_cache_mode_pt = L2_C;
 	pte_l2_s_cache_mode_pt = L2_C;
 
 }
 #endif /* CPU_ARM10 */
 
 #if ARM_MMU_SA1 == 1
 void
 pmap_pte_init_sa1(void)
 {
 
 	/*
 	 * The StrongARM SA-1 cache does not have a write-through
 	 * mode.  So, do the generic initialization, then reset
 	 * the page table cache mode to B=1,C=1, and note that
 	 * the PTEs need to be sync'd.
 	 */
 	pmap_pte_init_generic();
 
 	pte_l1_s_cache_mode_pt = L1_S_B|L1_S_C;
 	pte_l2_l_cache_mode_pt = L2_B|L2_C;
 	pte_l2_s_cache_mode_pt = L2_B|L2_C;
 
 	pmap_needs_pte_sync = 1;
 }
 #endif /* ARM_MMU_SA1 == 1 */
 
 #if ARM_MMU_XSCALE == 1
 #if (ARM_NMMUS > 1) || defined (CPU_XSCALE_CORE3)
 static u_int xscale_use_minidata;
 #endif
 
 void
 pmap_pte_init_xscale(void)
 {
 	uint32_t auxctl;
 	int write_through = 0;
 
 	pte_l1_s_cache_mode = L1_S_B|L1_S_C|L1_S_XSCALE_P;
 	pte_l1_s_cache_mask = L1_S_CACHE_MASK_xscale;
 
 	pte_l2_l_cache_mode = L2_B|L2_C;
 	pte_l2_l_cache_mask = L2_L_CACHE_MASK_xscale;
 
 	pte_l2_s_cache_mode = L2_B|L2_C;
 	pte_l2_s_cache_mask = L2_S_CACHE_MASK_xscale;
 
 	pte_l1_s_cache_mode_pt = L1_S_C;
 	pte_l2_l_cache_mode_pt = L2_C;
 	pte_l2_s_cache_mode_pt = L2_C;
 #ifdef XSCALE_CACHE_READ_WRITE_ALLOCATE
 	/*
 	 * The XScale core has an enhanced mode where writes that
 	 * miss the cache cause a cache line to be allocated.  This
 	 * is significantly faster than the traditional, write-through
 	 * behavior of this case.
 	 */
 	pte_l1_s_cache_mode |= L1_S_XSCALE_TEX(TEX_XSCALE_X);
 	pte_l2_l_cache_mode |= L2_XSCALE_L_TEX(TEX_XSCALE_X);
 	pte_l2_s_cache_mode |= L2_XSCALE_T_TEX(TEX_XSCALE_X);
 #endif /* XSCALE_CACHE_READ_WRITE_ALLOCATE */
 #ifdef XSCALE_CACHE_WRITE_THROUGH
 	/*
 	 * Some versions of the XScale core have various bugs in
 	 * their cache units, the work-around for which is to run
 	 * the cache in write-through mode.  Unfortunately, this
 	 * has a major (negative) impact on performance.  So, we
 	 * go ahead and run fast-and-loose, in the hopes that we
 	 * don't line up the planets in a way that will trip the
 	 * bugs.
 	 *
 	 * However, we give you the option to be slow-but-correct.
 	 */
 	write_through = 1;
 #elif defined(XSCALE_CACHE_WRITE_BACK)
 	/* force write back cache mode */
 	write_through = 0;
 #elif defined(CPU_XSCALE_PXA2X0)
 	/*
 	 * Intel PXA2[15]0 processors are known to have a bug in
 	 * write-back cache on revision 4 and earlier (stepping
 	 * A[01] and B[012]).  Fixed for C0 and later.
 	 */
 	{
 		uint32_t id, type;
 
 		id = cpufunc_id();
 		type = id & ~(CPU_ID_XSCALE_COREREV_MASK|CPU_ID_REVISION_MASK);
 
 		if (type == CPU_ID_PXA250 || type == CPU_ID_PXA210) {
 			if ((id & CPU_ID_REVISION_MASK) < 5) {
 				/* write through for stepping A0-1 and B0-2 */
 				write_through = 1;
 			}
 		}
 	}
 #endif /* XSCALE_CACHE_WRITE_THROUGH */
 
 	if (write_through) {
 		pte_l1_s_cache_mode = L1_S_C;
 		pte_l2_l_cache_mode = L2_C;
 		pte_l2_s_cache_mode = L2_C;
 	}
 
 #if (ARM_NMMUS > 1)
 	xscale_use_minidata = 1;
 #endif
 
 	pte_l2_s_prot_u = L2_S_PROT_U_xscale;
 	pte_l2_s_prot_w = L2_S_PROT_W_xscale;
 	pte_l2_s_prot_mask = L2_S_PROT_MASK_xscale;
 
 	pte_l1_s_proto = L1_S_PROTO_xscale;
 	pte_l1_c_proto = L1_C_PROTO_xscale;
 	pte_l2_s_proto = L2_S_PROTO_xscale;
 
 #ifdef CPU_XSCALE_CORE3
 	pmap_copy_page_func = pmap_copy_page_generic;
 	pmap_zero_page_func = pmap_zero_page_generic;
 	xscale_use_minidata = 0;
 	/* Make sure it is L2-cachable */
 	pte_l1_s_cache_mode |= L1_S_XSCALE_TEX(TEX_XSCALE_T);
 	pte_l1_s_cache_mode_pt = pte_l1_s_cache_mode & ~L1_S_XSCALE_P;
 	pte_l2_l_cache_mode |= L2_XSCALE_L_TEX(TEX_XSCALE_T);
 	pte_l2_l_cache_mode_pt = pte_l1_s_cache_mode;
 	pte_l2_s_cache_mode |= L2_XSCALE_T_TEX(TEX_XSCALE_T);
 	pte_l2_s_cache_mode_pt = pte_l2_s_cache_mode;
 
 #else
 	pmap_copy_page_func = pmap_copy_page_xscale;
 	pmap_zero_page_func = pmap_zero_page_xscale;
 #endif
 
 	/*
 	 * Disable ECC protection of page table access, for now.
 	 */
 	__asm __volatile("mrc p15, 0, %0, c1, c0, 1" : "=r" (auxctl));
 	auxctl &= ~XSCALE_AUXCTL_P;
 	__asm __volatile("mcr p15, 0, %0, c1, c0, 1" : : "r" (auxctl));
 }
 
 /*
  * xscale_setup_minidata:
  *
  *	Set up the mini-data cache clean area.  We require the
  *	caller to allocate the right amount of physically and
  *	virtually contiguous space.
  */
 extern vm_offset_t xscale_minidata_clean_addr;
 extern vm_size_t xscale_minidata_clean_size; /* already initialized */
 void
 xscale_setup_minidata(vm_offset_t l1pt, vm_offset_t va, vm_paddr_t pa)
 {
 	pd_entry_t *pde = (pd_entry_t *) l1pt;
 	pt_entry_t *pte;
 	vm_size_t size;
 	uint32_t auxctl;
 
 	xscale_minidata_clean_addr = va;
 
 	/* Round it to page size. */
 	size = (xscale_minidata_clean_size + L2_S_OFFSET) & L2_S_FRAME;
 
 	for (; size != 0;
 	     va += L2_S_SIZE, pa += L2_S_SIZE, size -= L2_S_SIZE) {
 		pte = (pt_entry_t *) kernel_pt_lookup(
 		    pde[L1_IDX(va)] & L1_C_ADDR_MASK);
 		if (pte == NULL)
 			panic("xscale_setup_minidata: can't find L2 table for "
 			    "VA 0x%08x", (u_int32_t) va);
 		pte[l2pte_index(va)] =
 		    L2_S_PROTO | pa | L2_S_PROT(PTE_KERNEL, VM_PROT_READ) |
 		    L2_C | L2_XSCALE_T_TEX(TEX_XSCALE_X);
 	}
 
 	/*
 	 * Configure the mini-data cache for write-back with
 	 * read/write-allocate.
 	 *
 	 * NOTE: In order to reconfigure the mini-data cache, we must
 	 * make sure it contains no valid data!  In order to do that,
 	 * we must issue a global data cache invalidate command!
 	 *
 	 * WE ASSUME WE ARE RUNNING UN-CACHED WHEN THIS ROUTINE IS CALLED!
 	 * THIS IS VERY IMPORTANT!
 	 */
 
 	/* Invalidate data and mini-data. */
 	__asm __volatile("mcr p15, 0, %0, c7, c6, 0" : : "r" (0));
 	__asm __volatile("mrc p15, 0, %0, c1, c0, 1" : "=r" (auxctl));
 	auxctl = (auxctl & ~XSCALE_AUXCTL_MD_MASK) | XSCALE_AUXCTL_MD_WB_RWA;
 	__asm __volatile("mcr p15, 0, %0, c1, c0, 1" : : "r" (auxctl));
 }
 #endif
 
 /*
  * Allocate an L1 translation table for the specified pmap.
  * This is called at pmap creation time.
  */
 static void
 pmap_alloc_l1(pmap_t pm)
 {
 	struct l1_ttable *l1;
 	u_int8_t domain;
 
 	/*
 	 * Remove the L1 at the head of the LRU list
 	 */
 	mtx_lock(&l1_lru_lock);
 	l1 = TAILQ_FIRST(&l1_lru_list);
 	TAILQ_REMOVE(&l1_lru_list, l1, l1_lru);
 
 	/*
 	 * Pick the first available domain number, and update
 	 * the link to the next number.
 	 */
 	domain = l1->l1_domain_first;
 	l1->l1_domain_first = l1->l1_domain_free[domain];
 
 	/*
 	 * If there are still free domain numbers in this L1,
 	 * put it back on the TAIL of the LRU list.
 	 */
 	if (++l1->l1_domain_use_count < PMAP_DOMAINS)
 		TAILQ_INSERT_TAIL(&l1_lru_list, l1, l1_lru);
 
 	mtx_unlock(&l1_lru_lock);
 
 	/*
 	 * Fix up the relevant bits in the pmap structure
 	 */
 	pm->pm_l1 = l1;
 	pm->pm_domain = domain + 1;
 }
 
 /*
  * Free an L1 translation table.
  * This is called at pmap destruction time.
  */
 static void
 pmap_free_l1(pmap_t pm)
 {
 	struct l1_ttable *l1 = pm->pm_l1;
 
 	mtx_lock(&l1_lru_lock);
 
 	/*
 	 * If this L1 is currently on the LRU list, remove it.
 	 */
 	if (l1->l1_domain_use_count < PMAP_DOMAINS)
 		TAILQ_REMOVE(&l1_lru_list, l1, l1_lru);
 
 	/*
 	 * Free up the domain number which was allocated to the pmap
 	 */
 	l1->l1_domain_free[pm->pm_domain - 1] = l1->l1_domain_first;
 	l1->l1_domain_first = pm->pm_domain - 1;
 	l1->l1_domain_use_count--;
 
 	/*
 	 * The L1 now must have at least 1 free domain, so add
 	 * it back to the LRU list. If the use count is zero,
 	 * put it at the head of the list, otherwise it goes
 	 * to the tail.
 	 */
 	if (l1->l1_domain_use_count == 0) {
 		TAILQ_INSERT_HEAD(&l1_lru_list, l1, l1_lru);
 	} else
 		TAILQ_INSERT_TAIL(&l1_lru_list, l1, l1_lru);
 
 	mtx_unlock(&l1_lru_lock);
 }
 
 /*
  * Returns a pointer to the L2 bucket associated with the specified pmap
  * and VA, or NULL if no L2 bucket exists for the address.
  */
 static PMAP_INLINE struct l2_bucket *
 pmap_get_l2_bucket(pmap_t pm, vm_offset_t va)
 {
 	struct l2_dtable *l2;
 	struct l2_bucket *l2b;
 	u_short l1idx;
 
 	l1idx = L1_IDX(va);
 
 	if ((l2 = pm->pm_l2[L2_IDX(l1idx)]) == NULL ||
 	    (l2b = &l2->l2_bucket[L2_BUCKET(l1idx)])->l2b_kva == NULL)
 		return (NULL);
 
 	return (l2b);
 }
 
 /*
  * Returns a pointer to the L2 bucket associated with the specified pmap
  * and VA.
  *
  * If no L2 bucket exists, perform the necessary allocations to put an L2
  * bucket/page table in place.
  *
  * Note that if a new L2 bucket/page was allocated, the caller *must*
  * increment the bucket occupancy counter appropriately *before* 
  * releasing the pmap's lock to ensure no other thread or cpu deallocates
  * the bucket/page in the meantime.
  */
 static struct l2_bucket *
 pmap_alloc_l2_bucket(pmap_t pm, vm_offset_t va)
 {
 	struct l2_dtable *l2;
 	struct l2_bucket *l2b;
 	u_short l1idx;
 
 	l1idx = L1_IDX(va);
 
 	PMAP_ASSERT_LOCKED(pm);
 	mtx_assert(&vm_page_queue_mtx, MA_OWNED);
 	if ((l2 = pm->pm_l2[L2_IDX(l1idx)]) == NULL) {
 		/*
 		 * No mapping at this address, as there is
 		 * no entry in the L1 table.
 		 * Need to allocate a new l2_dtable.
 		 */
 again_l2table:
 		PMAP_UNLOCK(pm);
 		vm_page_unlock_queues();
 		if ((l2 = pmap_alloc_l2_dtable()) == NULL) {
 			vm_page_lock_queues();
 			PMAP_LOCK(pm);
 			return (NULL);
 		}
 		vm_page_lock_queues();
 		PMAP_LOCK(pm);
 		if (pm->pm_l2[L2_IDX(l1idx)] != NULL) {
 			PMAP_UNLOCK(pm);
 			vm_page_unlock_queues();
 			uma_zfree(l2table_zone, l2);
 			vm_page_lock_queues();
 			PMAP_LOCK(pm);
 			l2 = pm->pm_l2[L2_IDX(l1idx)];
 			if (l2 == NULL)
 				goto again_l2table;
 			/*
 			 * Someone already allocated the l2_dtable while
 			 * we were doing the same.
 			 */
 		} else {
 			bzero(l2, sizeof(*l2));
 			/*
 			 * Link it into the parent pmap
 			 */
 			pm->pm_l2[L2_IDX(l1idx)] = l2;
 		}
 	} 
 
 	l2b = &l2->l2_bucket[L2_BUCKET(l1idx)];
 
 	/*
 	 * Fetch pointer to the L2 page table associated with the address.
 	 */
 	if (l2b->l2b_kva == NULL) {
 		pt_entry_t *ptep;
 
 		/*
 		 * No L2 page table has been allocated. Chances are, this
 		 * is because we just allocated the l2_dtable, above.
 		 */
 again_ptep:
 		PMAP_UNLOCK(pm);
 		vm_page_unlock_queues();
 		ptep = (void*)uma_zalloc(l2zone, M_NOWAIT|M_USE_RESERVE);
 		vm_page_lock_queues();
 		PMAP_LOCK(pm);
 		if (l2b->l2b_kva != 0) {
 			/* We lost the race. */
 			PMAP_UNLOCK(pm);
 			vm_page_unlock_queues();
 			uma_zfree(l2zone, ptep);
 			vm_page_lock_queues();
 			PMAP_LOCK(pm);
 			if (l2b->l2b_kva == 0)
 				goto again_ptep;
 			return (l2b);
 		}
 		if (ptep == NULL) {
 			/*
 			 * Oops, no more L2 page tables available at this
 			 * time. We may need to deallocate the l2_dtable
 			 * if we allocated a new one above.
 			 */
 			if (l2->l2_occupancy == 0) {
 				pm->pm_l2[L2_IDX(l1idx)] = NULL;
 				pmap_free_l2_dtable(l2);
 			}
 			return (NULL);
 		}
 
 		l2->l2_occupancy++;
 		l2b->l2b_kva = ptep;
 		l2b->l2b_phys = vtophys(ptep);
 		l2b->l2b_l1idx = l1idx;
 	}
 
 	return (l2b);
 }
 
 static PMAP_INLINE void
 #ifndef PMAP_INCLUDE_PTE_SYNC
 pmap_free_l2_ptp(pt_entry_t *l2)
 #else
 pmap_free_l2_ptp(boolean_t need_sync, pt_entry_t *l2)
 #endif
 {
 #ifdef PMAP_INCLUDE_PTE_SYNC
 	/*
 	 * Note: With a write-back cache, we may need to sync this
 	 * L2 table before re-using it.
 	 * This is because it may have belonged to a non-current
 	 * pmap, in which case the cache syncs would have been
 	 * skipped when the pages were being unmapped. If the
 	 * L2 table were then to be immediately re-allocated to
 	 * the *current* pmap, it may well contain stale mappings
 	 * which have not yet been cleared by a cache write-back
 	 * and so would still be visible to the mmu.
 	 */
 	if (need_sync)
 		PTE_SYNC_RANGE(l2, L2_TABLE_SIZE_REAL / sizeof(pt_entry_t));
 #endif
 	uma_zfree(l2zone, l2);
 }
 /*
  * One or more mappings in the specified L2 descriptor table have just been
  * invalidated.
  *
  * Garbage collect the metadata and descriptor table itself if necessary.
  *
  * The pmap lock must be held when this is called (not necessary
  * for the kernel pmap).
  */
 static void
 pmap_free_l2_bucket(pmap_t pm, struct l2_bucket *l2b, u_int count)
 {
 	struct l2_dtable *l2;
 	pd_entry_t *pl1pd, l1pd;
 	pt_entry_t *ptep;
 	u_short l1idx;
 
 
 	/*
 	 * Update the bucket's reference count according to how many
 	 * PTEs the caller has just invalidated.
 	 */
 	l2b->l2b_occupancy -= count;
 
 	/*
 	 * Note:
 	 *
 	 * Level 2 page tables allocated to the kernel pmap are never freed
 	 * as that would require checking all Level 1 page tables and
 	 * removing any references to the Level 2 page table. See also the
 	 * comment elsewhere about never freeing bootstrap L2 descriptors.
 	 *
 	 * We make do with just invalidating the mapping in the L2 table.
 	 *
 	 * This isn't really a big deal in practice and, in fact, leads
 	 * to a performance win over time as we don't need to continually
 	 * alloc/free.
 	 */
 	if (l2b->l2b_occupancy > 0 || pm == pmap_kernel())
 		return;
 
 	/*
 	 * There are no more valid mappings in this level 2 page table.
 	 * Go ahead and NULL-out the pointer in the bucket, then
 	 * free the page table.
 	 */
 	l1idx = l2b->l2b_l1idx;
 	ptep = l2b->l2b_kva;
 	l2b->l2b_kva = NULL;
 
 	pl1pd = &pm->pm_l1->l1_kva[l1idx];
 
 	/*
 	 * If the L1 slot matches the pmap's domain
 	 * number, then invalidate it.
 	 */
 	l1pd = *pl1pd & (L1_TYPE_MASK | L1_C_DOM_MASK);
 	if (l1pd == (L1_C_DOM(pm->pm_domain) | L1_TYPE_C)) {
 		*pl1pd = 0;
 		PTE_SYNC(pl1pd);
 	}
 
 	/*
 	 * Release the L2 descriptor table back to the pool cache.
 	 */
 #ifndef PMAP_INCLUDE_PTE_SYNC
 	pmap_free_l2_ptp(ptep);
 #else
 	pmap_free_l2_ptp(!pmap_is_current(pm), ptep);
 #endif
 
 	/*
 	 * Update the reference count in the associated l2_dtable
 	 */
 	l2 = pm->pm_l2[L2_IDX(l1idx)];
 	if (--l2->l2_occupancy > 0)
 		return;
 
 	/*
 	 * There are no more valid mappings in any of the Level 1
 	 * slots managed by this l2_dtable. Go ahead and NULL-out
 	 * the pointer in the parent pmap and free the l2_dtable.
 	 */
 	pm->pm_l2[L2_IDX(l1idx)] = NULL;
 	pmap_free_l2_dtable(l2);
 }
 
 /*
  * Pool cache constructors for L2 descriptor tables, metadata and pmap
  * structures.
  */
 static int
 pmap_l2ptp_ctor(void *mem, int size, void *arg, int flags)
 {
 #ifndef PMAP_INCLUDE_PTE_SYNC
 	struct l2_bucket *l2b;
 	pt_entry_t *ptep, pte;
 #ifdef ARM_USE_SMALL_ALLOC
 	pd_entry_t *pde;
 #endif
 	vm_offset_t va = (vm_offset_t)mem & ~PAGE_MASK;
 
 	/*
 	 * The mappings for these page tables were initially made using
 	 * pmap_kenter() by the pool subsystem. Therefore, the cache-
 	 * mode will not be right for page table mappings. To avoid
 	 * polluting the pmap_kenter() code with a special case for
 	 * page tables, we simply fix up the cache-mode here if it's not
 	 * correct.
 	 */
 #ifdef ARM_USE_SMALL_ALLOC
 	pde = &kernel_pmap->pm_l1->l1_kva[L1_IDX(va)];
 	if (!l1pte_section_p(*pde)) {
 #endif
 		l2b = pmap_get_l2_bucket(pmap_kernel(), va);
 		ptep = &l2b->l2b_kva[l2pte_index(va)];
 		pte = *ptep;
 		
 		if ((pte & L2_S_CACHE_MASK) != pte_l2_s_cache_mode_pt) {
 			/*
 			 * Page tables must have the cache-mode set to 
 			 * Write-Thru.
 			 */
 			*ptep = (pte & ~L2_S_CACHE_MASK) | pte_l2_s_cache_mode_pt;
 			PTE_SYNC(ptep);
 			cpu_tlb_flushD_SE(va);
 			cpu_cpwait();
 		}
 #ifdef ARM_USE_SMALL_ALLOC
 	}
 #endif
 #endif
 	memset(mem, 0, L2_TABLE_SIZE_REAL);
 	PTE_SYNC_RANGE(mem, L2_TABLE_SIZE_REAL / sizeof(pt_entry_t));
 	return (0);
 }
 
 /*
  * A bunch of routines to conditionally flush the caches/TLB depending
  * on whether the specified pmap actually needs to be flushed at any
  * given time.
  */
 static PMAP_INLINE void
 pmap_tlb_flushID_SE(pmap_t pm, vm_offset_t va)
 {
 
 	if (pmap_is_current(pm))
 		cpu_tlb_flushID_SE(va);
 }
 
 static PMAP_INLINE void
 pmap_tlb_flushD_SE(pmap_t pm, vm_offset_t va)
 {
 
 	if (pmap_is_current(pm))
 		cpu_tlb_flushD_SE(va);
 }
 
 static PMAP_INLINE void
 pmap_tlb_flushID(pmap_t pm)
 {
 
 	if (pmap_is_current(pm))
 		cpu_tlb_flushID();
 }
 static PMAP_INLINE void
 pmap_tlb_flushD(pmap_t pm)
 {
 
 	if (pmap_is_current(pm))
 		cpu_tlb_flushD();
 }
 
 static int
 pmap_has_valid_mapping(pmap_t pm, vm_offset_t va)
 {
 	pd_entry_t *pde;
 	pt_entry_t *ptep;
 
 	if (pmap_get_pde_pte(pm, va, &pde, &ptep) &&
 	    ptep && ((*ptep & L2_TYPE_MASK) != L2_TYPE_INV))
 		return (1);
 
 	return (0);
 }
 
 static PMAP_INLINE void
 pmap_idcache_wbinv_range(pmap_t pm, vm_offset_t va, vm_size_t len)
 {
 	vm_size_t rest;
 
 	CTR4(KTR_PMAP, "pmap_idcache_wbinv_range: pmap %p is_kernel %d "
 	    "va 0x%08x len 0x%x ", pm, pm == pmap_kernel(), va, len);
 
 	if (pmap_is_current(pm) || pm == pmap_kernel()) {
 		rest = MIN(PAGE_SIZE - (va & PAGE_MASK), len);
 		while (len > 0) {
 			if (pmap_has_valid_mapping(pm, va)) {
 				cpu_idcache_wbinv_range(va, rest);
 				cpu_l2cache_wbinv_range(va, rest);
 			}
 			len -= rest;
 			va += rest;
 			rest = MIN(PAGE_SIZE, len);
 		}
 	}
 }
 
 static PMAP_INLINE void
 pmap_dcache_wb_range(pmap_t pm, vm_offset_t va, vm_size_t len, boolean_t do_inv,
     boolean_t rd_only)
 {
 	vm_size_t rest;
 
 	CTR4(KTR_PMAP, "pmap_dcache_wb_range: pmap %p is_kernel %d va 0x%08x "
 	    "len 0x%x ", pm, pm == pmap_kernel(), va, len);
 	CTR2(KTR_PMAP, " do_inv %d rd_only %d", do_inv, rd_only);
 
 	if (pmap_is_current(pm)) {
 		rest = MIN(PAGE_SIZE - (va & PAGE_MASK), len);
 		while (len > 0) {
 			if (pmap_has_valid_mapping(pm, va)) {
 				if (do_inv && rd_only) {
 					cpu_dcache_inv_range(va, rest);
 					cpu_l2cache_inv_range(va, rest);
 				} else if (do_inv) {
 					cpu_dcache_wbinv_range(va, rest);
 					cpu_l2cache_wbinv_range(va, rest);
 				} else if (!rd_only) {
 					cpu_dcache_wb_range(va, rest);
 					cpu_l2cache_wb_range(va, rest);
 				}
 			}
 			len -= rest;
 			va += rest;
 
 			rest = MIN(PAGE_SIZE, len);
 		}
 	}
 }
 
 static PMAP_INLINE void
 pmap_idcache_wbinv_all(pmap_t pm)
 {
 
 	if (pmap_is_current(pm)) {
 		cpu_idcache_wbinv_all();
 		cpu_l2cache_wbinv_all();
 	}
 }
 
 #ifdef notyet
 static PMAP_INLINE void
 pmap_dcache_wbinv_all(pmap_t pm)
 {
 
 	if (pmap_is_current(pm)) {
 		cpu_dcache_wbinv_all();
 		cpu_l2cache_wbinv_all();
 	}
 }
 #endif
 
 /*
  * PTE_SYNC_CURRENT:
  *
  *     Make sure the pte is written out to RAM.
  *     We need to do this in any of the following cases:
  *       - We're dealing with the kernel pmap
  *       - There is no pmap active in the cache/tlb.
  *       - The specified pmap is 'active' in the cache/tlb.
  */
 #ifdef PMAP_INCLUDE_PTE_SYNC
 #define	PTE_SYNC_CURRENT(pm, ptep)	\
 do {					\
 	if (PMAP_NEEDS_PTE_SYNC && 	\
 	    pmap_is_current(pm))	\
 		PTE_SYNC(ptep);		\
 } while (/*CONSTCOND*/0)
 #else
 #define	PTE_SYNC_CURRENT(pm, ptep)	/* nothing */
 #endif
 
 /*
  * cacheable == -1 means we must make the entry uncacheable, 1 means
  * cacheable.
  */
 static __inline void
 pmap_set_cache_entry(pv_entry_t pv, pmap_t pm, vm_offset_t va, int cacheable)
 {
 	struct l2_bucket *l2b;
 	pt_entry_t *ptep, pte;
 
 	l2b = pmap_get_l2_bucket(pv->pv_pmap, pv->pv_va);
 	ptep = &l2b->l2b_kva[l2pte_index(pv->pv_va)];
 
 	if (cacheable == 1) {
 		pte = (*ptep & ~L2_S_CACHE_MASK) | pte_l2_s_cache_mode;
 		if (l2pte_valid(pte)) {
 			if (PV_BEEN_EXECD(pv->pv_flags)) {
 				pmap_tlb_flushID_SE(pv->pv_pmap, pv->pv_va);
 			} else if (PV_BEEN_REFD(pv->pv_flags)) {
 				pmap_tlb_flushD_SE(pv->pv_pmap, pv->pv_va);
 			}
 		}
 	} else {
 		pte = *ptep &~ L2_S_CACHE_MASK;
 		if ((va != pv->pv_va || pm != pv->pv_pmap) &&
 			    l2pte_valid(pte)) {
 			if (PV_BEEN_EXECD(pv->pv_flags)) {
 				pmap_idcache_wbinv_range(pv->pv_pmap,
 					    pv->pv_va, PAGE_SIZE);
 				pmap_tlb_flushID_SE(pv->pv_pmap, pv->pv_va);
 			} else if (PV_BEEN_REFD(pv->pv_flags)) {
 				pmap_dcache_wb_range(pv->pv_pmap,
 					    pv->pv_va, PAGE_SIZE, TRUE,
 					    (pv->pv_flags & PVF_WRITE) == 0);
 				pmap_tlb_flushD_SE(pv->pv_pmap,
 					    pv->pv_va);
 			}
 		}
 	}
 	*ptep = pte;
 	PTE_SYNC_CURRENT(pv->pv_pmap, ptep);
 }
 
 static void
 pmap_fix_cache(struct vm_page *pg, pmap_t pm, vm_offset_t va)
 {
 	int pmwc = 0;
 	int writable = 0, kwritable = 0, uwritable = 0;
 	int entries = 0, kentries = 0, uentries = 0;
 	struct pv_entry *pv;
 
 	mtx_assert(&vm_page_queue_mtx, MA_OWNED);
 
 	/*
 	 * The cache gets written back/invalidated on context switch.
 	 * Therefore, if a user page is mapped more than once within
 	 * the same pmap, or is also mapped by the kernel pmap, and at
 	 * least one of those mappings is writable, the mappings must
 	 * be made uncacheable.
 	 */
 
 	TAILQ_FOREACH(pv, &pg->md.pv_list, pv_list) {
 			/* generate a count of the pv_entry uses */
 		if (pv->pv_flags & PVF_WRITE) {
 			if (pv->pv_pmap == pmap_kernel())
 				kwritable++;
 			else if (pv->pv_pmap == pm)
 				uwritable++;
 			writable++;
 		}
 		if (pv->pv_pmap == pmap_kernel())
 			kentries++;
 		else {
 			if (pv->pv_pmap == pm)
 				uentries++;
 			entries++;
 		}
 	}
 	/*
 	 * check if the user duplicate mapping has
 	 * been removed.
 	 */
 	if ((pm != pmap_kernel()) && (((uentries > 1) && uwritable) ||
 	    (uwritable > 1)))
 		pmwc = 1;
 
 	TAILQ_FOREACH(pv, &pg->md.pv_list, pv_list) {
 		/* check for user uncachable conditions - order is important */
 		if (pm != pmap_kernel() &&
 		    (pv->pv_pmap == pm || pv->pv_pmap == pmap_kernel())) {
 
 			if ((uentries > 1 && uwritable) || uwritable > 1) {
 
 				/* user duplicate mapping */
 				if (pv->pv_pmap != pmap_kernel())
 					pv->pv_flags |= PVF_MWC;
 
 				if (!(pv->pv_flags & PVF_NC)) {
 					pv->pv_flags |= PVF_NC;
 					pmap_set_cache_entry(pv, pm, va, -1);
 				}
 				continue;
 			} else	/* no longer a duplicate user */
 				pv->pv_flags &= ~PVF_MWC;
 		}
 
 		/*
 		 * check for kernel uncachable conditions
 		 * kernel writable or kernel readable with writable user entry
 		 */
 		if ((kwritable && (entries || kentries > 1)) ||
 		    (kwritable > 1) ||
 		    ((kwritable != writable) && kentries &&
 		     (pv->pv_pmap == pmap_kernel() ||
 		      (pv->pv_flags & PVF_WRITE) ||
 		      (pv->pv_flags & PVF_MWC)))) {
 
 			if (!(pv->pv_flags & PVF_NC)) {
 				pv->pv_flags |= PVF_NC;
 				pmap_set_cache_entry(pv, pm, va, -1);
 			}
 			continue;
 		}
 
 			/* kernel and user are cachable */
 		if ((pm == pmap_kernel()) && !(pv->pv_flags & PVF_MWC) &&
 		    (pv->pv_flags & PVF_NC)) {
 
 			pv->pv_flags &= ~PVF_NC;
 			pmap_set_cache_entry(pv, pm, va, 1);
 			continue;
 		}
 			/* user is no longer sharable and writable */
 		if (pm != pmap_kernel() &&
 		    (pv->pv_pmap == pm || pv->pv_pmap == pmap_kernel()) &&
 		    !pmwc && (pv->pv_flags & PVF_NC)) {
 
 			pv->pv_flags &= ~(PVF_NC | PVF_MWC);
 			pmap_set_cache_entry(pv, pm, va, 1);
 		}
 	}
 
 	if ((kwritable == 0) && (writable == 0)) {
 		pg->md.pvh_attrs &= ~PVF_MOD;
 		vm_page_flag_clear(pg, PG_WRITEABLE);
 		return;
 	}
 }
 
 /*
  * Modify pte bits for all ptes corresponding to the given physical address.
  * We use `maskbits' rather than `clearbits' because we're always passing
  * constants and the latter would require an extra inversion at run-time.
  */
 static int 
 pmap_clearbit(struct vm_page *pg, u_int maskbits)
 {
 	struct l2_bucket *l2b;
 	struct pv_entry *pv;
 	pt_entry_t *ptep, npte, opte;
 	pmap_t pm;
 	vm_offset_t va;
 	u_int oflags;
 	int count = 0;
 
 	vm_page_lock_queues();
 
 	if (maskbits & PVF_WRITE)
 		maskbits |= PVF_MOD;
 	/*
 	 * Clear saved attributes (modify, reference)
 	 */
 	pg->md.pvh_attrs &= ~(maskbits & (PVF_MOD | PVF_REF));
 
 	if (TAILQ_EMPTY(&pg->md.pv_list)) {
 		vm_page_unlock_queues();
 		return (0);
 	}
 
 	/*
 	 * Loop over all current mappings setting/clearing as appropriate
 	 */
 	TAILQ_FOREACH(pv, &pg->md.pv_list, pv_list) {
 		va = pv->pv_va;
 		pm = pv->pv_pmap;
 		oflags = pv->pv_flags;
 
 		if (!(oflags & maskbits)) {
 			if ((maskbits & PVF_WRITE) && (pv->pv_flags & PVF_NC)) {
 				/* It is safe to re-enable caching here. */
 				PMAP_LOCK(pm);
 				l2b = pmap_get_l2_bucket(pm, va);
 				ptep = &l2b->l2b_kva[l2pte_index(va)];
 				*ptep |= pte_l2_s_cache_mode;
 				PTE_SYNC(ptep);
 				PMAP_UNLOCK(pm);
 				pv->pv_flags &= ~(PVF_NC | PVF_MWC);
 				
 			}
 			continue;
 		}
 		pv->pv_flags &= ~maskbits;
 
 		PMAP_LOCK(pm);
 
 		l2b = pmap_get_l2_bucket(pm, va);
 
 		ptep = &l2b->l2b_kva[l2pte_index(va)];
 		npte = opte = *ptep;
 
 		if (maskbits & (PVF_WRITE|PVF_MOD)) {
 			if ((pv->pv_flags & PVF_NC)) {
 				/*
 				 * Entry is not cacheable:
 				 *
 				 * Don't turn caching on again if this is a
 				 * modified emulation. This would be
 				 * inconsistent with the settings created by
 				 * pmap_fix_cache(). Otherwise, it's safe
 				 * to re-enable caching.
 				 *
 				 * There's no need to call pmap_fix_cache()
 				 * here: all pages are losing their write
 				 * permission.
 				 */
 				if (maskbits & PVF_WRITE) {
 					npte |= pte_l2_s_cache_mode;
 					pv->pv_flags &= ~(PVF_NC | PVF_MWC);
 				}
 			} else
 			if (opte & L2_S_PROT_W) {
 				vm_page_dirty(pg);
 				/*
 				 * Entry is writable/cacheable: if the pmap
 				 * is current, flush it; otherwise it won't
 				 * be in the cache.
 				 */
 				if (PV_BEEN_EXECD(oflags))
 					pmap_idcache_wbinv_range(pm, pv->pv_va,
 					    PAGE_SIZE);
 				else
 				if (PV_BEEN_REFD(oflags))
 					pmap_dcache_wb_range(pm, pv->pv_va,
 					    PAGE_SIZE,
 					    (maskbits & PVF_REF) ? TRUE : FALSE,
 					    FALSE);
 			}
 
 			/* make the pte read only */
 			npte &= ~L2_S_PROT_W;
 		}
 
 		if (maskbits & PVF_REF) {
 			if ((pv->pv_flags & PVF_NC) == 0 &&
 			    (maskbits & (PVF_WRITE|PVF_MOD)) == 0) {
 				/*
 				 * Check npte here; we may have already
 				 * done the wbinv above, and the validity
 				 * of the PTE is the same for opte and
 				 * npte.
 				 */
 				if (npte & L2_S_PROT_W) {
 					if (PV_BEEN_EXECD(oflags))
 						pmap_idcache_wbinv_range(pm,
 						    pv->pv_va, PAGE_SIZE);
 					else
 					if (PV_BEEN_REFD(oflags))
 						pmap_dcache_wb_range(pm,
 						    pv->pv_va, PAGE_SIZE,
 						    TRUE, FALSE);
 				} else
 				if ((npte & L2_TYPE_MASK) != L2_TYPE_INV) {
 					/* XXXJRT need idcache_inv_range */
 					if (PV_BEEN_EXECD(oflags))
 						pmap_idcache_wbinv_range(pm,
 						    pv->pv_va, PAGE_SIZE);
 					else
 					if (PV_BEEN_REFD(oflags))
 						pmap_dcache_wb_range(pm,
 						    pv->pv_va, PAGE_SIZE,
 						    TRUE, TRUE);
 				}
 			}
 
 			/*
 			 * Make the PTE invalid so that we will take a
 			 * page fault the next time the mapping is
 			 * referenced.
 			 */
 			npte &= ~L2_TYPE_MASK;
 			npte |= L2_TYPE_INV;
 		}
 
 		if (npte != opte) {
 			count++;
 			*ptep = npte;
 			PTE_SYNC(ptep);
 			/* Flush the TLB entry if a current pmap. */
 			if (PV_BEEN_EXECD(oflags))
 				pmap_tlb_flushID_SE(pm, pv->pv_va);
 			else
 			if (PV_BEEN_REFD(oflags))
 				pmap_tlb_flushD_SE(pm, pv->pv_va);
 		}
 
 		PMAP_UNLOCK(pm);
 
 	}
 
 	if (maskbits & PVF_WRITE)
 		vm_page_flag_clear(pg, PG_WRITEABLE);
 	vm_page_unlock_queues();
 	return (count);
 }
 
 /*
  * main pv_entry manipulation functions:
  *   pmap_enter_pv: enter a mapping onto a vm_page list
  *   pmap_remove_pv: remove a mapping from a vm_page list
  *
  * NOTE: pmap_enter_pv expects to lock the pvh itself
  *       pmap_remove_pv expects the caller to lock the pvh before calling
  */
 
 /*
  * pmap_enter_pv: enter a mapping onto a vm_page list
  *
  * => caller should hold the proper lock on pmap_main_lock
  * => caller should have pmap locked
  * => we will gain the lock on the vm_page and allocate the new pv_entry
  * => caller should adjust ptp's wire_count before calling
  * => caller should not adjust pmap's wire_count
  */
 static void
 pmap_enter_pv(struct vm_page *pg, struct pv_entry *pve, pmap_t pm,
     vm_offset_t va, u_int flags)
 {
 
 	int km;
 
 	mtx_assert(&vm_page_queue_mtx, MA_OWNED);
 
 	if (pg->md.pv_kva) {
 		/* PMAP_ASSERT_LOCKED(pmap_kernel()); */
 		pve->pv_pmap = pmap_kernel();
 		pve->pv_va = pg->md.pv_kva;
 		pve->pv_flags = PVF_WRITE | PVF_UNMAN;
 		pg->md.pv_kva = 0;
 
 		if (!(km = PMAP_OWNED(pmap_kernel())))
 			PMAP_LOCK(pmap_kernel());
 		TAILQ_INSERT_HEAD(&pg->md.pv_list, pve, pv_list);
 		TAILQ_INSERT_HEAD(&pve->pv_pmap->pm_pvlist, pve, pv_plist);
 		PMAP_UNLOCK(pmap_kernel());
 		vm_page_unlock_queues();
 		if ((pve = pmap_get_pv_entry()) == NULL)
 			panic("pmap_kenter_internal: no pv entries");
 		vm_page_lock_queues();
 		if (km)
 			PMAP_LOCK(pmap_kernel());
 	}
 
 	PMAP_ASSERT_LOCKED(pm);
 	pve->pv_pmap = pm;
 	pve->pv_va = va;
 	pve->pv_flags = flags;
 
 	TAILQ_INSERT_HEAD(&pg->md.pv_list, pve, pv_list);
 	TAILQ_INSERT_HEAD(&pm->pm_pvlist, pve, pv_plist);
 	pg->md.pvh_attrs |= flags & (PVF_REF | PVF_MOD);
 	if (pve->pv_flags & PVF_WIRED)
 		++pm->pm_stats.wired_count;
 	vm_page_flag_set(pg, PG_REFERENCED);
 }
 
 /*
  *
  * pmap_find_pv: Find a pv entry
  *
  * => caller should hold lock on vm_page
  */
 static PMAP_INLINE struct pv_entry *
 pmap_find_pv(struct vm_page *pg, pmap_t pm, vm_offset_t va)
 {
 	struct pv_entry *pv;
 
 	mtx_assert(&vm_page_queue_mtx, MA_OWNED);
 	TAILQ_FOREACH(pv, &pg->md.pv_list, pv_list)
 	    if (pm == pv->pv_pmap && va == pv->pv_va)
 		    break;
 	return (pv);
 }
 
 /*
  * vector_page_setprot:
  *
  *	Manipulate the protection of the vector page.
  */
 void
 vector_page_setprot(int prot)
 {
 	struct l2_bucket *l2b;
 	pt_entry_t *ptep;
 
 	l2b = pmap_get_l2_bucket(pmap_kernel(), vector_page);
 
 	ptep = &l2b->l2b_kva[l2pte_index(vector_page)];
 
 	*ptep = (*ptep & ~L1_S_PROT_MASK) | L2_S_PROT(PTE_KERNEL, prot);
 	PTE_SYNC(ptep);
 	cpu_tlb_flushD_SE(vector_page);
 	cpu_cpwait();
 }
 
 /*
  * pmap_remove_pv: try to remove a mapping from a pv_list
  *
  * => caller should hold proper lock on pmap_main_lock
  * => pmap should be locked
  * => caller should hold lock on vm_page [so that attrs can be adjusted]
  * => caller should adjust ptp's wire_count and free PTP if needed
  * => caller should NOT adjust pmap's wire_count
  * => we return the removed pve
  */
 
 static void
 pmap_nuke_pv(struct vm_page *pg, pmap_t pm, struct pv_entry *pve)
 {
 
 	struct pv_entry *pv;
 	mtx_assert(&vm_page_queue_mtx, MA_OWNED);
 	PMAP_ASSERT_LOCKED(pm);
 	TAILQ_REMOVE(&pg->md.pv_list, pve, pv_list);
 	TAILQ_REMOVE(&pm->pm_pvlist, pve, pv_plist);
 	if (pve->pv_flags & PVF_WIRED)
 		--pm->pm_stats.wired_count;
 	if (pg->md.pvh_attrs & PVF_MOD)
 		vm_page_dirty(pg);
 	if (TAILQ_FIRST(&pg->md.pv_list) == NULL)
 		pg->md.pvh_attrs &= ~PVF_REF;
 	else
 		vm_page_flag_set(pg, PG_REFERENCED);
 	if ((pve->pv_flags & PVF_NC) && ((pm == pmap_kernel()) ||
 	     (pve->pv_flags & PVF_WRITE) || !(pve->pv_flags & PVF_MWC)))
 		pmap_fix_cache(pg, pm, 0);
 	else if (pve->pv_flags & PVF_WRITE) {
 		TAILQ_FOREACH(pve, &pg->md.pv_list, pv_list)
 		    if (pve->pv_flags & PVF_WRITE)
 			    break;
 		if (!pve) {
 			pg->md.pvh_attrs &= ~PVF_MOD;
 			vm_page_flag_clear(pg, PG_WRITEABLE);
 		}
 	}
 	pv = TAILQ_FIRST(&pg->md.pv_list);
 	if (pv != NULL && (pv->pv_flags & PVF_UNMAN) &&
 	    TAILQ_NEXT(pv, pv_list) == NULL) {
 		pm = kernel_pmap;
 		pg->md.pv_kva = pv->pv_va;
 			/* a recursive pmap_nuke_pv */
 		TAILQ_REMOVE(&pg->md.pv_list, pv, pv_list);
 		TAILQ_REMOVE(&pm->pm_pvlist, pv, pv_plist);
 		if (pv->pv_flags & PVF_WIRED)
 			--pm->pm_stats.wired_count;
 		pg->md.pvh_attrs &= ~PVF_REF;
 		pg->md.pvh_attrs &= ~PVF_MOD;
 		vm_page_flag_clear(pg, PG_WRITEABLE);
 		pmap_free_pv_entry(pv);
 	}
 }
 
 static struct pv_entry *
 pmap_remove_pv(struct vm_page *pg, pmap_t pm, vm_offset_t va)
 {
 	struct pv_entry *pve;
 
 	mtx_assert(&vm_page_queue_mtx, MA_OWNED);
 	pve = TAILQ_FIRST(&pg->md.pv_list);
 
 	while (pve) {
 		if (pve->pv_pmap == pm && pve->pv_va == va) {	/* match? */
 			pmap_nuke_pv(pg, pm, pve);
 			break;
 		}
 		pve = TAILQ_NEXT(pve, pv_list);
 	}
 
 	if (pve == NULL && pg->md.pv_kva == va)
 		pg->md.pv_kva = 0;
 
 	return(pve);				/* return removed pve */
 }
 /*
  *
  * pmap_modify_pv: Update pv flags
  *
  * => caller should hold lock on vm_page [so that attrs can be adjusted]
  * => caller should NOT adjust pmap's wire_count
  * => we return the old flags
  * 
  * Modify a physical-virtual mapping in the pv table
  */
 static u_int
 pmap_modify_pv(struct vm_page *pg, pmap_t pm, vm_offset_t va,
     u_int clr_mask, u_int set_mask)
 {
 	struct pv_entry *npv;
 	u_int flags, oflags;
 
 	PMAP_ASSERT_LOCKED(pm);
 	mtx_assert(&vm_page_queue_mtx, MA_OWNED);
 	if ((npv = pmap_find_pv(pg, pm, va)) == NULL)
 		return (0);
 
 	/*
 	 * There is at least one VA mapping this page.
 	 */
 
 	if (clr_mask & (PVF_REF | PVF_MOD))
 		pg->md.pvh_attrs |= set_mask & (PVF_REF | PVF_MOD);
 
 	oflags = npv->pv_flags;
 	npv->pv_flags = flags = (oflags & ~clr_mask) | set_mask;
 
 	if ((flags ^ oflags) & PVF_WIRED) {
 		if (flags & PVF_WIRED)
 			++pm->pm_stats.wired_count;
 		else
 			--pm->pm_stats.wired_count;
 	}
 
 	if ((flags ^ oflags) & PVF_WRITE)
 		pmap_fix_cache(pg, pm, 0);
 
 	return (oflags);
 }
 
 /* Function to set the debug level of the pmap code */
 #ifdef PMAP_DEBUG
 void
 pmap_debug(int level)
 {
 	pmap_debug_level = level;
 	dprintf("pmap_debug: level=%d\n", pmap_debug_level);
 }
 #endif  /* PMAP_DEBUG */
 
 void
 pmap_pinit0(struct pmap *pmap)
 {
 	PDEBUG(1, printf("pmap_pinit0: pmap = %08x\n", (u_int32_t) pmap));
 
 	dprintf("pmap_pinit0: pmap = %08x, pm_pdir = %08x\n",
 		(u_int32_t) pmap, (u_int32_t) pmap->pm_pdir);
 	bcopy(kernel_pmap, pmap, sizeof(*pmap));
 	bzero(&pmap->pm_mtx, sizeof(pmap->pm_mtx));
 	PMAP_LOCK_INIT(pmap);
 }
 
 /*
  *	Initialize a vm_page's machine-dependent fields.
  */
 void
 pmap_page_init(vm_page_t m)
 {
 
 	TAILQ_INIT(&m->md.pv_list);
 }
 
 /*
  *      Initialize the pmap module.
  *      Called by vm_init, to initialize any structures that the pmap
  *      system needs to map virtual memory.
  */
 void
 pmap_init(void)
 {
 	int shpgperproc = PMAP_SHPGPERPROC;
 
 	PDEBUG(1, printf("pmap_init: phys_start = %08x\n", PHYSADDR));
 
 	/*
 	 * init the pv free list
 	 */
 	pvzone = uma_zcreate("PV ENTRY", sizeof (struct pv_entry), NULL, NULL, 
 	    NULL, NULL, UMA_ALIGN_PTR, UMA_ZONE_VM | UMA_ZONE_NOFREE);
 	/*
 	 * Now it is safe to enable pv_table recording.
 	 */
 	PDEBUG(1, printf("pmap_init: done!\n"));
 
 	TUNABLE_INT_FETCH("vm.pmap.shpgperproc", &shpgperproc);
 	
 	pv_entry_max = shpgperproc * maxproc + cnt.v_page_count;
 	pv_entry_high_water = 9 * (pv_entry_max / 10);
 	l2zone = uma_zcreate("L2 Table", L2_TABLE_SIZE_REAL, pmap_l2ptp_ctor,
 	    NULL, NULL, NULL, UMA_ALIGN_PTR, UMA_ZONE_VM | UMA_ZONE_NOFREE);
 	l2table_zone = uma_zcreate("L2 Table", sizeof(struct l2_dtable),
 	    NULL, NULL, NULL, NULL, UMA_ALIGN_PTR,
 	    UMA_ZONE_VM | UMA_ZONE_NOFREE);
 
 	uma_zone_set_obj(pvzone, &pvzone_obj, pv_entry_max);
 
 }
 
 int
 pmap_fault_fixup(pmap_t pm, vm_offset_t va, vm_prot_t ftype, int user)
 {
 	struct l2_dtable *l2;
 	struct l2_bucket *l2b;
 	pd_entry_t *pl1pd, l1pd;
 	pt_entry_t *ptep, pte;
 	vm_paddr_t pa;
 	u_int l1idx;
 	int rv = 0;
 
 	l1idx = L1_IDX(va);
 	vm_page_lock_queues();
 	PMAP_LOCK(pm);
 
 	/*
 	 * If there is no l2_dtable for this address, then the process
 	 * has no business accessing it.
 	 *
 	 * Note: This will catch userland processes trying to access
 	 * kernel addresses.
 	 */
 	l2 = pm->pm_l2[L2_IDX(l1idx)];
 	if (l2 == NULL)
 		goto out;
 
 	/*
 	 * Likewise if there is no L2 descriptor table
 	 */
 	l2b = &l2->l2_bucket[L2_BUCKET(l1idx)];
 	if (l2b->l2b_kva == NULL)
 		goto out;
 
 	/*
 	 * Check the PTE itself.
 	 */
 	ptep = &l2b->l2b_kva[l2pte_index(va)];
 	pte = *ptep;
 	if (pte == 0)
 		goto out;
 
 	/*
 	 * Catch a userland access to the vector page mapped at 0x0
 	 */
 	if (user && (pte & L2_S_PROT_U) == 0)
 		goto out;
 	if (va == vector_page)
 		goto out;
 
 	pa = l2pte_pa(pte);
 
 	if ((ftype & VM_PROT_WRITE) && (pte & L2_S_PROT_W) == 0) {
 		/*
 		 * This looks like a good candidate for "page modified"
 		 * emulation...
 		 */
 		struct pv_entry *pv;
 		struct vm_page *pg;
 
 		/* Extract the physical address of the page */
 		if ((pg = PHYS_TO_VM_PAGE(pa)) == NULL) {
 			goto out;
 		}
 		/* Get the current flags for this page. */
 
 		pv = pmap_find_pv(pg, pm, va);
 		if (pv == NULL) {
 			goto out;
 		}
 
 		/*
 		 * Do the flags say this page is writable? If not then it
 		 * is a genuine write fault. If yes then the write fault is
 		 * our fault as we did not reflect the write access in the
 		 * PTE. Now that we know a write has occurred, we can correct
 		 * this and also set the modified bit.
 		 */
 		if ((pv->pv_flags & PVF_WRITE) == 0) {
 			goto out;
 		}
 
 		pg->md.pvh_attrs |= PVF_REF | PVF_MOD;
 		vm_page_dirty(pg);
 		pv->pv_flags |= PVF_REF | PVF_MOD;
 
 		/* 
 		 * Re-enable write permissions for the page.  No need to call
 		 * pmap_fix_cache(), since this is just a
 		 * modified-emulation fault, and the PVF_WRITE bit isn't
 		 * changing. We've already set the cacheable bits based on
 		 * the assumption that we can write to this page.
 		 */
 		*ptep = (pte & ~L2_TYPE_MASK) | L2_S_PROTO | L2_S_PROT_W;
 		PTE_SYNC(ptep);
 		rv = 1;
 	} else
 	if ((pte & L2_TYPE_MASK) == L2_TYPE_INV) {
 		/*
 		 * This looks like a good candidate for "page referenced"
 		 * emulation.
 		 */
 		struct pv_entry *pv;
 		struct vm_page *pg;
 
 		/* Extract the physical address of the page */
 		if ((pg = PHYS_TO_VM_PAGE(pa)) == NULL)
 			goto out;
 		/* Get the current flags for this page. */
 
 		pv = pmap_find_pv(pg, pm, va);
 		if (pv == NULL)
 			goto out;
 
 		pg->md.pvh_attrs |= PVF_REF;
 		pv->pv_flags |= PVF_REF;
 
 
 		*ptep = (pte & ~L2_TYPE_MASK) | L2_S_PROTO;
 		PTE_SYNC(ptep);
 		rv = 1;
 	}
 
 	/*
 	 * We know there is a valid mapping here, so simply
 	 * fix up the L1 if necessary.
 	 */
 	pl1pd = &pm->pm_l1->l1_kva[l1idx];
 	l1pd = l2b->l2b_phys | L1_C_DOM(pm->pm_domain) | L1_C_PROTO;
 	if (*pl1pd != l1pd) {
 		*pl1pd = l1pd;
 		PTE_SYNC(pl1pd);
 		rv = 1;
 	}
 
 #ifdef CPU_SA110
 	/*
 	 * There are bugs in the rev K SA110.  This is a check for one
 	 * of them.
 	 */
 	if (rv == 0 && curcpu()->ci_arm_cputype == CPU_ID_SA110 &&
 	    curcpu()->ci_arm_cpurev < 3) {
 		/* Always current pmap */
 		if (l2pte_valid(pte)) {
 			extern int kernel_debug;
 			if (kernel_debug & 1) {
 				struct proc *p = curlwp->l_proc;
 				printf("prefetch_abort: page is already "
 				    "mapped - pte=%p *pte=%08x\n", ptep, pte);
 				printf("prefetch_abort: pc=%08lx proc=%p "
 				    "process=%s\n", va, p, p->p_comm);
 				printf("prefetch_abort: far=%08x fs=%x\n",
 				    cpu_faultaddress(), cpu_faultstatus());
 			}
 #ifdef DDB
 			if (kernel_debug & 2)
 				Debugger();
 #endif
 			rv = 1;
 		}
 	}
 #endif /* CPU_SA110 */
 
 #ifdef DEBUG
 	/*
 	 * If 'rv == 0' at this point, it generally indicates that there is a
 	 * stale TLB entry for the faulting address. This happens when two or
 	 * more processes are sharing an L1. Since we don't flush the TLB on
 	 * a context switch between such processes, we can take domain faults
 	 * for mappings which exist at the same VA in both processes. EVEN IF
 	 * WE'VE RECENTLY FIXED UP THE CORRESPONDING L1 in pmap_enter(), for
 	 * example.
 	 *
 	 * This is extremely likely to happen if pmap_enter() updated the L1
 	 * entry for a recently entered mapping. In this case, the TLB is
 	 * flushed for the new mapping, but there may still be TLB entries for
 	 * other mappings belonging to other processes in the 1MB range
 	 * covered by the L1 entry.
 	 *
 	 * Since 'rv == 0', we know that the L1 already contains the correct
 	 * value, so the fault must be due to a stale TLB entry.
 	 *
 	 * Since we always need to flush the TLB anyway in the case where we
 	 * fixed up the L1, or frobbed the L2 PTE, we effectively deal with
 	 * stale TLB entries dynamically.
 	 *
 	 * However, the above condition can ONLY happen if the current L1 is
 	 * being shared. If it happens when the L1 is unshared, it indicates
 	 * that other parts of the pmap are not doing their job WRT managing
 	 * the TLB.
 	 */
 	if (rv == 0 && pm->pm_l1->l1_domain_use_count == 1) {
 		extern int last_fault_code;
 		printf("fixup: pm %p, va 0x%lx, ftype %d - nothing to do!\n",
 		    pm, va, ftype);
 		printf("fixup: l2 %p, l2b %p, ptep %p, pl1pd %p\n",
 		    l2, l2b, ptep, pl1pd);
 		printf("fixup: pte 0x%x, l1pd 0x%x, last code 0x%x\n",
 		    pte, l1pd, last_fault_code);
 #ifdef DDB
 		Debugger();
 #endif
 	}
 #endif
 
 	cpu_tlb_flushID_SE(va);
 	cpu_cpwait();
 
 	rv = 1;
 
 out:
 	vm_page_unlock_queues();
 	PMAP_UNLOCK(pm);
 	return (rv);
 }
 
 void
 pmap_postinit(void)
 {
 	struct l2_bucket *l2b;
 	struct l1_ttable *l1;
 	pd_entry_t *pl1pt;
 	pt_entry_t *ptep, pte;
 	vm_offset_t va, eva;
 	u_int loop, needed;
 	
 	needed = (maxproc / PMAP_DOMAINS) + ((maxproc % PMAP_DOMAINS) ? 1 : 0);
 	needed -= 1;
 	l1 = malloc(sizeof(*l1) * needed, M_VMPMAP, M_WAITOK);
 
 	for (loop = 0; loop < needed; loop++, l1++) {
 		/* Allocate a L1 page table */
 		va = (vm_offset_t)contigmalloc(L1_TABLE_SIZE, M_VMPMAP, 0, 0x0,
 		    0xffffffff, L1_TABLE_SIZE, 0);
 
 		if (va == 0)
 			panic("Cannot allocate L1 KVM");
 
 		eva = va + L1_TABLE_SIZE;
 		pl1pt = (pd_entry_t *)va;
 		
 		while (va < eva) {
 				l2b = pmap_get_l2_bucket(pmap_kernel(), va);
 				ptep = &l2b->l2b_kva[l2pte_index(va)];
 				pte = *ptep;
 				pte = (pte & ~L2_S_CACHE_MASK) | pte_l2_s_cache_mode_pt;
 				*ptep = pte;
 				PTE_SYNC(ptep);
 				cpu_tlb_flushD_SE(va);
 				
 				va += PAGE_SIZE;
 		}
 		pmap_init_l1(l1, pl1pt);
 	}
 
 
 #ifdef DEBUG
 	printf("pmap_postinit: Allocated %d static L1 descriptor tables\n",
 	    needed);
 #endif
 }
 
 /*
  * This is used to stuff certain critical values into the PCB where they
  * can be accessed quickly from cpu_switch() et al.
  */
 void
 pmap_set_pcb_pagedir(pmap_t pm, struct pcb *pcb)
 {
 	struct l2_bucket *l2b;
 
 	pcb->pcb_pagedir = pm->pm_l1->l1_physaddr;
 	pcb->pcb_dacr = (DOMAIN_CLIENT << (PMAP_DOMAIN_KERNEL * 2)) |
 	    (DOMAIN_CLIENT << (pm->pm_domain * 2));
 
 	if (vector_page < KERNBASE) {
 		pcb->pcb_pl1vec = &pm->pm_l1->l1_kva[L1_IDX(vector_page)];
 		l2b = pmap_get_l2_bucket(pm, vector_page);
 		pcb->pcb_l1vec = l2b->l2b_phys | L1_C_PROTO |
 	 	    L1_C_DOM(pm->pm_domain) | L1_C_DOM(PMAP_DOMAIN_KERNEL);
 	} else
 		pcb->pcb_pl1vec = NULL;
 }
 
 void
 pmap_activate(struct thread *td)
 {
 	pmap_t pm;
 	struct pcb *pcb;
 
 	pm = vmspace_pmap(td->td_proc->p_vmspace);
 	pcb = td->td_pcb;
 
 	critical_enter();
 	pmap_set_pcb_pagedir(pm, pcb);
 
 	if (td == curthread) {
 		u_int cur_dacr, cur_ttb;
 
 		__asm __volatile("mrc p15, 0, %0, c2, c0, 0" : "=r"(cur_ttb));
 		__asm __volatile("mrc p15, 0, %0, c3, c0, 0" : "=r"(cur_dacr));
 
 		cur_ttb &= ~(L1_TABLE_SIZE - 1);
 
 		if (cur_ttb == (u_int)pcb->pcb_pagedir &&
 		    cur_dacr == pcb->pcb_dacr) {
 			/*
 			 * No need to switch address spaces.
 			 */
 			critical_exit();
 			return;
 		}
 
 
 		/*
 		 * We MUST, I repeat, MUST fix up the L1 entry corresponding
 		 * to 'vector_page' in the incoming L1 table before switching
 		 * to it otherwise subsequent interrupts/exceptions (including
 		 * domain faults!) will jump into hyperspace.
 		 */
 		if (pcb->pcb_pl1vec) {
 
 			*pcb->pcb_pl1vec = pcb->pcb_l1vec;
 			/*
 			 * Don't need to PTE_SYNC() at this point since
 			 * cpu_setttb() is about to flush both the cache
 			 * and the TLB.
 			 */
 		}
 
 		cpu_domains(pcb->pcb_dacr);
 		cpu_setttb(pcb->pcb_pagedir);
 	}
 	critical_exit();
 }
 
 static int
 pmap_set_pt_cache_mode(pd_entry_t *kl1, vm_offset_t va)
 {
 	pd_entry_t *pdep, pde;
 	pt_entry_t *ptep, pte;
 	vm_offset_t pa;
 	int rv = 0;
 
 	/*
 	 * Make sure the descriptor itself has the correct cache mode
 	 */
 	pdep = &kl1[L1_IDX(va)];
 	pde = *pdep;
 
 	if (l1pte_section_p(pde)) {
 		if ((pde & L1_S_CACHE_MASK) != pte_l1_s_cache_mode_pt) {
 			*pdep = (pde & ~L1_S_CACHE_MASK) |
 			    pte_l1_s_cache_mode_pt;
 			PTE_SYNC(pdep);
 			cpu_dcache_wbinv_range((vm_offset_t)pdep,
 			    sizeof(*pdep));
 			cpu_l2cache_wbinv_range((vm_offset_t)pdep,
 			    sizeof(*pdep));
 			rv = 1;
 		}
 	} else {
 		pa = (vm_paddr_t)(pde & L1_C_ADDR_MASK);
 		ptep = (pt_entry_t *)kernel_pt_lookup(pa);
 		if (ptep == NULL)
 			panic("pmap_bootstrap: No L2 for L2 @ va %p\n", ptep);
 
 		ptep = &ptep[l2pte_index(va)];
 		pte = *ptep;
 		if ((pte & L2_S_CACHE_MASK) != pte_l2_s_cache_mode_pt) {
 			*ptep = (pte & ~L2_S_CACHE_MASK) |
 			    pte_l2_s_cache_mode_pt;
 			PTE_SYNC(ptep);
 			cpu_dcache_wbinv_range((vm_offset_t)ptep,
 			    sizeof(*ptep));
 			cpu_l2cache_wbinv_range((vm_offset_t)ptep,
 			    sizeof(*ptep));
 			rv = 1;
 		}
 	}
 
 	return (rv);
 }
 
 static void
 pmap_alloc_specials(vm_offset_t *availp, int pages, vm_offset_t *vap, 
     pt_entry_t **ptep)
 {
 	vm_offset_t va = *availp;
 	struct l2_bucket *l2b;
 
 	if (ptep) {
 		l2b = pmap_get_l2_bucket(pmap_kernel(), va);
 		if (l2b == NULL)
 			panic("pmap_alloc_specials: no l2b for 0x%x", va);
 
 		*ptep = &l2b->l2b_kva[l2pte_index(va)];
 	}
 
 	*vap = va;
 	*availp = va + (PAGE_SIZE * pages);
 }
 
 /*
  *	Bootstrap the system enough to run with virtual memory.
  *
  *	On the arm this is called after mapping has already been enabled
  *	and just syncs the pmap module with what has already been done.
  *	[We can't call it easily with mapping off since the kernel is not
  *	mapped with PA == VA, hence we would have to relocate every address
  *	from the linked base (virtual) address "KERNBASE" to the actual
  *	(physical) address starting relative to 0]
  */
 #define PMAP_STATIC_L2_SIZE 16
 #ifdef ARM_USE_SMALL_ALLOC
 extern struct mtx smallalloc_mtx;
 #endif
 
 void
 pmap_bootstrap(vm_offset_t firstaddr, vm_offset_t lastaddr, struct pv_addr *l1pt)
 {
 	static struct l1_ttable static_l1;
 	static struct l2_dtable static_l2[PMAP_STATIC_L2_SIZE];
 	struct l1_ttable *l1 = &static_l1;
 	struct l2_dtable *l2;
 	struct l2_bucket *l2b;
 	pd_entry_t pde;
 	pd_entry_t *kernel_l1pt = (pd_entry_t *)l1pt->pv_va;
 	pt_entry_t *ptep;
 	vm_paddr_t pa;
 	vm_offset_t va;
 	vm_size_t size;
 	int l1idx, l2idx, l2next = 0;
 
 	PDEBUG(1, printf("firstaddr = %08x, lastaddr = %08x\n",
 	    firstaddr, lastaddr));
 	
 	virtual_avail = firstaddr;
 	kernel_pmap->pm_l1 = l1;
 	kernel_l1pa = l1pt->pv_pa;
 	
 	/*
 	 * Scan the L1 translation table created by initarm() and create
 	 * the required metadata for all valid mappings found in it.
 	 */
 	for (l1idx = 0; l1idx < (L1_TABLE_SIZE / sizeof(pd_entry_t)); l1idx++) {
 		pde = kernel_l1pt[l1idx];
 
 		/*
 		 * We're only interested in Coarse mappings.
 		 * pmap_extract() can deal with section mappings without
 		 * recourse to checking L2 metadata.
 		 */
 		if ((pde & L1_TYPE_MASK) != L1_TYPE_C)
 			continue;
 
 		/*
 		 * Lookup the KVA of this L2 descriptor table
 		 */
 		pa = (vm_paddr_t)(pde & L1_C_ADDR_MASK);
 		ptep = (pt_entry_t *)kernel_pt_lookup(pa);
 		
 		if (ptep == NULL) {
 			panic("pmap_bootstrap: No L2 for va 0x%x, pa 0x%lx",
 			    (u_int)l1idx << L1_S_SHIFT, (long unsigned int)pa);
 		}
 
 		/*
 		 * Fetch the associated L2 metadata structure.
 		 * Allocate a new one if necessary.
 		 */
 		if ((l2 = kernel_pmap->pm_l2[L2_IDX(l1idx)]) == NULL) {
 			if (l2next == PMAP_STATIC_L2_SIZE)
 				panic("pmap_bootstrap: out of static L2s");
 			kernel_pmap->pm_l2[L2_IDX(l1idx)] = l2 = 
 			    &static_l2[l2next++];
 		}
 
 		/*
 		 * One more L1 slot tracked...
 		 */
 		l2->l2_occupancy++;
 
 		/*
 		 * Fill in the details of the L2 descriptor in the
 		 * appropriate bucket.
 		 */
 		l2b = &l2->l2_bucket[L2_BUCKET(l1idx)];
 		l2b->l2b_kva = ptep;
 		l2b->l2b_phys = pa;
 		l2b->l2b_l1idx = l1idx;
 
 		/*
 		 * Establish an initial occupancy count for this descriptor
 		 */
 		for (l2idx = 0;
 		    l2idx < (L2_TABLE_SIZE_REAL / sizeof(pt_entry_t));
 		    l2idx++) {
 			if ((ptep[l2idx] & L2_TYPE_MASK) != L2_TYPE_INV) {
 				l2b->l2b_occupancy++;
 			}
 		}
 
 		/*
 		 * Make sure the descriptor itself has the correct cache mode.
 		 * If not, fix it, but whine about the problem. Port-meisters
 		 * should consider this a clue to fix up their initarm()
 		 * function. :)
 		 */
 		if (pmap_set_pt_cache_mode(kernel_l1pt, (vm_offset_t)ptep)) {
 			printf("pmap_bootstrap: WARNING! wrong cache mode for "
 			    "L2 pte @ %p\n", ptep);
 		}
 	}
 
 	
 	/*
 	 * Ensure the primary (kernel) L1 has the correct cache mode for
 	 * a page table. Bitch if it is not correctly set.
 	 */
 	for (va = (vm_offset_t)kernel_l1pt;
 	    va < ((vm_offset_t)kernel_l1pt + L1_TABLE_SIZE); va += PAGE_SIZE) {
 		if (pmap_set_pt_cache_mode(kernel_l1pt, va))
 			printf("pmap_bootstrap: WARNING! wrong cache mode for "
 			    "primary L1 @ 0x%x\n", va);
 	}
 
 	cpu_dcache_wbinv_all();
 	cpu_l2cache_wbinv_all();
 	cpu_tlb_flushID();
 	cpu_cpwait();
 
 	PMAP_LOCK_INIT(kernel_pmap);
-	kernel_pmap->pm_active = -1;
+	CPU_FILL(&kernel_pmap->pm_active);
 	kernel_pmap->pm_domain = PMAP_DOMAIN_KERNEL;
 	TAILQ_INIT(&kernel_pmap->pm_pvlist);
 	
 	/*
 	 * Reserve some special page table entries/VA space for temporary
 	 * mapping of pages.
 	 */
 #define SYSMAP(c, p, v, n)						\
     v = (c)va; va += ((n)*PAGE_SIZE); p = pte; pte += (n);
     
 	pmap_alloc_specials(&virtual_avail, 1, &csrcp, &csrc_pte);
 	pmap_set_pt_cache_mode(kernel_l1pt, (vm_offset_t)csrc_pte);
 	pmap_alloc_specials(&virtual_avail, 1, &cdstp, &cdst_pte);
 	pmap_set_pt_cache_mode(kernel_l1pt, (vm_offset_t)cdst_pte);
 	size = ((lastaddr - pmap_curmaxkvaddr) + L1_S_OFFSET) / L1_S_SIZE;
 	pmap_alloc_specials(&virtual_avail,
 	    round_page(size * L2_TABLE_SIZE_REAL) / PAGE_SIZE,
 	    &pmap_kernel_l2ptp_kva, NULL);
 	
 	size = (size + (L2_BUCKET_SIZE - 1)) / L2_BUCKET_SIZE;
 	pmap_alloc_specials(&virtual_avail,
 	    round_page(size * sizeof(struct l2_dtable)) / PAGE_SIZE,
 	    &pmap_kernel_l2dtable_kva, NULL);
 
 	pmap_alloc_specials(&virtual_avail,
 	    1, (vm_offset_t*)&_tmppt, NULL);
 	pmap_alloc_specials(&virtual_avail,
 	    MAXDUMPPGS, (vm_offset_t *)&crashdumpmap, NULL);
 	SLIST_INIT(&l1_list);
 	TAILQ_INIT(&l1_lru_list);
 	mtx_init(&l1_lru_lock, "l1 list lock", NULL, MTX_DEF);
 	pmap_init_l1(l1, kernel_l1pt);
 	cpu_dcache_wbinv_all();
 	cpu_l2cache_wbinv_all();
 
 	virtual_avail = round_page(virtual_avail);
 	virtual_end = lastaddr;
 	kernel_vm_end = pmap_curmaxkvaddr;
 	arm_nocache_startaddr = lastaddr;
 	mtx_init(&cmtx, "TMP mappings mtx", NULL, MTX_DEF);
 
 #ifdef ARM_USE_SMALL_ALLOC
 	mtx_init(&smallalloc_mtx, "Small alloc page list", NULL, MTX_DEF);
 	arm_init_smallalloc();
 #endif
 	pmap_set_pcb_pagedir(kernel_pmap, thread0.td_pcb);
 }
 
 /***************************************************
  * Pmap allocation/deallocation routines.
  ***************************************************/
 
 /*
  * Release any resources held by the given physical map.
  * Called when a pmap initialized by pmap_pinit is being released.
  * Should only be called if the map contains no valid mappings.
  */
 void
 pmap_release(pmap_t pmap)
 {
 	struct pcb *pcb;
 	
 	pmap_idcache_wbinv_all(pmap);
 	cpu_l2cache_wbinv_all();
 	pmap_tlb_flushID(pmap);
 	cpu_cpwait();
 	if (vector_page < KERNBASE) {
 		struct pcb *curpcb = PCPU_GET(curpcb);
 		pcb = thread0.td_pcb;
 		if (pmap_is_current(pmap)) {
 			/*
  			 * Frob the L1 entry corresponding to the vector
 			 * page so that it contains the kernel pmap's domain
 			 * number. This will ensure pmap_remove() does not
 			 * pull the current vector page out from under us.
 			 */
 			critical_enter();
 			*pcb->pcb_pl1vec = pcb->pcb_l1vec;
 			cpu_domains(pcb->pcb_dacr);
 			cpu_setttb(pcb->pcb_pagedir);
 			critical_exit();
 		}
 		pmap_remove(pmap, vector_page, vector_page + PAGE_SIZE);
 		/*
 		 * Make sure cpu_switch(), et al, DTRT. This is safe to do
 		 * since this process has no remaining mappings of its own.
 		 */
 		curpcb->pcb_pl1vec = pcb->pcb_pl1vec;
 		curpcb->pcb_l1vec = pcb->pcb_l1vec;
 		curpcb->pcb_dacr = pcb->pcb_dacr;
 		curpcb->pcb_pagedir = pcb->pcb_pagedir;
 
 	}
 	pmap_free_l1(pmap);
 	PMAP_LOCK_DESTROY(pmap);
 	
 	dprintf("pmap_release()\n");
 }
 
 
 
 /*
  * Helper function for pmap_grow_l2_bucket()
  */
 static __inline int
 pmap_grow_map(vm_offset_t va, pt_entry_t cache_mode, vm_paddr_t *pap)
 {
 	struct l2_bucket *l2b;
 	pt_entry_t *ptep;
 	vm_paddr_t pa;
 	struct vm_page *pg;
 	
 	pg = vm_page_alloc(NULL, 0, VM_ALLOC_NOOBJ | VM_ALLOC_WIRED);
 	if (pg == NULL)
 		return (1);
 	pa = VM_PAGE_TO_PHYS(pg);
 
 	if (pap)
 		*pap = pa;
 
 	l2b = pmap_get_l2_bucket(pmap_kernel(), va);
 
 	ptep = &l2b->l2b_kva[l2pte_index(va)];
 	*ptep = L2_S_PROTO | pa | cache_mode |
 	    L2_S_PROT(PTE_KERNEL, VM_PROT_READ | VM_PROT_WRITE);
 	PTE_SYNC(ptep);
 	return (0);
 }
 
 /*
  * This is the same as pmap_alloc_l2_bucket(), except that it is only
  * used by pmap_growkernel().
  */
 static __inline struct l2_bucket *
 pmap_grow_l2_bucket(pmap_t pm, vm_offset_t va)
 {
 	struct l2_dtable *l2;
 	struct l2_bucket *l2b;
 	struct l1_ttable *l1;
 	pd_entry_t *pl1pd;
 	u_short l1idx;
 	vm_offset_t nva;
 
 	l1idx = L1_IDX(va);
 
 	if ((l2 = pm->pm_l2[L2_IDX(l1idx)]) == NULL) {
 		/*
 		 * No mapping at this address, as there is
 		 * no entry in the L1 table.
 		 * Need to allocate a new l2_dtable.
 		 */
 		nva = pmap_kernel_l2dtable_kva;
 		if ((nva & PAGE_MASK) == 0) {
 			/*
 			 * Need to allocate a backing page
 			 */
 			if (pmap_grow_map(nva, pte_l2_s_cache_mode, NULL))
 				return (NULL);
 		}
 
 		l2 = (struct l2_dtable *)nva;
 		nva += sizeof(struct l2_dtable);
 
 		if ((nva & PAGE_MASK) < (pmap_kernel_l2dtable_kva & 
 		    PAGE_MASK)) {
 			/*
 			 * The new l2_dtable straddles a page boundary.
 			 * Map in another page to cover it.
 			 */
 			if (pmap_grow_map(nva, pte_l2_s_cache_mode, NULL))
 				return (NULL);
 		}
 
 		pmap_kernel_l2dtable_kva = nva;
 
 		/*
 		 * Link it into the parent pmap
 		 */
 		pm->pm_l2[L2_IDX(l1idx)] = l2;
 		memset(l2, 0, sizeof(*l2));
 	}
 
 	l2b = &l2->l2_bucket[L2_BUCKET(l1idx)];
 
 	/*
 	 * Fetch pointer to the L2 page table associated with the address.
 	 */
 	if (l2b->l2b_kva == NULL) {
 		pt_entry_t *ptep;
 
 		/*
 		 * No L2 page table has been allocated. Chances are, this
 		 * is because we just allocated the l2_dtable, above.
 		 */
 		nva = pmap_kernel_l2ptp_kva;
 		ptep = (pt_entry_t *)nva;
 		if ((nva & PAGE_MASK) == 0) {
 			/*
 			 * Need to allocate a backing page
 			 */
 			if (pmap_grow_map(nva, pte_l2_s_cache_mode_pt,
 			    &pmap_kernel_l2ptp_phys))
 				return (NULL);
 			PTE_SYNC_RANGE(ptep, PAGE_SIZE / sizeof(pt_entry_t));
 		}
 		memset(ptep, 0, L2_TABLE_SIZE_REAL);
 		l2->l2_occupancy++;
 		l2b->l2b_kva = ptep;
 		l2b->l2b_l1idx = l1idx;
 		l2b->l2b_phys = pmap_kernel_l2ptp_phys;
 
 		pmap_kernel_l2ptp_kva += L2_TABLE_SIZE_REAL;
 		pmap_kernel_l2ptp_phys += L2_TABLE_SIZE_REAL;
 	}
 
 	/* Distribute new L1 entry to all other L1s */
 	SLIST_FOREACH(l1, &l1_list, l1_link) {
 			pl1pd = &l1->l1_kva[L1_IDX(va)];
 			*pl1pd = l2b->l2b_phys | L1_C_DOM(PMAP_DOMAIN_KERNEL) |
 			    L1_C_PROTO;
 			PTE_SYNC(pl1pd);
 	}
 
 	return (l2b);
 }
 
 
 /*
  * grow the number of kernel page table entries, if needed
  */
 void
 pmap_growkernel(vm_offset_t addr)
 {
 	pmap_t kpm = pmap_kernel();
 
 	if (addr <= pmap_curmaxkvaddr)
 		return;		/* we are OK */
 
 	/*
 	 * whoops!   we need to add kernel PTPs
 	 */
 
 	/* Map 1MB at a time */
 	for (; pmap_curmaxkvaddr < addr; pmap_curmaxkvaddr += L1_S_SIZE)
 		pmap_grow_l2_bucket(kpm, pmap_curmaxkvaddr);
 
 	/*
 	 * flush out the cache, expensive but growkernel will happen so
 	 * rarely
 	 */
 	cpu_dcache_wbinv_all();
 	cpu_l2cache_wbinv_all();
 	cpu_tlb_flushD();
 	cpu_cpwait();
 	kernel_vm_end = pmap_curmaxkvaddr;
 }
 
 
 /*
  * Remove all pages from specified address space
  * this aids process exit speeds.  Also, this code
  * is special cased for current process only, but
  * can have the more generic (and slightly slower)
  * mode enabled.  This is much faster than pmap_remove
  * in the case of running down an entire address space.
  */
 void
 pmap_remove_pages(pmap_t pmap)
 {
 	struct pv_entry *pv, *npv;
 	struct l2_bucket *l2b = NULL;
 	vm_page_t m;
 	pt_entry_t *pt;
 	
 	vm_page_lock_queues();
 	PMAP_LOCK(pmap);
 	cpu_idcache_wbinv_all();
 	cpu_l2cache_wbinv_all();
 	for (pv = TAILQ_FIRST(&pmap->pm_pvlist); pv; pv = npv) {
 		if (pv->pv_flags & PVF_WIRED || pv->pv_flags & PVF_UNMAN) {
 			/* Cannot remove wired or unmanaged pages now. */
 			npv = TAILQ_NEXT(pv, pv_plist);
 			continue;
 		}
 		pmap->pm_stats.resident_count--;
 		l2b = pmap_get_l2_bucket(pmap, pv->pv_va);
 		KASSERT(l2b != NULL, ("No L2 bucket in pmap_remove_pages"));
 		pt = &l2b->l2b_kva[l2pte_index(pv->pv_va)];
 		m = PHYS_TO_VM_PAGE(*pt & L2_ADDR_MASK);
 #ifdef ARM_USE_SMALL_ALLOC
 		KASSERT((vm_offset_t)m >= alloc_firstaddr, ("Trying to access non-existent page va %x pte %x", pv->pv_va, *pt));
 #else
 		KASSERT((vm_offset_t)m >= KERNBASE, ("Trying to access non-existent page va %x pte %x", pv->pv_va, *pt));
 #endif
 		*pt = 0;
 		PTE_SYNC(pt);
 		npv = TAILQ_NEXT(pv, pv_plist);
 		pmap_nuke_pv(m, pmap, pv);
 		if (TAILQ_EMPTY(&m->md.pv_list))
 			vm_page_flag_clear(m, PG_WRITEABLE);
 		pmap_free_pv_entry(pv);
 		pmap_free_l2_bucket(pmap, l2b, 1);
 	}
 	vm_page_unlock_queues();
 	cpu_tlb_flushID();
 	cpu_cpwait();
 	PMAP_UNLOCK(pmap);
 }
 
 
 /***************************************************
  * Low level mapping routines.....
  ***************************************************/
 
 #ifdef ARM_HAVE_SUPERSECTIONS
 /* Map a super section into the KVA. */
 
 void
 pmap_kenter_supersection(vm_offset_t va, uint64_t pa, int flags)
 {
 	pd_entry_t pd = L1_S_PROTO | L1_S_SUPERSEC | (pa & L1_SUP_FRAME) |
 	    (((pa >> 32) & 0xf) << 20) | L1_S_PROT(PTE_KERNEL,
 	    VM_PROT_READ|VM_PROT_WRITE) | L1_S_DOM(PMAP_DOMAIN_KERNEL);
 	struct l1_ttable *l1;	
 	vm_offset_t va0, va_end;
 
 	KASSERT(((va | pa) & L1_SUP_OFFSET) == 0,
 	    ("Not a valid super section mapping"));
 	if (flags & SECTION_CACHE)
 		pd |= pte_l1_s_cache_mode;
 	else if (flags & SECTION_PT)
 		pd |= pte_l1_s_cache_mode_pt;
 	va0 = va & L1_SUP_FRAME;
 	va_end = va + L1_SUP_SIZE;
 	SLIST_FOREACH(l1, &l1_list, l1_link) {
 		va = va0;
 		for (; va < va_end; va += L1_S_SIZE) {
 			l1->l1_kva[L1_IDX(va)] = pd;
 			PTE_SYNC(&l1->l1_kva[L1_IDX(va)]);
 		}
 	}
 }
 #endif
 
 /* Map a section into the KVA. */
 
 void
 pmap_kenter_section(vm_offset_t va, vm_offset_t pa, int flags)
 {
 	pd_entry_t pd = L1_S_PROTO | pa | L1_S_PROT(PTE_KERNEL,
 	    VM_PROT_READ|VM_PROT_WRITE) | L1_S_DOM(PMAP_DOMAIN_KERNEL);
 	struct l1_ttable *l1;
 
 	KASSERT(((va | pa) & L1_S_OFFSET) == 0,
 	    ("Not a valid section mapping"));
 	if (flags & SECTION_CACHE)
 		pd |= pte_l1_s_cache_mode;
 	else if (flags & SECTION_PT)
 		pd |= pte_l1_s_cache_mode_pt;
 	SLIST_FOREACH(l1, &l1_list, l1_link) {
 		l1->l1_kva[L1_IDX(va)] = pd;
 		PTE_SYNC(&l1->l1_kva[L1_IDX(va)]);
 	}
 }
 
 /*
  * Make a temporary mapping for a physical address.  This is only intended
  * to be used for panic dumps.
  */
 void *
 pmap_kenter_temp(vm_paddr_t pa, int i)
 {
 	vm_offset_t va;
 
 	va = (vm_offset_t)crashdumpmap + (i * PAGE_SIZE);
 	pmap_kenter(va, pa);
 	return ((void *)crashdumpmap);
 }
 
 /*
  * add a wired page to the kva
  * note that in order for the mapping to take effect -- you
  * should do an invltlb after doing the pmap_kenter...
  */
 static PMAP_INLINE void
 pmap_kenter_internal(vm_offset_t va, vm_offset_t pa, int flags)
 {
 	struct l2_bucket *l2b;
 	pt_entry_t *pte;
 	pt_entry_t opte;
 	struct pv_entry *pve;
 	vm_page_t m;
 
 	PDEBUG(1, printf("pmap_kenter: va = %08x, pa = %08x\n",
 	    (uint32_t) va, (uint32_t) pa));
 
 
 	l2b = pmap_get_l2_bucket(pmap_kernel(), va);
 	if (l2b == NULL)
 		l2b = pmap_grow_l2_bucket(pmap_kernel(), va);
 	KASSERT(l2b != NULL, ("No L2 Bucket"));
 	pte = &l2b->l2b_kva[l2pte_index(va)];
 	opte = *pte;
 	PDEBUG(1, printf("pmap_kenter: pte = %08x, opte = %08x, npte = %08x\n",
 	    (uint32_t) pte, opte, *pte));
 	if (l2pte_valid(opte)) {
 		pmap_kremove(va);
 	} else {
 		if (opte == 0)
 			l2b->l2b_occupancy++;
 	}
 	*pte = L2_S_PROTO | pa | L2_S_PROT(PTE_KERNEL, 
 	    VM_PROT_READ | VM_PROT_WRITE);
 	if (flags & KENTER_CACHE)
 		*pte |= pte_l2_s_cache_mode;
 	if (flags & KENTER_USER)
 		*pte |= L2_S_PROT_U;
 	PTE_SYNC(pte);
 
 		/* kernel direct mappings can be shared, so use a pv_entry
 		 * to ensure proper caching.
 		 *
 		 * The pvzone is used to delay the recording of kernel
 		 * mappings until the VM is running.
 		 * 
 		 * This expects the physical memory to have vm_page_array entry.
 		 */
 	if (pvzone != NULL && (m = vm_phys_paddr_to_vm_page(pa))) {
 		vm_page_lock_queues();
 		if (!TAILQ_EMPTY(&m->md.pv_list) || m->md.pv_kva) {
 			/* release vm_page lock for pv_entry UMA */
 			vm_page_unlock_queues();
 			if ((pve = pmap_get_pv_entry()) == NULL)
 				panic("pmap_kenter_internal: no pv entries");	
 			vm_page_lock_queues();
 			PMAP_LOCK(pmap_kernel());
 			pmap_enter_pv(m, pve, pmap_kernel(), va,
 			    PVF_WRITE | PVF_UNMAN);
 			pmap_fix_cache(m, pmap_kernel(), va);
 			PMAP_UNLOCK(pmap_kernel());
 		} else {
 			m->md.pv_kva = va;
 		}
 		vm_page_unlock_queues();
 	}
 }
 
 void
 pmap_kenter(vm_offset_t va, vm_paddr_t pa)
 {
 	pmap_kenter_internal(va, pa, KENTER_CACHE);
 }
 
 void
 pmap_kenter_nocache(vm_offset_t va, vm_paddr_t pa)
 {
 
 	pmap_kenter_internal(va, pa, 0);
 }
 
 void
 pmap_kenter_user(vm_offset_t va, vm_paddr_t pa)
 {
 
 	pmap_kenter_internal(va, pa, KENTER_CACHE|KENTER_USER);
 	/*
 	 * Call pmap_fault_fixup now, to make sure we'll have no exception
 	 * at the first use of the new address, or bad things will happen,
 	 * as we use one of these addresses in the exception handlers.
 	 */
 	pmap_fault_fixup(pmap_kernel(), va, VM_PROT_READ|VM_PROT_WRITE, 1);
 }
 
 /*
  * remove a page from the kernel pagetables
  */
 void
 pmap_kremove(vm_offset_t va)
 {
 	struct l2_bucket *l2b;
 	pt_entry_t *pte, opte;
 	struct pv_entry *pve;
 	vm_page_t m;
 	vm_offset_t pa;
 		
 	l2b = pmap_get_l2_bucket(pmap_kernel(), va);
 	if (!l2b)
 		return;
 	KASSERT(l2b != NULL, ("No L2 Bucket"));
 	pte = &l2b->l2b_kva[l2pte_index(va)];
 	opte = *pte;
 	if (l2pte_valid(opte)) {
 			/* pa = vtophys(va) taken from pmap_extract() */
 		switch (opte & L2_TYPE_MASK) {
 		case L2_TYPE_L:
 			pa = (opte & L2_L_FRAME) | (va & L2_L_OFFSET);
 			break;
 		default:
 			pa = (opte & L2_S_FRAME) | (va & L2_S_OFFSET);
 			break;
 		}
 			/* note: should never have to remove an allocation
 			 * before the pvzone is initialized.
 			 */
 		vm_page_lock_queues();
 		PMAP_LOCK(pmap_kernel());
 		if (pvzone != NULL && (m = vm_phys_paddr_to_vm_page(pa)) &&
 		    (pve = pmap_remove_pv(m, pmap_kernel(), va)))
 			pmap_free_pv_entry(pve); 
 		PMAP_UNLOCK(pmap_kernel());
 		vm_page_unlock_queues();
 		va = va & ~PAGE_MASK;
 		cpu_dcache_wbinv_range(va, PAGE_SIZE);
 		cpu_l2cache_wbinv_range(va, PAGE_SIZE);
 		cpu_tlb_flushD_SE(va);
 		cpu_cpwait();
 		*pte = 0;
 	}
 }
 
 
 /*
  *	Used to map a range of physical addresses into kernel
  *	virtual address space.
  *
  *	The value passed in '*virt' is a suggested virtual address for
  *	the mapping. Architectures which can support a direct-mapped
  *	physical to virtual region can return the appropriate address
  *	within that region, leaving '*virt' unchanged. Other
  *	architectures should map the pages starting at '*virt' and
  *	update '*virt' with the first usable address after the mapped
  *	region.
  */
 vm_offset_t
 pmap_map(vm_offset_t *virt, vm_offset_t start, vm_offset_t end, int prot)
 {
 #ifdef ARM_USE_SMALL_ALLOC
 	return (arm_ptovirt(start));
 #else
 	vm_offset_t sva = *virt;
 	vm_offset_t va = sva;
 
 	PDEBUG(1, printf("pmap_map: virt = %08x, start = %08x, end = %08x, "
 	    "prot = %d\n", (uint32_t) *virt, (uint32_t) start, (uint32_t) end,
 	    prot));
 	    
 	while (start < end) {
 		pmap_kenter(va, start);
 		va += PAGE_SIZE;
 		start += PAGE_SIZE;
 	}
 	*virt = va;
 	return (sva);
 #endif
 }
 
 static void
 pmap_wb_page(vm_page_t m)
 {
 	struct pv_entry *pv;
 
 	TAILQ_FOREACH(pv, &m->md.pv_list, pv_list)
 	    pmap_dcache_wb_range(pv->pv_pmap, pv->pv_va, PAGE_SIZE, FALSE,
 		(pv->pv_flags & PVF_WRITE) == 0);
 }
 
 static void
 pmap_inv_page(vm_page_t m)
 {
 	struct pv_entry *pv;
 
 	TAILQ_FOREACH(pv, &m->md.pv_list, pv_list)
 	    pmap_dcache_wb_range(pv->pv_pmap, pv->pv_va, PAGE_SIZE, TRUE, TRUE);
 }
 /*
  * Add a list of wired pages to the kva
  * this routine is only used for temporary
  * kernel mappings that do not need to have
  * page modification or references recorded.
  * Note that old mappings are simply written
  * over.  The page *must* be wired.
  */
 void
 pmap_qenter(vm_offset_t va, vm_page_t *m, int count)
 {
 	int i;
 
 	for (i = 0; i < count; i++) {
 		pmap_wb_page(m[i]);
 		pmap_kenter_internal(va, VM_PAGE_TO_PHYS(m[i]), 
 		    KENTER_CACHE);
 		va += PAGE_SIZE;
 	}
 }
 
 
 /*
  * this routine jerks page mappings from the
  * kernel -- it is meant only for temporary mappings.
  */
 void
 pmap_qremove(vm_offset_t va, int count)
 {
 	vm_paddr_t pa;
 	int i;
 
 	for (i = 0; i < count; i++) {
 		pa = vtophys(va);
 		if (pa) {
 			pmap_inv_page(PHYS_TO_VM_PAGE(pa));
 			pmap_kremove(va);
 		}
 		va += PAGE_SIZE;
 	}
 }
 
 
 /*
  * pmap_object_init_pt preloads the ptes for a given object
  * into the specified pmap.  This eliminates the blast of soft
  * faults on process startup and immediately after an mmap.
  */
 void
 pmap_object_init_pt(pmap_t pmap, vm_offset_t addr, vm_object_t object,
     vm_pindex_t pindex, vm_size_t size)
 {
 
 	VM_OBJECT_LOCK_ASSERT(object, MA_OWNED);
 	KASSERT(object->type == OBJT_DEVICE || object->type == OBJT_SG,
 	    ("pmap_object_init_pt: non-device object"));
 }
 
 
 /*
  *	pmap_is_prefaultable:
  *
  *	Return whether or not the specified virtual address is eligible
  *	for prefault.
  */
 boolean_t
 pmap_is_prefaultable(pmap_t pmap, vm_offset_t addr)
 {
 	pd_entry_t *pde;
 	pt_entry_t *pte;
 
 	if (!pmap_get_pde_pte(pmap, addr, &pde, &pte))
 		return (FALSE);
 	KASSERT(pte != NULL, ("Valid mapping but no pte ?"));
 	if (*pte == 0)
 		return (TRUE);
 	return (FALSE);
 }
 
 /*
  * Fetch pointers to the PDE/PTE for the given pmap/VA pair.
  * Returns TRUE if the mapping exists, else FALSE.
  *
  * NOTE: This function is only used by a couple of arm-specific modules.
  * It is not safe to take any pmap locks here, since we could be right
  * in the middle of debugging the pmap anyway...
  *
  * It is possible for this routine to return FALSE even though a valid
  * mapping does exist. This is because we don't lock, so the metadata
  * state may be inconsistent.
  *
  * NOTE: We can return a NULL *ptp in the case where the L1 pde is
  * a "section" mapping.
  */
 boolean_t
 pmap_get_pde_pte(pmap_t pm, vm_offset_t va, pd_entry_t **pdp, pt_entry_t **ptp)
 {
 	struct l2_dtable *l2;
 	pd_entry_t *pl1pd, l1pd;
 	pt_entry_t *ptep;
 	u_short l1idx;
 
 	if (pm->pm_l1 == NULL)
 		return (FALSE);
 
 	l1idx = L1_IDX(va);
 	*pdp = pl1pd = &pm->pm_l1->l1_kva[l1idx];
 	l1pd = *pl1pd;
 
 	if (l1pte_section_p(l1pd)) {
 		*ptp = NULL;
 		return (TRUE);
 	}
 
 	if (pm->pm_l2 == NULL)
 		return (FALSE);
 
 	l2 = pm->pm_l2[L2_IDX(l1idx)];
 
 	if (l2 == NULL ||
 	    (ptep = l2->l2_bucket[L2_BUCKET(l1idx)].l2b_kva) == NULL) {
 		return (FALSE);
 	}
 
 	*ptp = &ptep[l2pte_index(va)];
 	return (TRUE);
 }
 
 /*
  *      Routine:        pmap_remove_all
  *      Function:
  *              Removes this physical page from
  *              all physical maps in which it resides.
  *              Reflects back modify bits to the pager.
  *
  *      Notes:
  *              Original versions of this routine were very
  *              inefficient because they iteratively called
  *              pmap_remove (slow...)
  */
 void
 pmap_remove_all(vm_page_t m)
 {
 	pv_entry_t pv;
 	pt_entry_t *ptep;
 	struct l2_bucket *l2b;
 	boolean_t flush = FALSE;
 	pmap_t curpm;
 	int flags = 0;
 
 	KASSERT((m->flags & PG_FICTITIOUS) == 0,
 	    ("pmap_remove_all: page %p is fictitious", m));
 	if (TAILQ_EMPTY(&m->md.pv_list))
 		return;
 	vm_page_lock_queues();
 	pmap_remove_write(m);
 	curpm = vmspace_pmap(curproc->p_vmspace);
 	while ((pv = TAILQ_FIRST(&m->md.pv_list)) != NULL) {
 		if (flush == FALSE && (pv->pv_pmap == curpm ||
 		    pv->pv_pmap == pmap_kernel()))
 			flush = TRUE;
 
 		PMAP_LOCK(pv->pv_pmap);
 		/*
 		 * Cached contents were written-back in pmap_remove_write(),
 		 * but we still have to invalidate the cache entry to make
 		 * sure stale data are not retrieved when another page will be
 		 * mapped under this virtual address.
 		 */
 		if (pmap_is_current(pv->pv_pmap)) {
 			cpu_dcache_inv_range(pv->pv_va, PAGE_SIZE);
 			if (pmap_has_valid_mapping(pv->pv_pmap, pv->pv_va))
 				cpu_l2cache_inv_range(pv->pv_va, PAGE_SIZE);
 		}
 
 		if (pv->pv_flags & PVF_UNMAN) {
 			/* remove the pv entry, but do not remove the mapping
 			 * and remember this is a kernel mapped page
 			 */
 			m->md.pv_kva = pv->pv_va;
 		} else {
 			/* remove the mapping and pv entry */
 			l2b = pmap_get_l2_bucket(pv->pv_pmap, pv->pv_va);
 			KASSERT(l2b != NULL, ("No l2 bucket"));
 			ptep = &l2b->l2b_kva[l2pte_index(pv->pv_va)];
 			*ptep = 0;
 			PTE_SYNC_CURRENT(pv->pv_pmap, ptep);
 			pmap_free_l2_bucket(pv->pv_pmap, l2b, 1);
 			pv->pv_pmap->pm_stats.resident_count--;
 			flags |= pv->pv_flags;
 		}
 		pmap_nuke_pv(m, pv->pv_pmap, pv);
 		PMAP_UNLOCK(pv->pv_pmap);
 		pmap_free_pv_entry(pv);
 	}
 
 	if (flush) {
 		if (PV_BEEN_EXECD(flags))
 			pmap_tlb_flushID(curpm);
 		else
 			pmap_tlb_flushD(curpm);
 	}
 	vm_page_flag_clear(m, PG_WRITEABLE);
 	vm_page_unlock_queues();
 }
 
 
 /*
  *	Set the physical protection on the
  *	specified range of this map as requested.
  */
 void
 pmap_protect(pmap_t pm, vm_offset_t sva, vm_offset_t eva, vm_prot_t prot)
 {
 	struct l2_bucket *l2b;
 	pt_entry_t *ptep, pte;
 	vm_offset_t next_bucket;
 	u_int flags;
 	int flush;
 
 	CTR4(KTR_PMAP, "pmap_protect: pmap %p sva 0x%08x eva 0x%08x prot %x",
 	    pm, sva, eva, prot);
 
 	if ((prot & VM_PROT_READ) == 0) {
 		pmap_remove(pm, sva, eva);
 		return;
 	}
 
 	if (prot & VM_PROT_WRITE) {
 		/*
 		 * If this is a read->write transition, just ignore it and let
 		 * vm_fault() take care of it later.
 		 */
 		return;
 	}
 
 	vm_page_lock_queues();
 	PMAP_LOCK(pm);
 
 	/*
 	 * OK, at this point, we know we're doing write-protect operation.
 	 * If the pmap is active, write-back the range.
 	 */
 	pmap_dcache_wb_range(pm, sva, eva - sva, FALSE, FALSE);
 
 	flush = ((eva - sva) >= (PAGE_SIZE * 4)) ? 0 : -1;
 	flags = 0;
 
 	while (sva < eva) {
 		next_bucket = L2_NEXT_BUCKET(sva);
 		if (next_bucket > eva)
 			next_bucket = eva;
 
 		l2b = pmap_get_l2_bucket(pm, sva);
 		if (l2b == NULL) {
 			sva = next_bucket;
 			continue;
 		}
 
 		ptep = &l2b->l2b_kva[l2pte_index(sva)];
 
 		while (sva < next_bucket) {
 			if ((pte = *ptep) != 0 && (pte & L2_S_PROT_W) != 0) {
 				struct vm_page *pg;
 				u_int f;
 
 				pg = PHYS_TO_VM_PAGE(l2pte_pa(pte));
 				pte &= ~L2_S_PROT_W;
 				*ptep = pte;
 				PTE_SYNC(ptep);
 
 				if (pg != NULL) {
 					f = pmap_modify_pv(pg, pm, sva,
 					    PVF_WRITE, 0);
 					vm_page_dirty(pg);
 				} else
 					f = PVF_REF | PVF_EXEC;
 
 				if (flush >= 0) {
 					flush++;
 					flags |= f;
 				} else
 				if (PV_BEEN_EXECD(f))
 					pmap_tlb_flushID_SE(pm, sva);
 				else
 				if (PV_BEEN_REFD(f))
 					pmap_tlb_flushD_SE(pm, sva);
 			}
 
 			sva += PAGE_SIZE;
 			ptep++;
 		}
 	}
 
 
 	if (flush) {
 		if (PV_BEEN_EXECD(flags))
 			pmap_tlb_flushID(pm);
 		else
 		if (PV_BEEN_REFD(flags))
 			pmap_tlb_flushD(pm);
 	}
 	vm_page_unlock_queues();
 
  	PMAP_UNLOCK(pm);
 }
 
 
 /*
  *	Insert the given physical page (p) at
  *	the specified virtual address (v) in the
  *	target physical map with the protection requested.
  *
  *	If specified, the page will be wired down, meaning
  *	that the related pte can not be reclaimed.
  *
  *	NB:  This is the only routine which MAY NOT lazy-evaluate
  *	or lose information.  That is, this routine must actually
  *	insert this page into the given map NOW.
  */
 
 void
 pmap_enter(pmap_t pmap, vm_offset_t va, vm_prot_t access, vm_page_t m,
     vm_prot_t prot, boolean_t wired)
 {
 
 	vm_page_lock_queues();
 	PMAP_LOCK(pmap);
 	pmap_enter_locked(pmap, va, m, prot, wired, M_WAITOK);
 	vm_page_unlock_queues();
  	PMAP_UNLOCK(pmap);
 }
 
 /*
  *	The page queues and pmap must be locked.
  */
 static void
 pmap_enter_locked(pmap_t pmap, vm_offset_t va, vm_page_t m, vm_prot_t prot,
     boolean_t wired, int flags)
 {
 	struct l2_bucket *l2b = NULL;
 	struct vm_page *opg;
 	struct pv_entry *pve = NULL;
 	pt_entry_t *ptep, npte, opte;
 	u_int nflags;
 	u_int oflags;
 	vm_paddr_t pa;
 
 	PMAP_ASSERT_LOCKED(pmap);
 	mtx_assert(&vm_page_queue_mtx, MA_OWNED);
 	if (va == vector_page) {
 		pa = systempage.pv_pa;
 		m = NULL;
 	} else {
 		KASSERT((m->flags & (PG_FICTITIOUS | PG_UNMANAGED)) != 0 ||
 		    (m->oflags & VPO_BUSY) != 0 || (flags & M_NOWAIT) != 0,
 		    ("pmap_enter_locked: page %p is not busy", m));
 		pa = VM_PAGE_TO_PHYS(m);
 	}
 	nflags = 0;
 	if (prot & VM_PROT_WRITE)
 		nflags |= PVF_WRITE;
 	if (prot & VM_PROT_EXECUTE)
 		nflags |= PVF_EXEC;
 	if (wired)
 		nflags |= PVF_WIRED;
 	PDEBUG(1, printf("pmap_enter: pmap = %08x, va = %08x, m = %08x, prot = %x, "
 	    "wired = %x\n", (uint32_t) pmap, va, (uint32_t) m, prot, wired));
 	    
 	if (pmap == pmap_kernel()) {
 		l2b = pmap_get_l2_bucket(pmap, va);
 		if (l2b == NULL)
 			l2b = pmap_grow_l2_bucket(pmap, va);
 	} else {
 do_l2b_alloc:
 		l2b = pmap_alloc_l2_bucket(pmap, va);
 		if (l2b == NULL) {
 			if (flags & M_WAITOK) {
 				PMAP_UNLOCK(pmap);
 				vm_page_unlock_queues();
 				VM_WAIT;
 				vm_page_lock_queues();
 				PMAP_LOCK(pmap);
 				goto do_l2b_alloc;
 			}
 			return;
 		}
 	}
 
 	ptep = &l2b->l2b_kva[l2pte_index(va)];
 		    
 	opte = *ptep;
 	npte = pa;
 	oflags = 0;
 	if (opte) {
 		/*
 		 * There is already a mapping at this address.
 		 * If the physical address is different, lookup the
 		 * vm_page.
 		 */
 		if (l2pte_pa(opte) != pa)
 			opg = PHYS_TO_VM_PAGE(l2pte_pa(opte));
 		else
 			opg = m;
 	} else
 		opg = NULL;
 
 	if ((prot & (VM_PROT_ALL)) ||
 	    (!m || m->md.pvh_attrs & PVF_REF)) {
 		/*
 		 * - The access type indicates that we don't need
 		 *   to do referenced emulation.
 		 * OR
 		 * - The physical page has already been referenced
 		 *   so no need to re-do referenced emulation here.
 		 */
 		npte |= L2_S_PROTO;
 		
 		nflags |= PVF_REF;
 		
 		if (m && ((prot & VM_PROT_WRITE) != 0 ||
 		    (m->md.pvh_attrs & PVF_MOD))) {
 			/*
 			 * This is a writable mapping, and the
 			 * page's mod state indicates it has
 			 * already been modified. Make it
 			 * writable from the outset.
 			 */
 			nflags |= PVF_MOD;
 			if (!(m->md.pvh_attrs & PVF_MOD))
 				vm_page_dirty(m);
 		}
 		if (m && opte)
 			vm_page_flag_set(m, PG_REFERENCED);
 	} else {
 		/*
 		 * Need to do page referenced emulation.
 		 */
 		npte |= L2_TYPE_INV;
 	}
 	
 	if (prot & VM_PROT_WRITE) {
 		npte |= L2_S_PROT_W;
 		if (m != NULL &&
 		    (m->flags & (PG_FICTITIOUS | PG_UNMANAGED)) == 0)
 			vm_page_flag_set(m, PG_WRITEABLE);
 	}
 	npte |= pte_l2_s_cache_mode;
 	if (m && m == opg) {
 		/*
 		 * We're changing the attrs of an existing mapping.
 		 */
 		oflags = pmap_modify_pv(m, pmap, va,
 		    PVF_WRITE | PVF_EXEC | PVF_WIRED |
 		    PVF_MOD | PVF_REF, nflags);
 		
 		/*
 		 * We may need to flush the cache if we're
 		 * doing rw-ro...
 		 */
 		if (pmap_is_current(pmap) &&
 		    (oflags & PVF_NC) == 0 &&
 		    (opte & L2_S_PROT_W) != 0 &&
 		    (prot & VM_PROT_WRITE) == 0 &&
 		    (opte & L2_TYPE_MASK) != L2_TYPE_INV) {
 			cpu_dcache_wb_range(va, PAGE_SIZE);
 			cpu_l2cache_wb_range(va, PAGE_SIZE);
 		}
 	} else {
 		/*
 		 * New mapping, or changing the backing page
 		 * of an existing mapping.
 		 */
 		if (opg) {
 			/*
 			 * Replacing an existing mapping with a new one.
 			 * It is part of our managed memory so we
 			 * must remove it from the PV list
 			 */
 			if ((pve = pmap_remove_pv(opg, pmap, va))) {
 
 			/* note for patch: the oflags/invalidation was moved
 			 * because PG_FICTITIOUS pages could free the pve
 			 */
 			    oflags = pve->pv_flags;
 			/*
 			 * If the old mapping was valid (ref/mod
 			 * emulation creates 'invalid' mappings
 			 * initially) then make sure to frob
 			 * the cache.
 			 */
 			    if ((oflags & PVF_NC) == 0 && l2pte_valid(opte)) {
 				if (PV_BEEN_EXECD(oflags)) {
 					pmap_idcache_wbinv_range(pmap, va,
 					    PAGE_SIZE);
 				} else
 					if (PV_BEEN_REFD(oflags)) {
 						pmap_dcache_wb_range(pmap, va,
 						    PAGE_SIZE, TRUE,
 						    (oflags & PVF_WRITE) == 0);
 					}
 			    }
 
 			/* free/allocate a pv_entry for UNMANAGED pages if
 			 * this physical page is not/is already mapped.
 			 */
 
 			    if (m && ((m->flags & PG_FICTITIOUS) ||
 				((m->flags & PG_UNMANAGED) &&
 				  !m->md.pv_kva &&
 				 TAILQ_EMPTY(&m->md.pv_list)))) {
 				pmap_free_pv_entry(pve);
 				pve = NULL;
 			    }
 			} else if (m && !(m->flags & PG_FICTITIOUS) &&
 				 (!(m->flags & PG_UNMANAGED) || m->md.pv_kva ||
 				  !TAILQ_EMPTY(&m->md.pv_list)))
 				pve = pmap_get_pv_entry();
 		} else if (m && !(m->flags & PG_FICTITIOUS) &&
 			   (!(m->flags & PG_UNMANAGED) || m->md.pv_kva ||
 			   !TAILQ_EMPTY(&m->md.pv_list)))
 			pve = pmap_get_pv_entry();
 
 		if (m && !(m->flags & PG_FICTITIOUS)) {
 			KASSERT(va < kmi.clean_sva || va >= kmi.clean_eva,
 		    	("pmap_enter: managed mapping within the clean submap"));
 			if (m->flags & PG_UNMANAGED) {
 				if (!TAILQ_EMPTY(&m->md.pv_list) ||
 				     m->md.pv_kva) {
 					KASSERT(pve != NULL, ("No pv"));
 					nflags |= PVF_UNMAN;
 					pmap_enter_pv(m, pve, pmap, va, nflags);
 				} else
 					m->md.pv_kva = va;
 			} else {
 				KASSERT(pve != NULL, ("No pv"));
 				pmap_enter_pv(m, pve, pmap, va, nflags);
 			}
 		}
 	}
 	/*
 	 * Make sure userland mappings get the right permissions
 	 */
 	if (pmap != pmap_kernel() && va != vector_page) {
 		npte |= L2_S_PROT_U;
 	}
 
 	/*
 	 * Keep the stats up to date
 	 */
 	if (opte == 0) {
 		l2b->l2b_occupancy++;
 		pmap->pm_stats.resident_count++;
 	} 
 
 
 	/*
 	 * If this is just a wiring change, the two PTEs will be
 	 * identical, so there's no need to update the page table.
 	 */
 	if (npte != opte) {
 		boolean_t is_cached = pmap_is_current(pmap);
 
 		*ptep = npte;
 		if (is_cached) {
 			/*
 			 * We only need to frob the cache/tlb if this pmap
 			 * is current
 			 */
 			PTE_SYNC(ptep);
 			if (L1_IDX(va) != L1_IDX(vector_page) && 
 			    l2pte_valid(npte)) {
 				/*
 				 * This mapping is likely to be accessed as
 				 * soon as we return to userland. Fix up the
 				 * L1 entry to avoid taking another
 				 * page/domain fault.
 				 */
 				pd_entry_t *pl1pd, l1pd;
 
 				pl1pd = &pmap->pm_l1->l1_kva[L1_IDX(va)];
 				l1pd = l2b->l2b_phys | L1_C_DOM(pmap->pm_domain) |
 				    L1_C_PROTO;
 				if (*pl1pd != l1pd) {
 					*pl1pd = l1pd;
 					PTE_SYNC(pl1pd);
 				}
 			}
 		}
 
 		if (PV_BEEN_EXECD(oflags))
 			pmap_tlb_flushID_SE(pmap, va);
 		else if (PV_BEEN_REFD(oflags))
 			pmap_tlb_flushD_SE(pmap, va);
 
 
 		if (m)
 			pmap_fix_cache(m, pmap, va);
 	}
 }
 
 /*
  * Maps a sequence of resident pages belonging to the same object.
  * The sequence begins with the given page m_start.  This page is
  * mapped at the given virtual address start.  Each subsequent page is
  * mapped at a virtual address that is offset from start by the same
  * amount as the page is offset from m_start within the object.  The
  * last page in the sequence is the page with the largest offset from
  * m_start that can be mapped at a virtual address less than the given
  * virtual address end.  Not every virtual page between start and end
  * is mapped; only those for which a resident page exists with the
  * corresponding offset from m_start are mapped.
  */
 void
 pmap_enter_object(pmap_t pmap, vm_offset_t start, vm_offset_t end,
     vm_page_t m_start, vm_prot_t prot)
 {
 	vm_page_t m;
 	vm_pindex_t diff, psize;
 
 	psize = atop(end - start);
 	m = m_start;
 	vm_page_lock_queues();
 	PMAP_LOCK(pmap);
 	while (m != NULL && (diff = m->pindex - m_start->pindex) < psize) {
 		pmap_enter_locked(pmap, start + ptoa(diff), m, prot &
 		    (VM_PROT_READ | VM_PROT_EXECUTE), FALSE, M_NOWAIT);
 		m = TAILQ_NEXT(m, listq);
 	}
 	vm_page_unlock_queues();
  	PMAP_UNLOCK(pmap);
 }
 
 /*
  * this code makes some *MAJOR* assumptions:
  * 1. Current pmap & pmap exists.
  * 2. Not wired.
  * 3. Read access.
  * 4. No page table pages.
  * but is *MUCH* faster than pmap_enter...
  */
 
 void
 pmap_enter_quick(pmap_t pmap, vm_offset_t va, vm_page_t m, vm_prot_t prot)
 {
 
 	vm_page_lock_queues();
  	PMAP_LOCK(pmap);
 	pmap_enter_locked(pmap, va, m, prot & (VM_PROT_READ | VM_PROT_EXECUTE),
 	    FALSE, M_NOWAIT);
 	vm_page_unlock_queues();
  	PMAP_UNLOCK(pmap);
 }
 
 /*
  *	Routine:	pmap_change_wiring
  *	Function:	Change the wiring attribute for a map/virtual-address
  *			pair.
  *	In/out conditions:
  *			The mapping must already exist in the pmap.
  */
 void
 pmap_change_wiring(pmap_t pmap, vm_offset_t va, boolean_t wired)
 {
 	struct l2_bucket *l2b;
 	pt_entry_t *ptep, pte;
 	vm_page_t pg;
 
 	vm_page_lock_queues();
  	PMAP_LOCK(pmap);
 	l2b = pmap_get_l2_bucket(pmap, va);
 	KASSERT(l2b, ("No l2b bucket in pmap_change_wiring"));
 	ptep = &l2b->l2b_kva[l2pte_index(va)];
 	pte = *ptep;
 	pg = PHYS_TO_VM_PAGE(l2pte_pa(pte));
 	if (pg) 
 		pmap_modify_pv(pg, pmap, va, PVF_WIRED, wired ? PVF_WIRED : 0);
 	vm_page_unlock_queues();
  	PMAP_UNLOCK(pmap);
 }
 
 
 /*
  *	Copy the range specified by src_addr/len
  *	from the source map to the range dst_addr/len
  *	in the destination map.
  *
  *	This routine is only advisory and need not do anything.
  */
 void
 pmap_copy(pmap_t dst_pmap, pmap_t src_pmap, vm_offset_t dst_addr,
     vm_size_t len, vm_offset_t src_addr)
 {
 }
 
 
 /*
  *	Routine:	pmap_extract
  *	Function:
  *		Extract the physical page address associated
  *		with the given map/virtual_address pair.
  */
 vm_paddr_t
 pmap_extract(pmap_t pm, vm_offset_t va)
 {
 	struct l2_dtable *l2;
 	pd_entry_t l1pd;
 	pt_entry_t *ptep, pte;
 	vm_paddr_t pa;
 	u_int l1idx;
 	l1idx = L1_IDX(va);
 
 	PMAP_LOCK(pm);
 	l1pd = pm->pm_l1->l1_kva[l1idx];
 	if (l1pte_section_p(l1pd)) {
 		/*
 		 * These should only happen for pmap_kernel()
 		 */
 		KASSERT(pm == pmap_kernel(), ("huh"));
 		/* XXX: what to do about the bits > 32 ? */
 		if (l1pd & L1_S_SUPERSEC) 
 			pa = (l1pd & L1_SUP_FRAME) | (va & L1_SUP_OFFSET);
 		else
 			pa = (l1pd & L1_S_FRAME) | (va & L1_S_OFFSET);
 	} else {
 		/*
 		 * Note that we can't rely on the validity of the L1
 		 * descriptor as an indication that a mapping exists.
 		 * We have to look it up in the L2 dtable.
 		 */
 		l2 = pm->pm_l2[L2_IDX(l1idx)];
 
 		if (l2 == NULL ||
 		    (ptep = l2->l2_bucket[L2_BUCKET(l1idx)].l2b_kva) == NULL) {
 			PMAP_UNLOCK(pm);
 			return (0);
 		}
 
 		ptep = &ptep[l2pte_index(va)];
 		pte = *ptep;
 
 		if (pte == 0) {
 			PMAP_UNLOCK(pm);
 			return (0);
 		}
 
 		switch (pte & L2_TYPE_MASK) {
 		case L2_TYPE_L:
 			pa = (pte & L2_L_FRAME) | (va & L2_L_OFFSET);
 			break;
 
 		default:
 			pa = (pte & L2_S_FRAME) | (va & L2_S_OFFSET);
 			break;
 		}
 	}
 
 	PMAP_UNLOCK(pm);
 	return (pa);
 }
 
 /*
  * Atomically extract and hold the physical page with the given
  * pmap and virtual address pair if that mapping permits the given
  * protection.
  *
  */
 vm_page_t
 pmap_extract_and_hold(pmap_t pmap, vm_offset_t va, vm_prot_t prot)
 {
 	struct l2_dtable *l2;
 	pd_entry_t l1pd;
 	pt_entry_t *ptep, pte;
 	vm_paddr_t pa, paddr;
 	vm_page_t m = NULL;
 	u_int l1idx;
 	l1idx = L1_IDX(va);
 	paddr = 0;
 
  	PMAP_LOCK(pmap);
 retry:
 	l1pd = pmap->pm_l1->l1_kva[l1idx];
 	if (l1pte_section_p(l1pd)) {
 		/*
 		 * These should only happen for pmap_kernel()
 		 */
 		KASSERT(pmap == pmap_kernel(), ("huh"));
 		/* XXX: what to do about the bits > 32 ? */
 		if (l1pd & L1_S_SUPERSEC) 
 			pa = (l1pd & L1_SUP_FRAME) | (va & L1_SUP_OFFSET);
 		else
 			pa = (l1pd & L1_S_FRAME) | (va & L1_S_OFFSET);
 		if (vm_page_pa_tryrelock(pmap, pa & PG_FRAME, &paddr))
 			goto retry;
 		if (l1pd & L1_S_PROT_W || (prot & VM_PROT_WRITE) == 0) {
 			m = PHYS_TO_VM_PAGE(pa);
 			vm_page_hold(m);
 		}
 			
 	} else {
 		/*
 		 * Note that we can't rely on the validity of the L1
 		 * descriptor as an indication that a mapping exists.
 		 * We have to look it up in the L2 dtable.
 		 */
 		l2 = pmap->pm_l2[L2_IDX(l1idx)];
 
 		if (l2 == NULL ||
 		    (ptep = l2->l2_bucket[L2_BUCKET(l1idx)].l2b_kva) == NULL) {
 		 	PMAP_UNLOCK(pmap);
 			return (NULL);
 		}
 
 		ptep = &ptep[l2pte_index(va)];
 		pte = *ptep;
 
 		if (pte == 0) {
 		 	PMAP_UNLOCK(pmap);
 			return (NULL);
 		}
 		if (pte & L2_S_PROT_W || (prot & VM_PROT_WRITE) == 0) {
 			switch (pte & L2_TYPE_MASK) {
 			case L2_TYPE_L:
 				pa = (pte & L2_L_FRAME) | (va & L2_L_OFFSET);
 				break;
 				
 			default:
 				pa = (pte & L2_S_FRAME) | (va & L2_S_OFFSET);
 				break;
 			}
 			if (vm_page_pa_tryrelock(pmap, pa & PG_FRAME, &paddr))
 				goto retry;		
 			m = PHYS_TO_VM_PAGE(pa);
 			vm_page_hold(m);
 		}
 	}
 
  	PMAP_UNLOCK(pmap);
 	PA_UNLOCK_COND(paddr);
 	return (m);
 }
 
 /*
  * Initialize a preallocated and zeroed pmap structure,
  * such as one in a vmspace structure.
  */
 
 int
 pmap_pinit(pmap_t pmap)
 {
 	PDEBUG(1, printf("pmap_pinit: pmap = %08x\n", (uint32_t) pmap));
 	
 	PMAP_LOCK_INIT(pmap);
 	pmap_alloc_l1(pmap);
 	bzero(pmap->pm_l2, sizeof(pmap->pm_l2));
 
-	pmap->pm_active = 0;
+	CPU_ZERO(&pmap->pm_active);
 		
 	TAILQ_INIT(&pmap->pm_pvlist);
 	bzero(&pmap->pm_stats, sizeof pmap->pm_stats);
 	pmap->pm_stats.resident_count = 1;
 	if (vector_page < KERNBASE) {
 		pmap_enter(pmap, vector_page,
 		    VM_PROT_READ, PHYS_TO_VM_PAGE(systempage.pv_pa),
 		    VM_PROT_READ, 1);
 	} 
 	return (1);
 }
 
 
 /***************************************************
  * page management routines.
  ***************************************************/
 
 
 static void
 pmap_free_pv_entry(pv_entry_t pv)
 {
 	pv_entry_count--;
 	uma_zfree(pvzone, pv);
 }
 
 
 /*
  * get a new pv_entry, allocating a block from the system
  * when needed.
  * the memory allocation is performed bypassing the malloc code
  * because of the possibility of allocations at interrupt time.
  */
 static pv_entry_t
 pmap_get_pv_entry(void)
 {
 	pv_entry_t ret_value;
 	
 	pv_entry_count++;
 	if (pv_entry_count > pv_entry_high_water)
 		pagedaemon_wakeup();
 	ret_value = uma_zalloc(pvzone, M_NOWAIT);
 	return ret_value;
 }
 
 /*
  *	Remove the given range of addresses from the specified map.
  *
  *	It is assumed that the start and end are properly
  *	rounded to the page size.
  */
 #define	PMAP_REMOVE_CLEAN_LIST_SIZE	3
 void
 pmap_remove(pmap_t pm, vm_offset_t sva, vm_offset_t eva)
 {
 	struct l2_bucket *l2b;
 	vm_offset_t next_bucket;
 	pt_entry_t *ptep;
 	u_int total;
 	u_int mappings, is_exec, is_refd;
 	int flushall = 0;
 
 
 	/*
 	 * we lock in the pmap => pv_head direction
 	 */
 
 	vm_page_lock_queues();
 	PMAP_LOCK(pm);
 	total = 0;
 	while (sva < eva) {
 		/*
 		 * Do one L2 bucket's worth at a time.
 		 */
 		next_bucket = L2_NEXT_BUCKET(sva);
 		if (next_bucket > eva)
 			next_bucket = eva;
 
 		l2b = pmap_get_l2_bucket(pm, sva);
 		if (l2b == NULL) {
 			sva = next_bucket;
 			continue;
 		}
 
 		ptep = &l2b->l2b_kva[l2pte_index(sva)];
 		mappings = 0;
 
 		while (sva < next_bucket) {
 			struct vm_page *pg;
 			pt_entry_t pte;
 			vm_paddr_t pa;
 
 			pte = *ptep;
 
 			if (pte == 0) {
 				/*
 				 * Nothing here, move along
 				 */
 				sva += PAGE_SIZE;
 				ptep++;
 				continue;
 			}
 
 			pm->pm_stats.resident_count--;
 			pa = l2pte_pa(pte);
 			is_exec = 0;
 			is_refd = 1;
 
 			/*
 			 * Update flags. In a number of circumstances,
 			 * we could cluster a lot of these and do a
 			 * number of sequential pages in one go.
 			 */
 			if ((pg = PHYS_TO_VM_PAGE(pa)) != NULL) {
 				struct pv_entry *pve;
 
 				pve = pmap_remove_pv(pg, pm, sva);
 				if (pve) {
 					is_exec = PV_BEEN_EXECD(pve->pv_flags);
 					is_refd = PV_BEEN_REFD(pve->pv_flags);
 					pmap_free_pv_entry(pve);
 				}
 			}
 
 			if (l2pte_valid(pte) && pmap_is_current(pm)) {
 				if (total < PMAP_REMOVE_CLEAN_LIST_SIZE) {
 					total++;
 			   		if (is_exec) {
         					cpu_idcache_wbinv_range(sva,
 						    PAGE_SIZE);
 						cpu_l2cache_wbinv_range(sva,
 						    PAGE_SIZE);
 						cpu_tlb_flushID_SE(sva);
 			   		} else if (is_refd) {
 						cpu_dcache_wbinv_range(sva,
 						    PAGE_SIZE);
 						cpu_l2cache_wbinv_range(sva,
 						    PAGE_SIZE);
 						cpu_tlb_flushD_SE(sva);
 					}
 				} else if (total == PMAP_REMOVE_CLEAN_LIST_SIZE) {
 					/* flushall will also only get set
 					 * for a current pmap
 					 */
 					cpu_idcache_wbinv_all();
 					cpu_l2cache_wbinv_all();
 					flushall = 1;
 					total++;
 				}
 			}
 			*ptep = 0;
 			PTE_SYNC(ptep);
 
 			sva += PAGE_SIZE;
 			ptep++;
 			mappings++;
 		}
 
 		pmap_free_l2_bucket(pm, l2b, mappings);
 	}
 
 	vm_page_unlock_queues();
 	if (flushall)
 		cpu_tlb_flushID();
  	PMAP_UNLOCK(pm);
 }
 
 /*
  * pmap_zero_page()
  * 
  * Zero a given physical page by mapping it at a page hook point.
  * In doing the zero page op, the page we zero is mapped cachable, as with
  * StrongARM accesses to non-cached pages are non-burst making writing
  * _any_ bulk data very slow.
  */
 #if (ARM_MMU_GENERIC + ARM_MMU_SA1) != 0 || defined(CPU_XSCALE_CORE3)
 void
 pmap_zero_page_generic(vm_paddr_t phys, int off, int size)
 {
 #ifdef ARM_USE_SMALL_ALLOC
 	char *dstpg;
 #endif
 
 #ifdef DEBUG
 	struct vm_page *pg = PHYS_TO_VM_PAGE(phys);
 
 	if (pg->md.pvh_list != NULL)
 		panic("pmap_zero_page: page has mappings");
 #endif
 
 	if (_arm_bzero && size >= _min_bzero_size &&
 	    _arm_bzero((void *)(phys + off), size, IS_PHYSICAL) == 0)
 		return;
 
 #ifdef ARM_USE_SMALL_ALLOC
 	dstpg = (char *)arm_ptovirt(phys);
 	if (off || size != PAGE_SIZE) {
 		bzero(dstpg + off, size);
 		cpu_dcache_wbinv_range((vm_offset_t)(dstpg + off), size);
 		cpu_l2cache_wbinv_range((vm_offset_t)(dstpg + off), size);
 	} else {
 		bzero_page((vm_offset_t)dstpg);
 		cpu_dcache_wbinv_range((vm_offset_t)dstpg, PAGE_SIZE);
 		cpu_l2cache_wbinv_range((vm_offset_t)dstpg, PAGE_SIZE);
 	}
 #else
 
 	mtx_lock(&cmtx);
 	/*
 	 * Hook in the page, zero it, invalidate the TLB as needed.
 	 *
 	 * Note the temporary zero-page mapping must be a non-cached page in
 	 * order to work without corruption when write-allocate is enabled.
 	 */
 	*cdst_pte = L2_S_PROTO | phys | L2_S_PROT(PTE_KERNEL, VM_PROT_WRITE);
 	cpu_tlb_flushD_SE(cdstp);
 	cpu_cpwait();
 	if (off || size != PAGE_SIZE)
 		bzero((void *)(cdstp + off), size);
 	else
 		bzero_page(cdstp);
 
 	mtx_unlock(&cmtx);
 #endif
 }
 #endif /* (ARM_MMU_GENERIC + ARM_MMU_SA1) != 0 */
 
 #if ARM_MMU_XSCALE == 1
 void
 pmap_zero_page_xscale(vm_paddr_t phys, int off, int size)
 {
 #ifdef ARM_USE_SMALL_ALLOC
 	char *dstpg;
 #endif
 
 	if (_arm_bzero && size >= _min_bzero_size &&
 	    _arm_bzero((void *)(phys + off), size, IS_PHYSICAL) == 0)
 		return;
 #ifdef ARM_USE_SMALL_ALLOC
 	dstpg = (char *)arm_ptovirt(phys);
 	if (off || size != PAGE_SIZE) {
 		bzero(dstpg + off, size);
 		cpu_dcache_wbinv_range((vm_offset_t)(dstpg + off), size);
 	} else {
 		bzero_page((vm_offset_t)dstpg);
 		cpu_dcache_wbinv_range((vm_offset_t)dstpg, PAGE_SIZE);
 	}
 #else
 	mtx_lock(&cmtx);
 	/*
 	 * Hook in the page, zero it, and purge the cache for that
 	 * zeroed page. Invalidate the TLB as needed.
 	 */
 	*cdst_pte = L2_S_PROTO | phys |
 	    L2_S_PROT(PTE_KERNEL, VM_PROT_WRITE) |
 	    L2_C | L2_XSCALE_T_TEX(TEX_XSCALE_X);	/* mini-data */
 	PTE_SYNC(cdst_pte);
 	cpu_tlb_flushD_SE(cdstp);
 	cpu_cpwait();
 	if (off || size != PAGE_SIZE)
 		bzero((void *)(cdstp + off), size);
 	else
 		bzero_page(cdstp);
 	mtx_unlock(&cmtx);
 	xscale_cache_clean_minidata();
 #endif
 }
 
 /*
  * Change the PTEs for the specified kernel mappings such that they
  * will use the mini data cache instead of the main data cache.
  */
 void
 pmap_use_minicache(vm_offset_t va, vm_size_t size)
 {
 	struct l2_bucket *l2b;
 	pt_entry_t *ptep, *sptep, pte;
 	vm_offset_t next_bucket, eva;
 
 #if (ARM_NMMUS > 1) || defined(CPU_XSCALE_CORE3)
 	if (xscale_use_minidata == 0)
 		return;
 #endif
 
 	eva = va + size;
 
 	while (va < eva) {
 		next_bucket = L2_NEXT_BUCKET(va);
 		if (next_bucket > eva)
 			next_bucket = eva;
 
 		l2b = pmap_get_l2_bucket(pmap_kernel(), va);
 
 		sptep = ptep = &l2b->l2b_kva[l2pte_index(va)];
 
 		while (va < next_bucket) {
 			pte = *ptep;
 			if (!l2pte_minidata(pte)) {
 				cpu_dcache_wbinv_range(va, PAGE_SIZE);
 				cpu_tlb_flushD_SE(va);
 				*ptep = pte & ~L2_B;
 			}
 			ptep++;
 			va += PAGE_SIZE;
 		}
 		PTE_SYNC_RANGE(sptep, (u_int)(ptep - sptep));
 	}
 	cpu_cpwait();
 }
 #endif /* ARM_MMU_XSCALE == 1 */
 
 /*
  *	pmap_zero_page zeros the specified hardware page by mapping 
  *	the page into KVM and using bzero to clear its contents.
  */
 void
 pmap_zero_page(vm_page_t m)
 {
 	pmap_zero_page_func(VM_PAGE_TO_PHYS(m), 0, PAGE_SIZE);
 }
 
 
 /*
  *	pmap_zero_page_area zeros part of the specified hardware page by
  *	mapping the page into KVM and using bzero to clear that region.
  *
  *	off and size may not cover an area beyond a single hardware page.
  */
 void
 pmap_zero_page_area(vm_page_t m, int off, int size)
 {
 
 	pmap_zero_page_func(VM_PAGE_TO_PHYS(m), off, size);
 }
 
 
 /*
  *	pmap_zero_page_idle zeros the specified hardware page by mapping 
  *	the page into KVM and using bzero to clear its contents.  This
  *	is intended to be called from the vm_pagezero process only and
  *	outside of Giant.
  */
 void
 pmap_zero_page_idle(vm_page_t m)
 {
 
 	pmap_zero_page(m);
 }
 
 #if 0
 /*
  * pmap_clean_page()
  *
  * This is a local function used to work out the best strategy to clean
  * a single page referenced by its entry in the PV table. It should be used by
  * pmap_copy_page, pmap_zero_page and maybe some others later on.
  *
  * Its policy is effectively:
  *  o If there are no mappings, we don't bother doing anything with the cache.
  *  o If there is one mapping, we clean just that page.
  *  o If there are multiple mappings, we clean the entire cache.
  *
  * So that some functions can be further optimised, it returns 0 if it didn't
  * clean the entire cache, or 1 if it did.
  *
  * XXX One bug in this routine is that if the pv_entry has a single page
  * mapped at 0x00000000, a whole cache clean will be performed rather than
  * just the 1 page. This should not occur in everyday use, and if it does
  * it will only result in a less efficient clean for the page.
  *
  * We don't yet use this function but may want to.
  */
 static int
 pmap_clean_page(struct pv_entry *pv, boolean_t is_src)
 {
 	pmap_t pm, pm_to_clean = NULL;
 	struct pv_entry *npv;
 	u_int cache_needs_cleaning = 0;
 	u_int flags = 0;
 	vm_offset_t page_to_clean = 0;
 
 	if (pv == NULL) {
 		/* nothing mapped in so nothing to flush */
 		return (0);
 	}
 
 	/*
 	 * Since we flush the cache each time we change to a different
 	 * user vmspace, we only need to flush the page if it is in the
 	 * current pmap.
 	 */
 	if (curthread)
 		pm = vmspace_pmap(curproc->p_vmspace);
 	else
 		pm = pmap_kernel();
 
 	for (npv = pv; npv; npv = TAILQ_NEXT(npv, pv_list)) {
 		if (npv->pv_pmap == pmap_kernel() || npv->pv_pmap == pm) {
 			flags |= npv->pv_flags;
 			/*
 			 * The page is mapped non-cacheable in 
 			 * this map.  No need to flush the cache.
 			 */
 			if (npv->pv_flags & PVF_NC) {
 #ifdef DIAGNOSTIC
 				if (cache_needs_cleaning)
 					panic("pmap_clean_page: "
 					    "cache inconsistency");
 #endif
 				break;
 			} else if (is_src && (npv->pv_flags & PVF_WRITE) == 0)
 				continue;
 			if (cache_needs_cleaning) {
 				page_to_clean = 0;
 				break;
 			} else {
 				page_to_clean = npv->pv_va;
 				pm_to_clean = npv->pv_pmap;
 			}
 			cache_needs_cleaning = 1;
 		}
 	}
 	if (page_to_clean) {
 		if (PV_BEEN_EXECD(flags))
 			pmap_idcache_wbinv_range(pm_to_clean, page_to_clean,
 			    PAGE_SIZE);
 		else
 			pmap_dcache_wb_range(pm_to_clean, page_to_clean,
 			    PAGE_SIZE, !is_src, (flags & PVF_WRITE) == 0);
 	} else if (cache_needs_cleaning) {
 		if (PV_BEEN_EXECD(flags))
 			pmap_idcache_wbinv_all(pm);
 		else
 			pmap_dcache_wbinv_all(pm);
 		return (1);
 	}
 	return (0);
 }
 #endif
 
 /*
  *	pmap_copy_page copies the specified (machine independent)
  *	page by mapping the page into virtual memory and using
  *	bcopy to copy the page, one machine dependent page at a
  *	time.
  */
 
 /*
  * pmap_copy_page()
  *
  * Copy one physical page into another, by mapping the pages into
  * hook points. The same comment regarding cacheability as in
  * pmap_zero_page also applies here.
  */
 #if  (ARM_MMU_GENERIC + ARM_MMU_SA1) != 0 || defined (CPU_XSCALE_CORE3)
 void
 pmap_copy_page_generic(vm_paddr_t src, vm_paddr_t dst)
 {
 #if 0
 	struct vm_page *src_pg = PHYS_TO_VM_PAGE(src);
 #endif
 #ifdef DEBUG
 	struct vm_page *dst_pg = PHYS_TO_VM_PAGE(dst);
 
 	if (dst_pg->md.pvh_list != NULL)
 		panic("pmap_copy_page: dst page has mappings");
 #endif
 
 
 	/*
 	 * Clean the source page.  Hold the source page's lock for
 	 * the duration of the copy so that no other mappings can
 	 * be created while we have a potentially aliased mapping.
 	 */
 #if 0
 	/*
 	 * XXX: Not needed while we call cpu_dcache_wbinv_all() in
 	 * pmap_copy_page().
 	 */
 	(void) pmap_clean_page(TAILQ_FIRST(&src_pg->md.pv_list), TRUE);
 #endif
 	/*
 	 * Map the pages into the page hook points, copy them, and purge
 	 * the cache for the appropriate page. Invalidate the TLB
 	 * as required.
 	 */
 	mtx_lock(&cmtx);
 	*csrc_pte = L2_S_PROTO | src |
 	    L2_S_PROT(PTE_KERNEL, VM_PROT_READ) | pte_l2_s_cache_mode;
 	PTE_SYNC(csrc_pte);
 	*cdst_pte = L2_S_PROTO | dst |
 	    L2_S_PROT(PTE_KERNEL, VM_PROT_WRITE) | pte_l2_s_cache_mode;
 	PTE_SYNC(cdst_pte);
 	cpu_tlb_flushD_SE(csrcp);
 	cpu_tlb_flushD_SE(cdstp);
 	cpu_cpwait();
 	bcopy_page(csrcp, cdstp);
 	mtx_unlock(&cmtx);
 	cpu_dcache_inv_range(csrcp, PAGE_SIZE);
 	cpu_dcache_wbinv_range(cdstp, PAGE_SIZE);
 	cpu_l2cache_inv_range(csrcp, PAGE_SIZE);
 	cpu_l2cache_wbinv_range(cdstp, PAGE_SIZE);
 }
 #endif /* (ARM_MMU_GENERIC + ARM_MMU_SA1) != 0 */
 
 #if ARM_MMU_XSCALE == 1
 void
 pmap_copy_page_xscale(vm_paddr_t src, vm_paddr_t dst)
 {
 #if 0
 	/* XXX: Only needed for pmap_clean_page(), which is commented out. */
 	struct vm_page *src_pg = PHYS_TO_VM_PAGE(src);
 #endif
 #ifdef DEBUG
 	struct vm_page *dst_pg = PHYS_TO_VM_PAGE(dst);
 
 	if (dst_pg->md.pvh_list != NULL)
 		panic("pmap_copy_page: dst page has mappings");
 #endif
 
 
 	/*
 	 * Clean the source page.  Hold the source page's lock for
 	 * the duration of the copy so that no other mappings can
 	 * be created while we have a potentially aliased mapping.
 	 */
 #if 0
 	/*
 	 * XXX: Not needed while we call cpu_dcache_wbinv_all() in
 	 * pmap_copy_page().
 	 */
 	(void) pmap_clean_page(TAILQ_FIRST(&src_pg->md.pv_list), TRUE);
 #endif
 	/*
 	 * Map the pages into the page hook points, copy them, and purge
 	 * the cache for the appropriate page. Invalidate the TLB
 	 * as required.
 	 */
 	mtx_lock(&cmtx);
 	*csrc_pte = L2_S_PROTO | src |
 	    L2_S_PROT(PTE_KERNEL, VM_PROT_READ) |
 	    L2_C | L2_XSCALE_T_TEX(TEX_XSCALE_X);	/* mini-data */
 	PTE_SYNC(csrc_pte);
 	*cdst_pte = L2_S_PROTO | dst |
 	    L2_S_PROT(PTE_KERNEL, VM_PROT_WRITE) |
 	    L2_C | L2_XSCALE_T_TEX(TEX_XSCALE_X);	/* mini-data */
 	PTE_SYNC(cdst_pte);
 	cpu_tlb_flushD_SE(csrcp);
 	cpu_tlb_flushD_SE(cdstp);
 	cpu_cpwait();
 	bcopy_page(csrcp, cdstp);
 	mtx_unlock(&cmtx);
 	xscale_cache_clean_minidata();
 }
 #endif /* ARM_MMU_XSCALE == 1 */
 
 void
 pmap_copy_page(vm_page_t src, vm_page_t dst)
 {
 #ifdef ARM_USE_SMALL_ALLOC
 	vm_offset_t srcpg, dstpg;
 #endif
 
 	cpu_dcache_wbinv_all();
 	cpu_l2cache_wbinv_all();
 	if (_arm_memcpy && PAGE_SIZE >= _min_memcpy_size &&
 	    _arm_memcpy((void *)VM_PAGE_TO_PHYS(dst), 
 	    (void *)VM_PAGE_TO_PHYS(src), PAGE_SIZE, IS_PHYSICAL) == 0)
 		return;
 #ifdef ARM_USE_SMALL_ALLOC
 	srcpg = arm_ptovirt(VM_PAGE_TO_PHYS(src));
 	dstpg = arm_ptovirt(VM_PAGE_TO_PHYS(dst));
 	bcopy_page(srcpg, dstpg);
 	cpu_dcache_wbinv_range(dstpg, PAGE_SIZE);
 	cpu_l2cache_wbinv_range(dstpg, PAGE_SIZE);
 #else
 	pmap_copy_page_func(VM_PAGE_TO_PHYS(src), VM_PAGE_TO_PHYS(dst));
 #endif
 }
 
 
 
 
 /*
  * this routine returns true if a physical page resides
  * in the given pmap.
  */
 boolean_t
 pmap_page_exists_quick(pmap_t pmap, vm_page_t m)
 {
 	pv_entry_t pv;
 	int loops = 0;
 	boolean_t rv;
 	
 	KASSERT((m->flags & (PG_FICTITIOUS | PG_UNMANAGED)) == 0,
 	    ("pmap_page_exists_quick: page %p is not managed", m));
 	rv = FALSE;
 	vm_page_lock_queues();
 	TAILQ_FOREACH(pv, &m->md.pv_list, pv_list) {
 	    	if (pv->pv_pmap == pmap) {
 			rv = TRUE;
 			break;
 	    	}
 		loops++;
 		if (loops >= 16)
 			break;
 	}
 	vm_page_unlock_queues();
 	return (rv);
 }
 
 /*
  *	pmap_page_wired_mappings:
  *
  *	Return the number of managed mappings to the given physical page
  *	that are wired.
  */
 int
 pmap_page_wired_mappings(vm_page_t m)
 {
 	pv_entry_t pv;
 	int count;
 
 	count = 0;
 	if ((m->flags & PG_FICTITIOUS) != 0)
 		return (count);
 	vm_page_lock_queues();
 	TAILQ_FOREACH(pv, &m->md.pv_list, pv_list)
 		if ((pv->pv_flags & PVF_WIRED) != 0)
 			count++;
 	vm_page_unlock_queues();
 	return (count);
 }
 
 /*
  *	pmap_ts_referenced:
  *
  *	Return the count of reference bits for a page, clearing all of them.
  */
 int
 pmap_ts_referenced(vm_page_t m)
 {
 
 	KASSERT((m->flags & (PG_FICTITIOUS | PG_UNMANAGED)) == 0,
 	    ("pmap_ts_referenced: page %p is not managed", m));
 	return (pmap_clearbit(m, PVF_REF));
 }
 
 
 boolean_t
 pmap_is_modified(vm_page_t m)
 {
 
 	KASSERT((m->flags & (PG_FICTITIOUS | PG_UNMANAGED)) == 0,
 	    ("pmap_is_modified: page %p is not managed", m));
 	if (m->md.pvh_attrs & PVF_MOD)
 		return (TRUE);
 	
 	return (FALSE);
 }
 
 
 /*
  *	Clear the modify bits on the specified physical page.
  */
 void
 pmap_clear_modify(vm_page_t m)
 {
 
 	KASSERT((m->flags & (PG_FICTITIOUS | PG_UNMANAGED)) == 0,
 	    ("pmap_clear_modify: page %p is not managed", m));
 	VM_OBJECT_LOCK_ASSERT(m->object, MA_OWNED);
 	KASSERT((m->oflags & VPO_BUSY) == 0,
 	    ("pmap_clear_modify: page %p is busy", m));
 
 	/*
 	 * If the page is not PG_WRITEABLE, then no mappings can be modified.
 	 * If the object containing the page is locked and the page is not
 	 * VPO_BUSY, then PG_WRITEABLE cannot be concurrently set.
 	 */
 	if ((m->flags & PG_WRITEABLE) == 0)
 		return;
 	if (m->md.pvh_attrs & PVF_MOD)
 		pmap_clearbit(m, PVF_MOD);
 }
 
 
 /*
  *	pmap_is_referenced:
  *
  *	Return whether or not the specified physical page was referenced
  *	in any physical maps.
  */
 boolean_t
 pmap_is_referenced(vm_page_t m)
 {
 
 	KASSERT((m->flags & (PG_FICTITIOUS | PG_UNMANAGED)) == 0,
 	    ("pmap_is_referenced: page %p is not managed", m));
 	return ((m->md.pvh_attrs & PVF_REF) != 0);
 }
 
 /*
  *	pmap_clear_reference:
  *
  *	Clear the reference bit on the specified physical page.
  */
 void
 pmap_clear_reference(vm_page_t m)
 {
 
 	KASSERT((m->flags & (PG_FICTITIOUS | PG_UNMANAGED)) == 0,
 	    ("pmap_clear_reference: page %p is not managed", m));
 	if (m->md.pvh_attrs & PVF_REF) 
 		pmap_clearbit(m, PVF_REF);
 }
 
 
 /*
  * Clear the write and modified bits in each of the given page's mappings.
  */
 void
 pmap_remove_write(vm_page_t m)
 {
 
 	KASSERT((m->flags & (PG_FICTITIOUS | PG_UNMANAGED)) == 0,
 	    ("pmap_remove_write: page %p is not managed", m));
 
 	/*
 	 * If the page is not VPO_BUSY, then PG_WRITEABLE cannot be set by
 	 * another thread while the object is locked.  Thus, if PG_WRITEABLE
 	 * is clear, no page table entries need updating.
 	 */
 	VM_OBJECT_LOCK_ASSERT(m->object, MA_OWNED);
 	if ((m->oflags & VPO_BUSY) != 0 ||
 	    (m->flags & PG_WRITEABLE) != 0)
 		pmap_clearbit(m, PVF_WRITE);
 }
 
 
 /*
  * perform the pmap work for mincore
  */
 int
 pmap_mincore(pmap_t pmap, vm_offset_t addr, vm_paddr_t *locked_pa)
 {
 	printf("pmap_mincore()\n");
 	
 	return (0);
 }
 
 
 void
 pmap_sync_icache(pmap_t pm, vm_offset_t va, vm_size_t sz)
 {
 }
 
 
 /*
  *	Increase the starting virtual address of the given mapping if a
  *	different alignment might result in more superpage mappings.
  */
 void
 pmap_align_superpage(vm_object_t object, vm_ooffset_t offset,
     vm_offset_t *addr, vm_size_t size)
 {
 }
 
 
 /*
  * Map a set of physical memory pages into the kernel virtual
  * address space. Return a pointer to where it is mapped. This
  * routine is intended to be used for mapping device memory,
  * NOT real memory.
  */
 void *
 pmap_mapdev(vm_offset_t pa, vm_size_t size)
 {
 	vm_offset_t va, tmpva, offset;
 	
 	offset = pa & PAGE_MASK;
 	size = roundup(size, PAGE_SIZE);
 	
 	GIANT_REQUIRED;
 	
 	va = kmem_alloc_nofault(kernel_map, size);
 	if (!va)
 		panic("pmap_mapdev: Couldn't alloc kernel virtual memory");
 	for (tmpva = va; size > 0;) {
 		pmap_kenter_internal(tmpva, pa, 0);
 		size -= PAGE_SIZE;
 		tmpva += PAGE_SIZE;
 		pa += PAGE_SIZE;
 	}
 	
 	return ((void *)(va + offset));
 }
 
 #define BOOTSTRAP_DEBUG
 
 /*
  * pmap_map_section:
  *
  *	Create a single section mapping.
  */
 void
 pmap_map_section(vm_offset_t l1pt, vm_offset_t va, vm_offset_t pa,
     int prot, int cache)
 {
 	pd_entry_t *pde = (pd_entry_t *) l1pt;
 	pd_entry_t fl;
 
 	KASSERT(((va | pa) & L1_S_OFFSET) == 0, ("ouin2"));
 
 	switch (cache) {
 	case PTE_NOCACHE:
 	default:
 		fl = 0;
 		break;
 
 	case PTE_CACHE:
 		fl = pte_l1_s_cache_mode;
 		break;
 
 	case PTE_PAGETABLE:
 		fl = pte_l1_s_cache_mode_pt;
 		break;
 	}
 
 	pde[va >> L1_S_SHIFT] = L1_S_PROTO | pa |
 	    L1_S_PROT(PTE_KERNEL, prot) | fl | L1_S_DOM(PMAP_DOMAIN_KERNEL);
 	PTE_SYNC(&pde[va >> L1_S_SHIFT]);
 
 }
 
 /*
  * pmap_link_l2pt:
  *
  *	Link the L2 page table specified by l2pv.pv_pa into the L1
  *	page table at the slot for "va".
  */
 void
 pmap_link_l2pt(vm_offset_t l1pt, vm_offset_t va, struct pv_addr *l2pv)
 {
 	pd_entry_t *pde = (pd_entry_t *) l1pt, proto;
 	u_int slot = va >> L1_S_SHIFT;
 
 	proto = L1_S_DOM(PMAP_DOMAIN_KERNEL) | L1_C_PROTO;
 
 #ifdef VERBOSE_INIT_ARM     
 	printf("pmap_link_l2pt: pa=0x%x va=0x%x\n", l2pv->pv_pa, l2pv->pv_va);
 #endif
 
 	pde[slot + 0] = proto | (l2pv->pv_pa + 0x000);
 
 	PTE_SYNC(&pde[slot]);
 
 	SLIST_INSERT_HEAD(&kernel_pt_list, l2pv, pv_list);
 
 	
 }
 
 /*
  * pmap_map_entry
  *
  * 	Create a single page mapping.
  */
 void
 pmap_map_entry(vm_offset_t l1pt, vm_offset_t va, vm_offset_t pa, int prot,
     int cache)
 {
 	pd_entry_t *pde = (pd_entry_t *) l1pt;
 	pt_entry_t fl;
 	pt_entry_t *pte;
 
 	KASSERT(((va | pa) & PAGE_MASK) == 0, ("ouin"));
 
 	switch (cache) {
 	case PTE_NOCACHE:
 	default:
 		fl = 0;
 		break;
 
 	case PTE_CACHE:
 		fl = pte_l2_s_cache_mode;
 		break;
 
 	case PTE_PAGETABLE:
 		fl = pte_l2_s_cache_mode_pt;
 		break;
 	}
 
 	if ((pde[va >> L1_S_SHIFT] & L1_TYPE_MASK) != L1_TYPE_C)
 		panic("pmap_map_entry: no L2 table for VA 0x%08x", va);
 
 	pte = (pt_entry_t *) kernel_pt_lookup(pde[L1_IDX(va)] & L1_C_ADDR_MASK);
 
 	if (pte == NULL)
 		panic("pmap_map_entry: can't find L2 table for VA 0x%08x", va);
 
 	pte[l2pte_index(va)] =
 	    L2_S_PROTO | pa | L2_S_PROT(PTE_KERNEL, prot) | fl;
 	PTE_SYNC(&pte[l2pte_index(va)]);
 }
 
 /*
  * pmap_map_chunk:
  *
  *	Map a chunk of memory using the most efficient mappings
  *	possible (section, large page, small page) into the
  *	provided L1 and L2 tables at the specified virtual address.
  */
 vm_size_t
 pmap_map_chunk(vm_offset_t l1pt, vm_offset_t va, vm_offset_t pa,
     vm_size_t size, int prot, int cache)
 {
 	pd_entry_t *pde = (pd_entry_t *) l1pt;
 	pt_entry_t *pte, f1, f2s, f2l;
 	vm_size_t resid;  
 	int i;
 
 	resid = (size + (PAGE_SIZE - 1)) & ~(PAGE_SIZE - 1);
 
 	if (l1pt == 0)
 		panic("pmap_map_chunk: no L1 table provided");
 
 #ifdef VERBOSE_INIT_ARM     
 	printf("pmap_map_chunk: pa=0x%x va=0x%x size=0x%x resid=0x%x "
 	    "prot=0x%x cache=%d\n", pa, va, size, resid, prot, cache);
 #endif
 
 	switch (cache) {
 	case PTE_NOCACHE:
 	default:
 		f1 = 0;
 		f2l = 0;
 		f2s = 0;
 		break;
 
 	case PTE_CACHE:
 		f1 = pte_l1_s_cache_mode;
 		f2l = pte_l2_l_cache_mode;
 		f2s = pte_l2_s_cache_mode;
 		break;
 
 	case PTE_PAGETABLE:
 		f1 = pte_l1_s_cache_mode_pt;
 		f2l = pte_l2_l_cache_mode_pt;
 		f2s = pte_l2_s_cache_mode_pt;
 		break;
 	}
 
 	size = resid;
 
 	while (resid > 0) {
 		/* See if we can use a section mapping. */
 		if (L1_S_MAPPABLE_P(va, pa, resid)) {
 #ifdef VERBOSE_INIT_ARM
 			printf("S");
 #endif
 			pde[va >> L1_S_SHIFT] = L1_S_PROTO | pa |
 			    L1_S_PROT(PTE_KERNEL, prot) | f1 |
 			    L1_S_DOM(PMAP_DOMAIN_KERNEL);
 			PTE_SYNC(&pde[va >> L1_S_SHIFT]);
 			va += L1_S_SIZE;
 			pa += L1_S_SIZE;
 			resid -= L1_S_SIZE;
 			continue;
 		}
 
 		/*
 		 * Ok, we're going to use an L2 table.  Make sure
 		 * one is actually in the corresponding L1 slot
 		 * for the current VA.
 		 */
 		if ((pde[va >> L1_S_SHIFT] & L1_TYPE_MASK) != L1_TYPE_C)
 			panic("pmap_map_chunk: no L2 table for VA 0x%08x", va);
 
 		pte = (pt_entry_t *) kernel_pt_lookup(
 		    pde[L1_IDX(va)] & L1_C_ADDR_MASK);
 		if (pte == NULL)
 			panic("pmap_map_chunk: can't find L2 table for VA "
 			    "0x%08x", va);
 		/* See if we can use a L2 large page mapping. */
 		if (L2_L_MAPPABLE_P(va, pa, resid)) {
 #ifdef VERBOSE_INIT_ARM
 			printf("L");
 #endif
 			for (i = 0; i < 16; i++) {
 				pte[l2pte_index(va) + i] =
 				    L2_L_PROTO | pa |
 				    L2_L_PROT(PTE_KERNEL, prot) | f2l;
 				PTE_SYNC(&pte[l2pte_index(va) + i]);
 			}
 			va += L2_L_SIZE;
 			pa += L2_L_SIZE;
 			resid -= L2_L_SIZE;
 			continue;
 		}
 
 		/* Use a small page mapping. */
 #ifdef VERBOSE_INIT_ARM
 		printf("P");
 #endif
 		pte[l2pte_index(va)] =
 		    L2_S_PROTO | pa | L2_S_PROT(PTE_KERNEL, prot) | f2s;
 		PTE_SYNC(&pte[l2pte_index(va)]);
 		va += PAGE_SIZE;
 		pa += PAGE_SIZE;
 		resid -= PAGE_SIZE;
 	}
 #ifdef VERBOSE_INIT_ARM
 	printf("\n");
 #endif
 	return (size);
 
 }
 
 /********************** Static device map routines ***************************/
 
 static const struct pmap_devmap *pmap_devmap_table;
 
 /*
  * Register the devmap table.  This is provided in case early console
  * initialization needs to register mappings created by bootstrap code
  * before pmap_devmap_bootstrap() is called.
  */
 void
 pmap_devmap_register(const struct pmap_devmap *table)
 {
 
 	pmap_devmap_table = table;
 }
 
 /*
  * Map all of the static regions in the devmap table, and remember
  * the devmap table so other parts of the kernel can look up entries
  * later.
  */
 void
 pmap_devmap_bootstrap(vm_offset_t l1pt, const struct pmap_devmap *table)
 {
 	int i;
 
 	pmap_devmap_table = table;
 
 	for (i = 0; pmap_devmap_table[i].pd_size != 0; i++) {
 #ifdef VERBOSE_INIT_ARM
 		printf("devmap: %08x -> %08x @ %08x\n",
 		    pmap_devmap_table[i].pd_pa,
 		    pmap_devmap_table[i].pd_pa +
 			pmap_devmap_table[i].pd_size - 1,
 		    pmap_devmap_table[i].pd_va);
 #endif
 		pmap_map_chunk(l1pt, pmap_devmap_table[i].pd_va,
 		    pmap_devmap_table[i].pd_pa,
 		    pmap_devmap_table[i].pd_size,
 		    pmap_devmap_table[i].pd_prot,
 		    pmap_devmap_table[i].pd_cache);
 	}
 }
 
 const struct pmap_devmap *
 pmap_devmap_find_pa(vm_paddr_t pa, vm_size_t size)
 {
 	int i;
 
 	if (pmap_devmap_table == NULL)
 		return (NULL);
 
 	for (i = 0; pmap_devmap_table[i].pd_size != 0; i++) {
 		if (pa >= pmap_devmap_table[i].pd_pa &&
 		    pa + size <= pmap_devmap_table[i].pd_pa +
 				 pmap_devmap_table[i].pd_size)
 			return (&pmap_devmap_table[i]);
 	}
 
 	return (NULL);
 }
 
 const struct pmap_devmap *
 pmap_devmap_find_va(vm_offset_t va, vm_size_t size)
 {
 	int i;
 
 	if (pmap_devmap_table == NULL)
 		return (NULL);
 
 	for (i = 0; pmap_devmap_table[i].pd_size != 0; i++) {
 		if (va >= pmap_devmap_table[i].pd_va &&
 		    va + size <= pmap_devmap_table[i].pd_va +
 				 pmap_devmap_table[i].pd_size)
 			return (&pmap_devmap_table[i]);
 	}
 
 	return (NULL);
 }
 
Index: head/sys/arm/include/_types.h
===================================================================
--- head/sys/arm/include/_types.h	(revision 222812)
+++ head/sys/arm/include/_types.h	(revision 222813)
@@ -1,123 +1,122 @@
 /*-
  * Copyright (c) 2002 Mike Barcroft <mike@FreeBSD.org>
  * Copyright (c) 1990, 1993
  *	The Regents of the University of California.  All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
  * 1. Redistributions of source code must retain the above copyright
  *    notice, this list of conditions and the following disclaimer.
  * 2. Redistributions in binary form must reproduce the above copyright
  *    notice, this list of conditions and the following disclaimer in the
  *    documentation and/or other materials provided with the distribution.
  * 3. All advertising materials mentioning features or use of this software
  *    must display the following acknowledgement:
  *	This product includes software developed by the University of
  *	California, Berkeley and its contributors.
  * 4. Neither the name of the University nor the names of its contributors
  *    may be used to endorse or promote products derived from this software
  *    without specific prior written permission.
  *
  * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
  * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
  * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
  * ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
  * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
  * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
  * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
  * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
  * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
  * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
  * SUCH DAMAGE.
  *
  *	From: @(#)ansi.h	8.2 (Berkeley) 1/4/94
  *	From: @(#)types.h	8.3 (Berkeley) 1/5/94
  * $FreeBSD$
  */
 
 #ifndef _MACHINE__TYPES_H_
 #define	_MACHINE__TYPES_H_
 
 #ifndef _SYS_CDEFS_H_
 #error this file needs sys/cdefs.h as a prerequisite
 #endif
 
 /*
  * Basic types upon which most other types are built.
  */
 typedef	__signed char		__int8_t;
 typedef	unsigned char		__uint8_t;
 typedef	short			__int16_t;
 typedef	unsigned short		__uint16_t;
 typedef	int			__int32_t;
 typedef	unsigned int		__uint32_t;
 #ifndef lint
 __extension__
 #endif
 /* LONGLONG */
 typedef	long long		__int64_t;
 #ifndef lint
 __extension__
 #endif
 /* LONGLONG */
 typedef	unsigned long long	__uint64_t;
 
 /*
  * Standard type definitions.
  */
 typedef	__uint32_t	__clock_t;		/* clock()... */
-typedef	unsigned int	__cpumask_t;
 typedef	__int32_t	__critical_t;
 typedef	double		__double_t;
 typedef	double		__float_t;
 typedef	__int32_t	__intfptr_t;
 typedef	__int64_t	__intmax_t;
 typedef	__int32_t	__intptr_t;
 typedef	__int32_t	__int_fast8_t;
 typedef	__int32_t	__int_fast16_t;
 typedef	__int32_t	__int_fast32_t;
 typedef	__int64_t	__int_fast64_t;
 typedef	__int8_t	__int_least8_t;
 typedef	__int16_t	__int_least16_t;
 typedef	__int32_t	__int_least32_t;
 typedef	__int64_t	__int_least64_t;
 typedef	__int32_t	__ptrdiff_t;		/* ptr1 - ptr2 */
 typedef	__int32_t	__register_t;
 typedef	__int32_t	__segsz_t;		/* segment size (in pages) */
 typedef	__uint32_t	__size_t;		/* sizeof() */
 typedef	__int32_t	__ssize_t;		/* byte count or error */
 typedef	__int64_t	__time_t;		/* time()... */
 typedef	__uint32_t	__uintfptr_t;
 typedef	__uint64_t	__uintmax_t;
 typedef	__uint32_t	__uintptr_t;
 typedef	__uint32_t	__uint_fast8_t;
 typedef	__uint32_t	__uint_fast16_t;
 typedef	__uint32_t	__uint_fast32_t;
 typedef	__uint64_t	__uint_fast64_t;
 typedef	__uint8_t	__uint_least8_t;
 typedef	__uint16_t	__uint_least16_t;
 typedef	__uint32_t	__uint_least32_t;
 typedef	__uint64_t	__uint_least64_t;
 typedef	__uint32_t	__u_register_t;
 typedef	__uint32_t	__vm_offset_t;
 typedef	__int64_t	__vm_ooffset_t;
 typedef	__uint32_t	__vm_paddr_t;
 typedef	__uint64_t	__vm_pindex_t;
 typedef	__uint32_t	__vm_size_t;
 
 /*
  * Unusual type definitions.
  */
 #ifdef __GNUCLIKE_BUILTIN_VARARGS
 typedef __builtin_va_list	__va_list;	/* internally known to gcc */
 #else
 typedef	char *			__va_list;
 #endif /* __GNUCLIKE_BUILTIN_VARARGS */
 #if defined(__GNUCLIKE_BUILTIN_VAALIST) && !defined(__GNUC_VA_LIST) \
     && !defined(__NO_GNUC_VA_LIST)
 #define __GNUC_VA_LIST
 typedef __va_list		__gnuc_va_list;	/* compatibility w/GNU headers*/
 #endif
 
 #endif /* !_MACHINE__TYPES_H_ */
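Note on the change above: __cpumask_t was a single unsigned int, so it
could name at most as many CPUs as that word has bits. A set built from
an array of words can be sized for any CPU count. What follows is a
minimal standalone sketch of that idea, not the kernel's implementation.
Every demo_* name and DEMO_* constant is hypothetical, chosen only to
resemble the CPU_ZERO, CPU_SET, CPU_CLR and CPU_ISSET macros provided by
<sys/cpuset.h>.

```c
#include <assert.h>
#include <limits.h>
#include <string.h>

/* Hypothetical limits and layout; the real cpuset_t is defined elsewhere. */
#define	DEMO_MAXCPU	64
#define	DEMO_NBITS	(sizeof(unsigned long) * CHAR_BIT)
#define	DEMO_NWORDS	((DEMO_MAXCPU + DEMO_NBITS - 1) / DEMO_NBITS)

typedef struct {
	unsigned long	bits[DEMO_NWORDS];	/* one bit per CPU id */
} demo_cpuset_t;

/* Clear every CPU in the set. */
static void
demo_cpu_zero(demo_cpuset_t *set)
{

	memset(set, 0, sizeof(*set));
}

/* Add one CPU to the set. */
static void
demo_cpu_set(unsigned int cpu, demo_cpuset_t *set)
{

	assert(cpu < DEMO_MAXCPU);
	set->bits[cpu / DEMO_NBITS] |= 1UL << (cpu % DEMO_NBITS);
}

/* Remove one CPU from the set. */
static void
demo_cpu_clr(unsigned int cpu, demo_cpuset_t *set)
{

	assert(cpu < DEMO_MAXCPU);
	set->bits[cpu / DEMO_NBITS] &= ~(1UL << (cpu % DEMO_NBITS));
}

/* Return 1 if the CPU is in the set, 0 otherwise. */
static int
demo_cpu_isset(unsigned int cpu, const demo_cpuset_t *set)
{

	assert(cpu < DEMO_MAXCPU);
	return ((int)((set->bits[cpu / DEMO_NBITS] >>
	    (cpu % DEMO_NBITS)) & 1UL));
}
```

With a one-word mask, a CPU id of 40 would not fit in a 32-bit unsigned
int. In the sketch it is simply another bit in the array, and raising
DEMO_MAXCPU grows the array without touching the callers.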
Index: head/sys/arm/include/pmap.h
===================================================================
--- head/sys/arm/include/pmap.h	(revision 222812)
+++ head/sys/arm/include/pmap.h	(revision 222813)
@@ -1,552 +1,553 @@
 /*-
  * Copyright (c) 1991 Regents of the University of California.
  * All rights reserved.
  *
  * This code is derived from software contributed to Berkeley by
  * the Systems Programming Group of the University of Utah Computer
  * Science Department and William Jolitz of UUNET Technologies Inc.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
  * 1. Redistributions of source code must retain the above copyright
  *    notice, this list of conditions and the following disclaimer.
  * 2. Redistributions in binary form must reproduce the above copyright
  *    notice, this list of conditions and the following disclaimer in the
  *    documentation and/or other materials provided with the distribution.
  * 3. All advertising materials mentioning features or use of this software
  *    must display the following acknowledgement:
  *      This product includes software developed by the University of
  *      California, Berkeley and its contributors.
  * 4. Neither the name of the University nor the names of its contributors
  *    may be used to endorse or promote products derived from this software
  *    without specific prior written permission.
  *
  * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
  * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
  * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
  * ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
  * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
  * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
  * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
  * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
  * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
  * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
  * SUCH DAMAGE.
  *
  * Derived from hp300 version by Mike Hibler, this version by William
  * Jolitz uses a recursive map [a pde points to the page directory] to
  * map the page tables using the pagetables themselves. This is done to
  * reduce the impact on kernel virtual memory for lots of sparse address
  * space, and to reduce the cost of memory to each process.
  *
  *      from: hp300: @(#)pmap.h 7.2 (Berkeley) 12/16/90
  *      from: @(#)pmap.h        7.4 (Berkeley) 5/12/91
  * 	from: FreeBSD: src/sys/i386/include/pmap.h,v 1.70 2000/11/30
  *
  * $FreeBSD$
  */
 
 #ifndef _MACHINE_PMAP_H_
 #define _MACHINE_PMAP_H_
 
 #include <machine/pte.h>
 #include <machine/cpuconf.h>
 /*
  * Pte related macros
  */
 #define PTE_NOCACHE	0
 #define PTE_CACHE	1
 #define PTE_PAGETABLE	2
  
 #ifndef LOCORE
 
 #include <sys/queue.h>
+#include <sys/_cpuset.h>
 #include <sys/_lock.h>
 #include <sys/_mutex.h>
 
 #define PDESIZE		sizeof(pd_entry_t)	/* for assembly files */
 #define PTESIZE		sizeof(pt_entry_t)	/* for assembly files */
 
 #ifdef _KERNEL
 
 #define vtophys(va)	pmap_extract(pmap_kernel(), (vm_offset_t)(va))
 #define pmap_kextract(va)	pmap_extract(pmap_kernel(), (vm_offset_t)(va))
 
 #endif
 
 #define	pmap_page_get_memattr(m)	VM_MEMATTR_DEFAULT
 #define	pmap_page_is_mapped(m)	(!TAILQ_EMPTY(&(m)->md.pv_list))
 #define	pmap_page_set_memattr(m, ma)	(void)0
 
 /*
  * Pmap stuff
  */
 
 /*
  * This structure is used to hold a virtual<->physical address
  * association and is used mostly by bootstrap code
  */
 struct pv_addr {
 	SLIST_ENTRY(pv_addr) pv_list;
 	vm_offset_t	pv_va;
 	vm_paddr_t	pv_pa;
 };
 
 struct	pv_entry;
 
 struct	md_page {
 	int pvh_attrs;
 	vm_offset_t pv_kva;		/* first kernel VA mapping */
 	TAILQ_HEAD(,pv_entry)	pv_list;
 };
 
 #define	VM_MDPAGE_INIT(pg)						\
 do {									\
 	TAILQ_INIT(&pg->pv_list);					\
 	mtx_init(&(pg)->md_page.pvh_mtx, "MDPAGE Mutex", NULL, MTX_DEV);\
 	(pg)->mdpage.pvh_attrs = 0;					\
 } while (/*CONSTCOND*/0)
 
 struct l1_ttable;
 struct l2_dtable;
 
 
 /*
  * The number of L2 descriptor tables which can be tracked by an l2_dtable.
  * A bucket size of 16 provides for 16MB of contiguous virtual address
  * space per l2_dtable. Most processes will, therefore, require only two or
  * three of these to map their whole working set.
  */
 #define	L2_BUCKET_LOG2	4
 #define	L2_BUCKET_SIZE	(1 << L2_BUCKET_LOG2)
 /*
  * Given the above "L2-descriptors-per-l2_dtable" constant, the number
  * of l2_dtable structures required to track all possible page descriptors
  * mappable by an L1 translation table is given by the following constants:
  */
 #define	L2_LOG2		((32 - L1_S_SHIFT) - L2_BUCKET_LOG2)
 #define	L2_SIZE		(1 << L2_LOG2)
 
 struct	pmap {
 	struct mtx		pm_mtx;
 	u_int8_t		pm_domain;
 	struct l1_ttable	*pm_l1;
 	struct l2_dtable	*pm_l2[L2_SIZE];
 	pd_entry_t		*pm_pdir;	/* KVA of page directory */
-	cpumask_t		pm_active;	/* active on cpus */
+	cpuset_t		pm_active;	/* active on cpus */
 	struct pmap_statistics	pm_stats;	/* pmap statistics */
 	TAILQ_HEAD(,pv_entry)	pm_pvlist;	/* list of mappings in pmap */
 };
 
 typedef struct pmap *pmap_t;
 
 #ifdef _KERNEL
 extern struct pmap	kernel_pmap_store;
 #define kernel_pmap	(&kernel_pmap_store)
 #define pmap_kernel() kernel_pmap
 
 #define	PMAP_ASSERT_LOCKED(pmap) \
 				mtx_assert(&(pmap)->pm_mtx, MA_OWNED)
 #define	PMAP_LOCK(pmap)		mtx_lock(&(pmap)->pm_mtx)
 #define	PMAP_LOCK_DESTROY(pmap)	mtx_destroy(&(pmap)->pm_mtx)
 #define	PMAP_LOCK_INIT(pmap)	mtx_init(&(pmap)->pm_mtx, "pmap", \
 				    NULL, MTX_DEF | MTX_DUPOK)
 #define	PMAP_OWNED(pmap)	mtx_owned(&(pmap)->pm_mtx)
 #define	PMAP_MTX(pmap)		(&(pmap)->pm_mtx)
 #define	PMAP_TRYLOCK(pmap)	mtx_trylock(&(pmap)->pm_mtx)
 #define	PMAP_UNLOCK(pmap)	mtx_unlock(&(pmap)->pm_mtx)
 #endif
 
 
 /*
  * For each vm_page_t, there is a list of all currently valid virtual
  * mappings of that page.  An entry is a pv_entry_t, the list is pv_list.
  */
 typedef struct pv_entry {
 	pmap_t          pv_pmap;        /* pmap where mapping lies */
 	vm_offset_t     pv_va;          /* virtual address for mapping */
 	TAILQ_ENTRY(pv_entry)   pv_list;
 	TAILQ_ENTRY(pv_entry)	pv_plist;
 	int		pv_flags;	/* flags (wired, etc...) */
 } *pv_entry_t;
 
 #ifdef _KERNEL
 
 boolean_t pmap_get_pde_pte(pmap_t, vm_offset_t, pd_entry_t **, pt_entry_t **);
 
 /*
  * virtual address to page table entry and
  * to physical address. Likewise for alternate address space.
  * Note: these work recursively, thus vtopte of a pte will give
  * the corresponding pde that in turn maps it.
  */
 
 /*
  * The current top of kernel VM.
  */
 extern vm_offset_t pmap_curmaxkvaddr;
 
 struct pcb;
 
 void	pmap_set_pcb_pagedir(pmap_t, struct pcb *);
 /* Virtual address to page table entry */
 static __inline pt_entry_t *
 vtopte(vm_offset_t va)
 {
 	pd_entry_t *pdep;
 	pt_entry_t *ptep;
 
 	if (pmap_get_pde_pte(pmap_kernel(), va, &pdep, &ptep) == FALSE)
 		return (NULL);
 	return (ptep);
 }
 
 extern vm_paddr_t phys_avail[];
 extern vm_offset_t virtual_avail;
 extern vm_offset_t virtual_end;
 
 void	pmap_bootstrap(vm_offset_t, vm_offset_t, struct pv_addr *);
 void	pmap_kenter(vm_offset_t va, vm_paddr_t pa);
 void	pmap_kenter_nocache(vm_offset_t va, vm_paddr_t pa);
 void	*pmap_kenter_temp(vm_paddr_t pa, int i);
 void 	pmap_kenter_user(vm_offset_t va, vm_paddr_t pa);
 void	pmap_kremove(vm_offset_t);
 void	*pmap_mapdev(vm_offset_t, vm_size_t);
 void	pmap_unmapdev(vm_offset_t, vm_size_t);
 vm_page_t	pmap_use_pt(pmap_t, vm_offset_t);
 void	pmap_debug(int);
 void	pmap_map_section(vm_offset_t, vm_offset_t, vm_offset_t, int, int);
 void	pmap_link_l2pt(vm_offset_t, vm_offset_t, struct pv_addr *);
 vm_size_t	pmap_map_chunk(vm_offset_t, vm_offset_t, vm_offset_t, vm_size_t, int, int);
 void
 pmap_map_entry(vm_offset_t l1pt, vm_offset_t va, vm_offset_t pa, int prot,
     int cache);
 int pmap_fault_fixup(pmap_t, vm_offset_t, vm_prot_t, int);
 
 /*
  * Definitions for MMU domains
  */
 #define	PMAP_DOMAINS		15	/* 15 'user' domains (1-15) */
 #define	PMAP_DOMAIN_KERNEL	0	/* The kernel uses domain #0 */
 
 /*
  * The new pmap ensures that page-tables are always mapping Write-Thru.
  * Thus, on some platforms we can run fast and loose and avoid syncing PTEs
  * on every change.
  *
  * Unfortunately, not all CPUs have a write-through cache mode.  So we
  * define PMAP_NEEDS_PTE_SYNC for C code to conditionally do PTE syncs,
  * and if there is the chance for PTE syncs to be needed, we define
  * PMAP_INCLUDE_PTE_SYNC so e.g. assembly code can include (and run)
  * the code.
  */
 extern int pmap_needs_pte_sync;
 
 /*
  * These macros define the various bit masks in the PTE.
  *
  * We use these macros since we use different bits on different processor
  * models.
  */
 #define	L1_S_PROT_U		(L1_S_AP(AP_U))
 #define	L1_S_PROT_W		(L1_S_AP(AP_W))
 #define	L1_S_PROT_MASK		(L1_S_PROT_U|L1_S_PROT_W)
 
 #define	L1_S_CACHE_MASK_generic	(L1_S_B|L1_S_C)
 #define	L1_S_CACHE_MASK_xscale	(L1_S_B|L1_S_C|L1_S_XSCALE_TEX(TEX_XSCALE_X)|\
     				L1_S_XSCALE_TEX(TEX_XSCALE_T))
 
 #define	L2_L_PROT_U		(L2_AP(AP_U))
 #define	L2_L_PROT_W		(L2_AP(AP_W))
 #define	L2_L_PROT_MASK		(L2_L_PROT_U|L2_L_PROT_W)
 
 #define	L2_L_CACHE_MASK_generic	(L2_B|L2_C)
 #define	L2_L_CACHE_MASK_xscale	(L2_B|L2_C|L2_XSCALE_L_TEX(TEX_XSCALE_X) | \
     				L2_XSCALE_L_TEX(TEX_XSCALE_T))
 
 #define	L2_S_PROT_U_generic	(L2_AP(AP_U))
 #define	L2_S_PROT_W_generic	(L2_AP(AP_W))
 #define	L2_S_PROT_MASK_generic	(L2_S_PROT_U|L2_S_PROT_W)
 
 #define	L2_S_PROT_U_xscale	(L2_AP0(AP_U))
 #define	L2_S_PROT_W_xscale	(L2_AP0(AP_W))
 #define	L2_S_PROT_MASK_xscale	(L2_S_PROT_U|L2_S_PROT_W)
 
 #define	L2_S_CACHE_MASK_generic	(L2_B|L2_C)
 #define	L2_S_CACHE_MASK_xscale	(L2_B|L2_C|L2_XSCALE_T_TEX(TEX_XSCALE_X)| \
     				 L2_XSCALE_T_TEX(TEX_XSCALE_X))
 
 #define	L1_S_PROTO_generic	(L1_TYPE_S | L1_S_IMP)
 #define	L1_S_PROTO_xscale	(L1_TYPE_S)
 
 #define	L1_C_PROTO_generic	(L1_TYPE_C | L1_C_IMP2)
 #define	L1_C_PROTO_xscale	(L1_TYPE_C)
 
 #define	L2_L_PROTO		(L2_TYPE_L)
 
 #define	L2_S_PROTO_generic	(L2_TYPE_S)
 #define	L2_S_PROTO_xscale	(L2_TYPE_XSCALE_XS)
 
 /*
  * User-visible names for the ones that vary with MMU class.
  */
 
 #if ARM_NMMUS > 1
 /* More than one MMU class configured; use variables. */
 #define	L2_S_PROT_U		pte_l2_s_prot_u
 #define	L2_S_PROT_W		pte_l2_s_prot_w
 #define	L2_S_PROT_MASK		pte_l2_s_prot_mask
 
 #define	L1_S_CACHE_MASK		pte_l1_s_cache_mask
 #define	L2_L_CACHE_MASK		pte_l2_l_cache_mask
 #define	L2_S_CACHE_MASK		pte_l2_s_cache_mask
 
 #define	L1_S_PROTO		pte_l1_s_proto
 #define	L1_C_PROTO		pte_l1_c_proto
 #define	L2_S_PROTO		pte_l2_s_proto
 
 #elif (ARM_MMU_GENERIC + ARM_MMU_SA1) != 0
 #define	L2_S_PROT_U		L2_S_PROT_U_generic
 #define	L2_S_PROT_W		L2_S_PROT_W_generic
 #define	L2_S_PROT_MASK		L2_S_PROT_MASK_generic
 
 #define	L1_S_CACHE_MASK		L1_S_CACHE_MASK_generic
 #define	L2_L_CACHE_MASK		L2_L_CACHE_MASK_generic
 #define	L2_S_CACHE_MASK		L2_S_CACHE_MASK_generic
 
 #define	L1_S_PROTO		L1_S_PROTO_generic
 #define	L1_C_PROTO		L1_C_PROTO_generic
 #define	L2_S_PROTO		L2_S_PROTO_generic
 
 #elif ARM_MMU_XSCALE == 1
 #define	L2_S_PROT_U		L2_S_PROT_U_xscale
 #define	L2_S_PROT_W		L2_S_PROT_W_xscale
 #define	L2_S_PROT_MASK		L2_S_PROT_MASK_xscale
 
 #define	L1_S_CACHE_MASK		L1_S_CACHE_MASK_xscale
 #define	L2_L_CACHE_MASK		L2_L_CACHE_MASK_xscale
 #define	L2_S_CACHE_MASK		L2_S_CACHE_MASK_xscale
 
 #define	L1_S_PROTO		L1_S_PROTO_xscale
 #define	L1_C_PROTO		L1_C_PROTO_xscale
 #define	L2_S_PROTO		L2_S_PROTO_xscale
 
 #endif /* ARM_NMMUS > 1 */
 
 #if (ARM_MMU_SA1 == 1) && (ARM_NMMUS == 1)
 #define	PMAP_NEEDS_PTE_SYNC	1
 #define	PMAP_INCLUDE_PTE_SYNC
 #elif defined(CPU_XSCALE_81342)
 #define PMAP_NEEDS_PTE_SYNC	1
 #define PMAP_INCLUDE_PTE_SYNC
 #elif (ARM_MMU_SA1 == 0)
 #define	PMAP_NEEDS_PTE_SYNC	0
 #endif
 
 /*
  * These macros return various bits based on kernel/user and protection.
  * Note that the compiler will usually fold these at compile time.
  */
 #define	L1_S_PROT(ku, pr)	((((ku) == PTE_USER) ? L1_S_PROT_U : 0) | \
 				 (((pr) & VM_PROT_WRITE) ? L1_S_PROT_W : 0))
 
 #define	L2_L_PROT(ku, pr)	((((ku) == PTE_USER) ? L2_L_PROT_U : 0) | \
 				 (((pr) & VM_PROT_WRITE) ? L2_L_PROT_W : 0))
 
 #define	L2_S_PROT(ku, pr)	((((ku) == PTE_USER) ? L2_S_PROT_U : 0) | \
 				 (((pr) & VM_PROT_WRITE) ? L2_S_PROT_W : 0))
 
 /*
  * Macros to test if a mapping is mappable with an L1 Section mapping
  * or an L2 Large Page mapping.
  */
 #define	L1_S_MAPPABLE_P(va, pa, size)					\
 	((((va) | (pa)) & L1_S_OFFSET) == 0 && (size) >= L1_S_SIZE)
 
 #define	L2_L_MAPPABLE_P(va, pa, size)					\
 	((((va) | (pa)) & L2_L_OFFSET) == 0 && (size) >= L2_L_SIZE)
 
 /*
  * Provide a fallback in case we were not able to determine it at
  * compile-time.
  */
 #ifndef PMAP_NEEDS_PTE_SYNC
 #define	PMAP_NEEDS_PTE_SYNC	pmap_needs_pte_sync
 #define	PMAP_INCLUDE_PTE_SYNC
 #endif
 
 #define	PTE_SYNC(pte)							\
 do {									\
 	if (PMAP_NEEDS_PTE_SYNC) {					\
 		cpu_dcache_wb_range((vm_offset_t)(pte), sizeof(pt_entry_t));\
 		cpu_l2cache_wb_range((vm_offset_t)(pte), sizeof(pt_entry_t));\
 	}\
 } while (/*CONSTCOND*/0)
 
 #define	PTE_SYNC_RANGE(pte, cnt)					\
 do {									\
 	if (PMAP_NEEDS_PTE_SYNC) {					\
 		cpu_dcache_wb_range((vm_offset_t)(pte),			\
 		    (cnt) << 2); /* * sizeof(pt_entry_t) */		\
 		cpu_l2cache_wb_range((vm_offset_t)(pte), 		\
 		    (cnt) << 2); /* * sizeof(pt_entry_t) */		\
 	}								\
 } while (/*CONSTCOND*/0)
 
 extern pt_entry_t		pte_l1_s_cache_mode;
 extern pt_entry_t		pte_l1_s_cache_mask;
 
 extern pt_entry_t		pte_l2_l_cache_mode;
 extern pt_entry_t		pte_l2_l_cache_mask;
 
 extern pt_entry_t		pte_l2_s_cache_mode;
 extern pt_entry_t		pte_l2_s_cache_mask;
 
 extern pt_entry_t		pte_l1_s_cache_mode_pt;
 extern pt_entry_t		pte_l2_l_cache_mode_pt;
 extern pt_entry_t		pte_l2_s_cache_mode_pt;
 
 extern pt_entry_t		pte_l2_s_prot_u;
 extern pt_entry_t		pte_l2_s_prot_w;
 extern pt_entry_t		pte_l2_s_prot_mask;
  
 extern pt_entry_t		pte_l1_s_proto;
 extern pt_entry_t		pte_l1_c_proto;
 extern pt_entry_t		pte_l2_s_proto;
 
 extern void (*pmap_copy_page_func)(vm_paddr_t, vm_paddr_t);
 extern void (*pmap_zero_page_func)(vm_paddr_t, int, int);
 
 #if (ARM_MMU_GENERIC + ARM_MMU_SA1) != 0 || defined(CPU_XSCALE_81342)
 void	pmap_copy_page_generic(vm_paddr_t, vm_paddr_t);
 void	pmap_zero_page_generic(vm_paddr_t, int, int);
 
 void	pmap_pte_init_generic(void);
 #if defined(CPU_ARM8)
 void	pmap_pte_init_arm8(void);
 #endif
 #if defined(CPU_ARM9)
 void	pmap_pte_init_arm9(void);
 #endif /* CPU_ARM9 */
 #if defined(CPU_ARM10)
 void	pmap_pte_init_arm10(void);
 #endif /* CPU_ARM10 */
 #endif /* (ARM_MMU_GENERIC + ARM_MMU_SA1) != 0 */
 
 #if /* ARM_MMU_SA1 == */1
 void	pmap_pte_init_sa1(void);
 #endif /* ARM_MMU_SA1 == 1 */
 
 #if ARM_MMU_XSCALE == 1
 void	pmap_copy_page_xscale(vm_paddr_t, vm_paddr_t);
 void	pmap_zero_page_xscale(vm_paddr_t, int, int);
 
 void	pmap_pte_init_xscale(void);
 
 void	xscale_setup_minidata(vm_offset_t, vm_offset_t, vm_offset_t);
 
 void	pmap_use_minicache(vm_offset_t, vm_size_t);
 #endif /* ARM_MMU_XSCALE == 1 */
 #if defined(CPU_XSCALE_81342)
 #define ARM_HAVE_SUPERSECTIONS
 #endif
 
 #define PTE_KERNEL	0
 #define PTE_USER	1
 #define	l1pte_valid(pde)	((pde) != 0)
 #define	l1pte_section_p(pde)	(((pde) & L1_TYPE_MASK) == L1_TYPE_S)
 #define	l1pte_page_p(pde)	(((pde) & L1_TYPE_MASK) == L1_TYPE_C)
 #define	l1pte_fpage_p(pde)	(((pde) & L1_TYPE_MASK) == L1_TYPE_F)
 
 #define l2pte_index(v)		(((v) & L2_ADDR_BITS) >> L2_S_SHIFT)
 #define	l2pte_valid(pte)	((pte) != 0)
 #define	l2pte_pa(pte)		((pte) & L2_S_FRAME)
 #define l2pte_minidata(pte)	(((pte) & \
 				 (L2_B | L2_C | L2_XSCALE_T_TEX(TEX_XSCALE_X)))\
 				 == (L2_C | L2_XSCALE_T_TEX(TEX_XSCALE_X)))
 
 /* L1 and L2 page table macros */
 #define pmap_pde_v(pde)		l1pte_valid(*(pde))
 #define pmap_pde_section(pde)	l1pte_section_p(*(pde))
 #define pmap_pde_page(pde)	l1pte_page_p(*(pde))
 #define pmap_pde_fpage(pde)	l1pte_fpage_p(*(pde))
 
 #define	pmap_pte_v(pte)		l2pte_valid(*(pte))
 #define	pmap_pte_pa(pte)	l2pte_pa(*(pte))
 
 /*
  * Flags that indicate attributes of pages or mappings of pages.
  *
  * The PVF_MOD and PVF_REF flags are stored in the mdpage for each
  * page.  PVF_WIRED, PVF_WRITE, and PVF_NC are kept in individual
  * pv_entry's for each page.  They live in the same "namespace" so
  * that we can clear multiple attributes at a time.
  *
  * Note the "non-cacheable" flag generally means the page has
  * multiple mappings in a given address space.
  */
 #define	PVF_MOD		0x01		/* page is modified */
 #define	PVF_REF		0x02		/* page is referenced */
 #define	PVF_WIRED	0x04		/* mapping is wired */
 #define	PVF_WRITE	0x08		/* mapping is writable */
 #define	PVF_EXEC	0x10		/* mapping is executable */
 #define	PVF_NC		0x20		/* mapping is non-cacheable */
 #define	PVF_MWC		0x40		/* mapping is used multiple times in userland */
 #define	PVF_UNMAN	0x80		/* mapping is unmanaged */
 
 void vector_page_setprot(int);
 
 void pmap_update(pmap_t);
 
 /*
  * This structure is used by machine-dependent code to describe
  * static mappings of devices, created at bootstrap time.
  */
 struct pmap_devmap {
 	vm_offset_t	pd_va;		/* virtual address */
 	vm_paddr_t	pd_pa;		/* physical address */
 	vm_size_t	pd_size;	/* size of region */
 	vm_prot_t	pd_prot;	/* protection code */
 	int		pd_cache;	/* cache attributes */
 };
 
 const struct pmap_devmap *pmap_devmap_find_pa(vm_paddr_t, vm_size_t);
 const struct pmap_devmap *pmap_devmap_find_va(vm_offset_t, vm_size_t);
 
 void	pmap_devmap_bootstrap(vm_offset_t, const struct pmap_devmap *);
 void	pmap_devmap_register(const struct pmap_devmap *);
 
 #define SECTION_CACHE	0x1
 #define SECTION_PT	0x2
 void	pmap_kenter_section(vm_offset_t, vm_paddr_t, int flags);
 #ifdef ARM_HAVE_SUPERSECTIONS
 void	pmap_kenter_supersection(vm_offset_t, uint64_t, int flags);
 #endif
 
 extern char *_tmppt;
 
 void	pmap_postinit(void);
 
 #ifdef ARM_USE_SMALL_ALLOC
 void	arm_add_smallalloc_pages(void *, void *, int, int);
 vm_offset_t arm_ptovirt(vm_paddr_t);
 void arm_init_smallalloc(void);
 struct arm_small_page {
 	void *addr;
 	TAILQ_ENTRY(arm_small_page) pg_list;
 };
 
 #endif
 
 #define ARM_NOCACHE_KVA_SIZE 0x1000000
 extern vm_offset_t arm_nocache_startaddr;
 void *arm_remap_nocache(void *, vm_size_t);
 void arm_unmap_nocache(void *, vm_size_t);
 
 extern vm_paddr_t dump_avail[];
 #endif	/* _KERNEL */
 
 #endif	/* !LOCORE */
 
 #endif	/* !_MACHINE_PMAP_H_ */
Index: head/sys/boot/i386/efi
===================================================================
--- head/sys/boot/i386/efi	(revision 222812)
+++ head/sys/boot/i386/efi	(revision 222813)

Property changes on: head/sys/boot/i386/efi
___________________________________________________________________
Modified: svn:mergeinfo
## -0,0 +0,1 ##
   Merged /projects/largeSMP/sys/boot/i386/efi:r221273-222812
Index: head/sys/boot/ia64/efi
===================================================================
--- head/sys/boot/ia64/efi	(revision 222812)
+++ head/sys/boot/ia64/efi	(revision 222813)

Property changes on: head/sys/boot/ia64/efi
___________________________________________________________________
Modified: svn:mergeinfo
## -0,0 +0,1 ##
   Merged /projects/largeSMP/sys/boot/ia64/efi:r221273-222812
Index: head/sys/boot/ia64/ski
===================================================================
--- head/sys/boot/ia64/ski	(revision 222812)
+++ head/sys/boot/ia64/ski	(revision 222813)

Property changes on: head/sys/boot/ia64/ski
___________________________________________________________________
Modified: svn:mergeinfo
## -0,0 +0,1 ##
   Merged /projects/largeSMP/sys/boot/ia64/ski:r221273-222812
Index: head/sys/boot/powerpc/boot1.chrp
===================================================================
--- head/sys/boot/powerpc/boot1.chrp	(revision 222812)
+++ head/sys/boot/powerpc/boot1.chrp	(revision 222813)

Property changes on: head/sys/boot/powerpc/boot1.chrp
___________________________________________________________________
Modified: svn:mergeinfo
## -0,0 +0,1 ##
   Merged /projects/largeSMP/sys/boot/powerpc/boot1.chrp:r221273-222812
Index: head/sys/boot/powerpc/ofw
===================================================================
--- head/sys/boot/powerpc/ofw	(revision 222812)
+++ head/sys/boot/powerpc/ofw	(revision 222813)

Property changes on: head/sys/boot/powerpc/ofw
___________________________________________________________________
Modified: svn:mergeinfo
## -0,0 +0,1 ##
   Merged /projects/largeSMP/sys/boot/powerpc/ofw:r221273-222812
Index: head/sys/boot
===================================================================
--- head/sys/boot	(revision 222812)
+++ head/sys/boot	(revision 222813)

Property changes on: head/sys/boot
___________________________________________________________________
Modified: svn:mergeinfo
## -0,0 +0,1 ##
   Merged /projects/largeSMP/sys/boot:r221273-222812
Index: head/sys/cddl/contrib/opensolaris
===================================================================
--- head/sys/cddl/contrib/opensolaris	(revision 222812)
+++ head/sys/cddl/contrib/opensolaris	(revision 222813)

Property changes on: head/sys/cddl/contrib/opensolaris
___________________________________________________________________
Modified: svn:mergeinfo
## -0,0 +0,1 ##
   Merged /projects/largeSMP/sys/cddl/contrib/opensolaris:r221273-222812
Index: head/sys/cddl/dev/cyclic/i386/cyclic_machdep.c
===================================================================
--- head/sys/cddl/dev/cyclic/i386/cyclic_machdep.c	(revision 222812)
+++ head/sys/cddl/dev/cyclic/i386/cyclic_machdep.c	(revision 222813)
@@ -1,129 +1,131 @@
 /*-
  * Copyright 2006-2008 John Birrell <jb@FreeBSD.org>
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
  * 1. Redistributions of source code must retain the above copyright
  *    notice, this list of conditions and the following disclaimer.
  * 2. Redistributions in binary form must reproduce the above copyright
  *    notice, this list of conditions and the following disclaimer in the
  *    documentation and/or other materials provided with the distribution.
  * 
  * THIS SOFTWARE IS PROVIDED BY AUTHOR AND CONTRIBUTORS ``AS IS'' AND
  * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
  * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
  * ARE DISCLAIMED.  IN NO EVENT SHALL AUTHOR OR CONTRIBUTORS BE LIABLE
  * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
  * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
  * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
  * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
  * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
  * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
  * SUCH DAMAGE.
  *
  * $FreeBSD$
  *
  */
 
 static void enable(cyb_arg_t);
 static void disable(cyb_arg_t);
 static void reprogram(cyb_arg_t, hrtime_t);
 static void xcall(cyb_arg_t, cpu_t *, cyc_func_t, void *);
 static void cyclic_clock(struct trapframe *frame);
 
 static cyc_backend_t	be	= {
 	NULL,		/* cyb_configure */
 	NULL,		/* cyb_unconfigure */
 	enable,
 	disable,
 	reprogram,
 	xcall,
 	NULL		/* cyb_arg_t cyb_arg */
 };
 
 static void
 cyclic_ap_start(void *dummy)
 {
 	/* Initialise the rest of the CPUs. */
 	cyclic_clock_func = cyclic_clock;
 	cyclic_mp_init();
 }
 
 SYSINIT(cyclic_ap_start, SI_SUB_SMP, SI_ORDER_ANY, cyclic_ap_start, NULL);
 
 /*
  *  Machine dependent cyclic subsystem initialisation.
  */
 static void
 cyclic_machdep_init(void)
 {
 	/* Register the cyclic backend. */
 	cyclic_init(&be);
 }
 
 static void
 cyclic_machdep_uninit(void)
 {
 	/* De-register the cyclic backend. */
 	cyclic_uninit();
 }
 
 /*
  * This function is the one registered by the machine dependent
  * initialiser as the callback for high speed timer events.
  */
 static void
 cyclic_clock(struct trapframe *frame)
 {
 	cpu_t *c = &solaris_cpu[curcpu];
 
 	if (c->cpu_cyclic != NULL) {
 		if (TRAPF_USERMODE(frame)) {
 			c->cpu_profile_pc = 0;
 			c->cpu_profile_upc = TRAPF_PC(frame);
 		} else {
 			c->cpu_profile_pc = TRAPF_PC(frame);
 			c->cpu_profile_upc = 0;
 		}
 
 		c->cpu_intr_actv = 1;
 
 		/* Fire any timers that are due. */
 		cyclic_fire(c);
 
 		c->cpu_intr_actv = 0;
 	}
 }
 
 static void
 enable(cyb_arg_t arg __unused)
 {
 
 }
 
 static void
 disable(cyb_arg_t arg __unused)
 {
 
 }
 
 static void
 reprogram(cyb_arg_t arg __unused, hrtime_t exp)
 {
 	struct bintime bt;
 	struct timespec ts;
 
 	ts.tv_sec = exp / 1000000000;
 	ts.tv_nsec = exp % 1000000000;
 	timespec2bintime(&ts, &bt);
 	clocksource_cyc_set(&bt);
 }
 
 static void xcall(cyb_arg_t arg __unused, cpu_t *c, cyc_func_t func,
     void *param)
 {
+	cpuset_t cpus;
 
-	smp_rendezvous_cpus((cpumask_t)1 << c->cpuid,
+	CPU_SETOF(c->cpuid, &cpus);
+	smp_rendezvous_cpus(cpus,
 	    smp_no_rendevous_barrier, func, smp_no_rendevous_barrier, param);
 }
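Note on the hunk above: the old code built a one-CPU mask with `(cpumask_t)1 << c->cpuid`, which only works while every CPU id fits in a single integer; the new code fills a `cpuset_t` with `CPU_SETOF()`. The following is a minimal userland sketch of that idea, not the kernel's implementation. The names `toy_cpuset_t`, `toy_setof`, `toy_or`, `toy_isset`, and `TOY_MAXCPU` are hypothetical stand-ins chosen for illustration only.

```c
#include <assert.h>
#include <limits.h>
#include <string.h>

/* Hypothetical model of a CPU set: an array of words, one bit per CPU. */
#define TOY_MAXCPU	256
#define TOY_WORDBITS	(sizeof(unsigned long) * CHAR_BIT)
#define TOY_NWORDS	((TOY_MAXCPU + TOY_WORDBITS - 1) / TOY_WORDBITS)

typedef struct {
	unsigned long bits[TOY_NWORDS];
} toy_cpuset_t;

/* Clear the set, then mark exactly one CPU (the role CPU_SETOF plays above). */
static void
toy_setof(unsigned int cpu, toy_cpuset_t *set)
{
	assert(cpu < TOY_MAXCPU);
	memset(set, 0, sizeof(*set));
	set->bits[cpu / TOY_WORDBITS] |= 1UL << (cpu % TOY_WORDBITS);
}

/* dst |= src, word by word (the role the two-argument CPU_OR plays in this commit). */
static void
toy_or(toy_cpuset_t *dst, const toy_cpuset_t *src)
{
	size_t i;

	for (i = 0; i < TOY_NWORDS; i++)
		dst->bits[i] |= src->bits[i];
}

/* Return non-zero if the given CPU is a member of the set. */
static int
toy_isset(unsigned int cpu, const toy_cpuset_t *set)
{
	assert(cpu < TOY_MAXCPU);
	return (((set->bits[cpu / TOY_WORDBITS] >> (cpu % TOY_WORDBITS)) & 1UL) != 0);
}
```

Under these assumptions, a CPU id such as 200 is representable even though it would overflow a shift on a 32-bit or 64-bit mask, which is the motivation the UPDATING entry for this revision gives for retiring cpumask_t.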
Index: head/sys/cddl/dev/dtrace/amd64/dtrace_subr.c
===================================================================
--- head/sys/cddl/dev/dtrace/amd64/dtrace_subr.c	(revision 222812)
+++ head/sys/cddl/dev/dtrace/amd64/dtrace_subr.c	(revision 222813)
@@ -1,516 +1,517 @@
 /*
  * CDDL HEADER START
  *
  * The contents of this file are subject to the terms of the
  * Common Development and Distribution License, Version 1.0 only
  * (the "License").  You may not use this file except in compliance
  * with the License.
  *
  * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
  * or http://www.opensolaris.org/os/licensing.
  * See the License for the specific language governing permissions
  * and limitations under the License.
  *
  * When distributing Covered Code, include this CDDL HEADER in each
  * file and include the License file at usr/src/OPENSOLARIS.LICENSE.
  * If applicable, add the following below this CDDL HEADER, with the
  * fields enclosed by brackets "[]" replaced with your own identifying
  * information: Portions Copyright [yyyy] [name of copyright owner]
  *
  * CDDL HEADER END
  *
  * $FreeBSD$
  *
  */
 /*
  * Copyright 2005 Sun Microsystems, Inc.  All rights reserved.
  * Use is subject to license terms.
  */
 
 #include <sys/param.h>
 #include <sys/systm.h>
 #include <sys/types.h>
 #include <sys/kernel.h>
 #include <sys/malloc.h>
 #include <sys/kmem.h>
 #include <sys/smp.h>
 #include <sys/dtrace_impl.h>
 #include <sys/dtrace_bsd.h>
 #include <machine/clock.h>
 #include <machine/frame.h>
 #include <vm/pmap.h>
 
 extern uintptr_t 	dtrace_in_probe_addr;
 extern int		dtrace_in_probe;
 
 int dtrace_invop(uintptr_t, uintptr_t *, uintptr_t);
 
 typedef struct dtrace_invop_hdlr {
 	int (*dtih_func)(uintptr_t, uintptr_t *, uintptr_t);
 	struct dtrace_invop_hdlr *dtih_next;
 } dtrace_invop_hdlr_t;
 
 dtrace_invop_hdlr_t *dtrace_invop_hdlr;
 
 int
 dtrace_invop(uintptr_t addr, uintptr_t *stack, uintptr_t eax)
 {
 	dtrace_invop_hdlr_t *hdlr;
 	int rval;
 
 	for (hdlr = dtrace_invop_hdlr; hdlr != NULL; hdlr = hdlr->dtih_next)
 		if ((rval = hdlr->dtih_func(addr, stack, eax)) != 0)
 			return (rval);
 
 	return (0);
 }
 
 void
 dtrace_invop_add(int (*func)(uintptr_t, uintptr_t *, uintptr_t))
 {
 	dtrace_invop_hdlr_t *hdlr;
 
 	hdlr = kmem_alloc(sizeof (dtrace_invop_hdlr_t), KM_SLEEP);
 	hdlr->dtih_func = func;
 	hdlr->dtih_next = dtrace_invop_hdlr;
 	dtrace_invop_hdlr = hdlr;
 }
 
 void
 dtrace_invop_remove(int (*func)(uintptr_t, uintptr_t *, uintptr_t))
 {
 	dtrace_invop_hdlr_t *hdlr = dtrace_invop_hdlr, *prev = NULL;
 
 	for (;;) {
 		if (hdlr == NULL)
 			panic("attempt to remove non-existent invop handler");
 
 		if (hdlr->dtih_func == func)
 			break;
 
 		prev = hdlr;
 		hdlr = hdlr->dtih_next;
 	}
 
 	if (prev == NULL) {
 		ASSERT(dtrace_invop_hdlr == hdlr);
 		dtrace_invop_hdlr = hdlr->dtih_next;
 	} else {
 		ASSERT(dtrace_invop_hdlr != hdlr);
 		prev->dtih_next = hdlr->dtih_next;
 	}
 
 	kmem_free(hdlr, 0);
 }
 
 /*ARGSUSED*/
 void
 dtrace_toxic_ranges(void (*func)(uintptr_t base, uintptr_t limit))
 {
 	(*func)(0, (uintptr_t) addr_PTmap);
 }
 
 void
 dtrace_xcall(processorid_t cpu, dtrace_xcall_t func, void *arg)
 {
-	cpumask_t cpus;
+	cpuset_t cpus;
 
 	if (cpu == DTRACE_CPUALL)
 		cpus = all_cpus;
 	else
-		cpus = (cpumask_t)1 << cpu;
+		CPU_SETOF(cpu, &cpus);
 
 	smp_rendezvous_cpus(cpus, smp_no_rendevous_barrier, func,
 	    smp_no_rendevous_barrier, arg);
 }
 
 static void
 dtrace_sync_func(void)
 {
 }
 
 void
 dtrace_sync(void)
 {
         dtrace_xcall(DTRACE_CPUALL, (dtrace_xcall_t)dtrace_sync_func, NULL);
 }
 
 #ifdef notyet
 int (*dtrace_fasttrap_probe_ptr)(struct regs *);
 int (*dtrace_pid_probe_ptr)(struct regs *);
 int (*dtrace_return_probe_ptr)(struct regs *);
 
 void
 dtrace_user_probe(struct regs *rp, caddr_t addr, processorid_t cpuid)
 {
 	krwlock_t *rwp;
 	proc_t *p = curproc;
 	extern void trap(struct regs *, caddr_t, processorid_t);
 
 	if (USERMODE(rp->r_cs) || (rp->r_ps & PS_VM)) {
 		if (curthread->t_cred != p->p_cred) {
 			cred_t *oldcred = curthread->t_cred;
 			/*
 			 * DTrace accesses t_cred in probe context.  t_cred
 			 * must always be either NULL, or point to a valid,
 			 * allocated cred structure.
 			 */
 			curthread->t_cred = crgetcred();
 			crfree(oldcred);
 		}
 	}
 
 	if (rp->r_trapno == T_DTRACE_RET) {
 		uint8_t step = curthread->t_dtrace_step;
 		uint8_t ret = curthread->t_dtrace_ret;
 		uintptr_t npc = curthread->t_dtrace_npc;
 
 		if (curthread->t_dtrace_ast) {
 			aston(curthread);
 			curthread->t_sig_check = 1;
 		}
 
 		/*
 		 * Clear all user tracing flags.
 		 */
 		curthread->t_dtrace_ft = 0;
 
 		/*
 		 * If we weren't expecting to take a return probe trap, kill
 		 * the process as though it had just executed an unassigned
 		 * trap instruction.
 		 */
 		if (step == 0) {
 			tsignal(curthread, SIGILL);
 			return;
 		}
 
 		/*
 		 * If we hit this trap unrelated to a return probe, we're
 		 * just here to reset the AST flag since we deferred a signal
 		 * until after we logically single-stepped the instruction we
 		 * copied out.
 		 */
 		if (ret == 0) {
 			rp->r_pc = npc;
 			return;
 		}
 
 		/*
 		 * We need to wait until after we've called the
 		 * dtrace_return_probe_ptr function pointer to set %pc.
 		 */
 		rwp = &CPU->cpu_ft_lock;
 		rw_enter(rwp, RW_READER);
 		if (dtrace_return_probe_ptr != NULL)
 			(void) (*dtrace_return_probe_ptr)(rp);
 		rw_exit(rwp);
 		rp->r_pc = npc;
 
 	} else if (rp->r_trapno == T_DTRACE_PROBE) {
 		rwp = &CPU->cpu_ft_lock;
 		rw_enter(rwp, RW_READER);
 		if (dtrace_fasttrap_probe_ptr != NULL)
 			(void) (*dtrace_fasttrap_probe_ptr)(rp);
 		rw_exit(rwp);
 
 	} else if (rp->r_trapno == T_BPTFLT) {
 		uint8_t instr;
 		rwp = &CPU->cpu_ft_lock;
 
 		/*
 		 * The DTrace fasttrap provider uses the breakpoint trap
 		 * (int 3). We let DTrace take the first crack at handling
 		 * this trap; if it's not a probe that DTrace knows about,
 		 * we call into the trap() routine to handle it like a
 		 * breakpoint placed by a conventional debugger.
 		 */
 		rw_enter(rwp, RW_READER);
 		if (dtrace_pid_probe_ptr != NULL &&
 		    (*dtrace_pid_probe_ptr)(rp) == 0) {
 			rw_exit(rwp);
 			return;
 		}
 		rw_exit(rwp);
 
 		/*
 		 * If the instruction that caused the breakpoint trap doesn't
 		 * look like an int 3 anymore, it may be that this tracepoint
 		 * was removed just after the user thread executed it. In
 		 * that case, return to user land to retry the instruction.
 		 */
 		if (fuword8((void *)(rp->r_pc - 1), &instr) == 0 &&
 		    instr != FASTTRAP_INSTR) {
 			rp->r_pc--;
 			return;
 		}
 
 		trap(rp, addr, cpuid);
 
 	} else {
 		trap(rp, addr, cpuid);
 	}
 }
 
 void
 dtrace_safe_synchronous_signal(void)
 {
 	kthread_t *t = curthread;
 	struct regs *rp = lwptoregs(ttolwp(t));
 	size_t isz = t->t_dtrace_npc - t->t_dtrace_pc;
 
 	ASSERT(t->t_dtrace_on);
 
 	/*
 	 * If we're not in the range of scratch addresses, we're not actually
 	 * tracing user instructions so turn off the flags. If the instruction
 	 * we copied out caused a synchronous trap, reset the pc back to its
 	 * original value and turn off the flags.
 	 */
 	if (rp->r_pc < t->t_dtrace_scrpc ||
 	    rp->r_pc > t->t_dtrace_astpc + isz) {
 		t->t_dtrace_ft = 0;
 	} else if (rp->r_pc == t->t_dtrace_scrpc ||
 	    rp->r_pc == t->t_dtrace_astpc) {
 		rp->r_pc = t->t_dtrace_pc;
 		t->t_dtrace_ft = 0;
 	}
 }
 
 int
 dtrace_safe_defer_signal(void)
 {
 	kthread_t *t = curthread;
 	struct regs *rp = lwptoregs(ttolwp(t));
 	size_t isz = t->t_dtrace_npc - t->t_dtrace_pc;
 
 	ASSERT(t->t_dtrace_on);
 
 	/*
 	 * If we're not in the range of scratch addresses, we're not actually
 	 * tracing user instructions so turn off the flags.
 	 */
 	if (rp->r_pc < t->t_dtrace_scrpc ||
 	    rp->r_pc > t->t_dtrace_astpc + isz) {
 		t->t_dtrace_ft = 0;
 		return (0);
 	}
 
 	/*
 	 * If we've executed the original instruction, but haven't performed
 	 * the jmp back to t->t_dtrace_npc or the clean up of any registers
 	 * used to emulate %rip-relative instructions in 64-bit mode, do that
 	 * here and take the signal right away. We detect this condition by
 	 * seeing if the program counter is the range [scrpc + isz, astpc).
 	 */
 	if (t->t_dtrace_astpc - rp->r_pc <
 	    t->t_dtrace_astpc - t->t_dtrace_scrpc - isz) {
 #ifdef __amd64
 		/*
 		 * If there is a scratch register and we're on the
 		 * instruction immediately after the modified instruction,
 		 * restore the value of that scratch register.
 		 */
 		if (t->t_dtrace_reg != 0 &&
 		    rp->r_pc == t->t_dtrace_scrpc + isz) {
 			switch (t->t_dtrace_reg) {
 			case REG_RAX:
 				rp->r_rax = t->t_dtrace_regv;
 				break;
 			case REG_RCX:
 				rp->r_rcx = t->t_dtrace_regv;
 				break;
 			case REG_R8:
 				rp->r_r8 = t->t_dtrace_regv;
 				break;
 			case REG_R9:
 				rp->r_r9 = t->t_dtrace_regv;
 				break;
 			}
 		}
 #endif
 		rp->r_pc = t->t_dtrace_npc;
 		t->t_dtrace_ft = 0;
 		return (0);
 	}
 
 	/*
 	 * Otherwise, make sure we'll return to the kernel after executing
 	 * the copied out instruction and defer the signal.
 	 */
 	if (!t->t_dtrace_step) {
 		ASSERT(rp->r_pc < t->t_dtrace_astpc);
 		rp->r_pc += t->t_dtrace_astpc - t->t_dtrace_scrpc;
 		t->t_dtrace_step = 1;
 	}
 
 	t->t_dtrace_ast = 1;
 
 	return (1);
 }
 #endif
 
 static int64_t	tgt_cpu_tsc;
 static int64_t	hst_cpu_tsc;
 static int64_t	tsc_skew[MAXCPU];
 static uint64_t	nsec_scale;
 
 /* See below for the explanation of this macro. */
 #define SCALE_SHIFT	28
 
 static void
 dtrace_gethrtime_init_cpu(void *arg)
 {
 	uintptr_t cpu = (uintptr_t) arg;
 
 	if (cpu == curcpu)
 		tgt_cpu_tsc = rdtsc();
 	else
 		hst_cpu_tsc = rdtsc();
 }
 
 static void
 dtrace_gethrtime_init(void *arg)
 {
 	struct pcpu *pc;
 	uint64_t tsc_f;
-	cpumask_t map;
+	cpuset_t map;
 	int i;
 
 	/*
 	 * Get TSC frequency known at this moment.
 	 * This should be constant if TSC is invariant.
 	 * Otherwise tick->time conversion will be inaccurate, but
 	 * will preserve monotonic property of TSC.
 	 */
 	tsc_f = atomic_load_acq_64(&tsc_freq);
 
 	/*
 	 * The following line checks that nsec_scale calculated below
 	 * doesn't overflow 32-bit unsigned integer, so that it can multiply
 	 * another 32-bit integer without overflowing 64-bit.
 	 * Thus minimum supported TSC frequency is 62.5MHz.
 	 */
 	KASSERT(tsc_f > (NANOSEC >> (32 - SCALE_SHIFT)), ("TSC frequency is too low"));
 
 	/*
 	 * We scale up NANOSEC/tsc_f ratio to preserve as much precision
 	 * as possible.
 	 * 2^28 factor was chosen quite arbitrarily from practical
 	 * considerations:
 	 * - it supports TSC frequencies as low as 62.5MHz (see above);
 	 * - it provides quite good precision (e < 0.01%) up to THz
 	 *   (terahertz) values;
 	 */
 	nsec_scale = ((uint64_t)NANOSEC << SCALE_SHIFT) / tsc_f;
 
 	/* The current CPU is the reference one. */
 	sched_pin();
 	tsc_skew[curcpu] = 0;
 	CPU_FOREACH(i) {
 		if (i == curcpu)
 			continue;
 
 		pc = pcpu_find(i);
-		map = PCPU_GET(cpumask) | pc->pc_cpumask;
+		map = PCPU_GET(cpumask);
+		CPU_OR(&map, &pc->pc_cpumask);
 
 		smp_rendezvous_cpus(map, NULL,
 		    dtrace_gethrtime_init_cpu,
 		    smp_no_rendevous_barrier, (void *)(uintptr_t) i);
 
 		tsc_skew[i] = tgt_cpu_tsc - hst_cpu_tsc;
 	}
 	sched_unpin();
 }
 
 SYSINIT(dtrace_gethrtime_init, SI_SUB_SMP, SI_ORDER_ANY, dtrace_gethrtime_init, NULL);
 
 /*
  * DTrace needs a high resolution time function which can
  * be called from a probe context and is guaranteed not to be
  * instrumented with probes itself.
  *
  * Returns nanoseconds since boot.
  */
 uint64_t
 dtrace_gethrtime()
 {
 	uint64_t tsc;
 	uint32_t lo;
 	uint32_t hi;
 
 	/*
 	 * We split TSC value into lower and higher 32-bit halves and separately
 	 * scale them with nsec_scale, then we scale them down by 2^28
 	 * (see nsec_scale calculations) taking into account 32-bit shift of
 	 * the higher half and finally add.
 	 */
 	tsc = rdtsc() + tsc_skew[curcpu];
 	lo = tsc;
 	hi = tsc >> 32;
 	return (((lo * nsec_scale) >> SCALE_SHIFT) +
 	    ((hi * nsec_scale) << (32 - SCALE_SHIFT)));
 }
 
 uint64_t
 dtrace_gethrestime(void)
 {
 	printf("%s(%d): XXX\n",__func__,__LINE__);
 	return (0);
 }
 
 /* Function to handle DTrace traps during probes. See amd64/amd64/trap.c */
 int
 dtrace_trap(struct trapframe *frame, u_int type)
 {
 	/*
 	 * A trap can occur while DTrace executes a probe. Before
 	 * executing the probe, DTrace blocks re-scheduling and sets
 	 * a flag in its per-cpu flags to indicate that it doesn't
 	 * want to fault. On returning from the probe, the no-fault
 	 * flag is cleared and finally re-scheduling is enabled.
 	 *
 	 * Check if DTrace has enabled 'no-fault' mode:
 	 *
 	 */
 	if ((cpu_core[curcpu].cpuc_dtrace_flags & CPU_DTRACE_NOFAULT) != 0) {
 		/*
 		 * There are only a couple of trap types that are expected.
 		 * All the rest will be handled in the usual way.
 		 */
 		switch (type) {
 		/* Privileged instruction fault. */
 		case T_PRIVINFLT:
 			break;
 		/* General protection fault. */
 		case T_PROTFLT:
 			/* Flag an illegal operation. */
 			cpu_core[curcpu].cpuc_dtrace_flags |= CPU_DTRACE_ILLOP;
 
 			/*
 			 * Offset the instruction pointer to the instruction
 			 * following the one causing the fault.
 			 */
 			frame->tf_rip += dtrace_instr_size((u_char *) frame->tf_rip);
 			return (1);
 		/* Page fault. */
 		case T_PAGEFLT:
 			/* Flag a bad address. */
 			cpu_core[curcpu].cpuc_dtrace_flags |= CPU_DTRACE_BADADDR;
 			cpu_core[curcpu].cpuc_dtrace_illval = frame->tf_addr;
 
 			/*
 			 * Offset the instruction pointer to the instruction
 			 * following the one causing the fault.
 			 */
 			frame->tf_rip += dtrace_instr_size((u_char *) frame->tf_rip);
 			return (1);
 		default:
 			/* Handle all other traps in the usual way. */
 			break;
 		}
 	}
 
 	/* Handle the trap in the usual way. */
 	return (0);
 }
Index: head/sys/cddl/dev/dtrace/i386/dtrace_subr.c
===================================================================
--- head/sys/cddl/dev/dtrace/i386/dtrace_subr.c	(revision 222812)
+++ head/sys/cddl/dev/dtrace/i386/dtrace_subr.c	(revision 222813)
@@ -1,513 +1,515 @@
 /*
  * CDDL HEADER START
  *
  * The contents of this file are subject to the terms of the
  * Common Development and Distribution License, Version 1.0 only
  * (the "License").  You may not use this file except in compliance
  * with the License.
  *
  * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
  * or http://www.opensolaris.org/os/licensing.
  * See the License for the specific language governing permissions
  * and limitations under the License.
  *
  * When distributing Covered Code, include this CDDL HEADER in each
  * file and include the License file at usr/src/OPENSOLARIS.LICENSE.
  * If applicable, add the following below this CDDL HEADER, with the
  * fields enclosed by brackets "[]" replaced with your own identifying
  * information: Portions Copyright [yyyy] [name of copyright owner]
  *
  * CDDL HEADER END
  *
  * $FreeBSD$
  *
  */
 /*
  * Copyright 2005 Sun Microsystems, Inc.  All rights reserved.
  * Use is subject to license terms.
  */
 
 #include <sys/param.h>
 #include <sys/systm.h>
 #include <sys/types.h>
+#include <sys/cpuset.h>
 #include <sys/kernel.h>
 #include <sys/malloc.h>
 #include <sys/kmem.h>
 #include <sys/smp.h>
 #include <sys/dtrace_impl.h>
 #include <sys/dtrace_bsd.h>
 #include <machine/clock.h>
 #include <machine/frame.h>
 #include <vm/pmap.h>
 
 extern uintptr_t 	kernelbase;
 extern uintptr_t 	dtrace_in_probe_addr;
 extern int		dtrace_in_probe;
 
 int dtrace_invop(uintptr_t, uintptr_t *, uintptr_t);
 
 typedef struct dtrace_invop_hdlr {
 	int (*dtih_func)(uintptr_t, uintptr_t *, uintptr_t);
 	struct dtrace_invop_hdlr *dtih_next;
 } dtrace_invop_hdlr_t;
 
 dtrace_invop_hdlr_t *dtrace_invop_hdlr;
 
 int
 dtrace_invop(uintptr_t addr, uintptr_t *stack, uintptr_t eax)
 {
 	dtrace_invop_hdlr_t *hdlr;
 	int rval;
 
 	for (hdlr = dtrace_invop_hdlr; hdlr != NULL; hdlr = hdlr->dtih_next)
 		if ((rval = hdlr->dtih_func(addr, stack, eax)) != 0)
 			return (rval);
 
 	return (0);
 }
 
 void
 dtrace_invop_add(int (*func)(uintptr_t, uintptr_t *, uintptr_t))
 {
 	dtrace_invop_hdlr_t *hdlr;
 
 	hdlr = kmem_alloc(sizeof (dtrace_invop_hdlr_t), KM_SLEEP);
 	hdlr->dtih_func = func;
 	hdlr->dtih_next = dtrace_invop_hdlr;
 	dtrace_invop_hdlr = hdlr;
 }
 
 void
 dtrace_invop_remove(int (*func)(uintptr_t, uintptr_t *, uintptr_t))
 {
 	dtrace_invop_hdlr_t *hdlr = dtrace_invop_hdlr, *prev = NULL;
 
 	for (;;) {
 		if (hdlr == NULL)
 			panic("attempt to remove non-existent invop handler");
 
 		if (hdlr->dtih_func == func)
 			break;
 
 		prev = hdlr;
 		hdlr = hdlr->dtih_next;
 	}
 
 	if (prev == NULL) {
 		ASSERT(dtrace_invop_hdlr == hdlr);
 		dtrace_invop_hdlr = hdlr->dtih_next;
 	} else {
 		ASSERT(dtrace_invop_hdlr != hdlr);
 		prev->dtih_next = hdlr->dtih_next;
 	}
 
 	kmem_free(hdlr, 0);
 }
 
 void
 dtrace_toxic_ranges(void (*func)(uintptr_t base, uintptr_t limit))
 {
 	(*func)(0, kernelbase);
 }
 
 void
 dtrace_xcall(processorid_t cpu, dtrace_xcall_t func, void *arg)
 {
-	cpumask_t cpus;
+	cpuset_t cpus;
 
 	if (cpu == DTRACE_CPUALL)
 		cpus = all_cpus;
 	else
-		cpus = (cpumask_t)1 << cpu;
+		CPU_SETOF(cpu, &cpus);
 
 	smp_rendezvous_cpus(cpus, smp_no_rendevous_barrier, func,
 	    smp_no_rendevous_barrier, arg);
 }
 
 static void
 dtrace_sync_func(void)
 {
 }
 
 void
 dtrace_sync(void)
 {
         dtrace_xcall(DTRACE_CPUALL, (dtrace_xcall_t)dtrace_sync_func, NULL);
 }
 
 #ifdef notyet
 int (*dtrace_fasttrap_probe_ptr)(struct regs *);
 int (*dtrace_pid_probe_ptr)(struct regs *);
 int (*dtrace_return_probe_ptr)(struct regs *);
 
 void
 dtrace_user_probe(struct regs *rp, caddr_t addr, processorid_t cpuid)
 {
 	krwlock_t *rwp;
 	proc_t *p = curproc;
 	extern void trap(struct regs *, caddr_t, processorid_t);
 
 	if (USERMODE(rp->r_cs) || (rp->r_ps & PS_VM)) {
 		if (curthread->t_cred != p->p_cred) {
 			cred_t *oldcred = curthread->t_cred;
 			/*
 			 * DTrace accesses t_cred in probe context.  t_cred
 			 * must always be either NULL, or point to a valid,
 			 * allocated cred structure.
 			 */
 			curthread->t_cred = crgetcred();
 			crfree(oldcred);
 		}
 	}
 
 	if (rp->r_trapno == T_DTRACE_RET) {
 		uint8_t step = curthread->t_dtrace_step;
 		uint8_t ret = curthread->t_dtrace_ret;
 		uintptr_t npc = curthread->t_dtrace_npc;
 
 		if (curthread->t_dtrace_ast) {
 			aston(curthread);
 			curthread->t_sig_check = 1;
 		}
 
 		/*
 		 * Clear all user tracing flags.
 		 */
 		curthread->t_dtrace_ft = 0;
 
 		/*
 		 * If we weren't expecting to take a return probe trap, kill
 		 * the process as though it had just executed an unassigned
 		 * trap instruction.
 		 */
 		if (step == 0) {
 			tsignal(curthread, SIGILL);
 			return;
 		}
 
 		/*
 		 * If we hit this trap unrelated to a return probe, we're
 		 * just here to reset the AST flag since we deferred a signal
 		 * until after we logically single-stepped the instruction we
 		 * copied out.
 		 */
 		if (ret == 0) {
 			rp->r_pc = npc;
 			return;
 		}
 
 		/*
 		 * We need to wait until after we've called the
 		 * dtrace_return_probe_ptr function pointer to set %pc.
 		 */
 		rwp = &CPU->cpu_ft_lock;
 		rw_enter(rwp, RW_READER);
 		if (dtrace_return_probe_ptr != NULL)
 			(void) (*dtrace_return_probe_ptr)(rp);
 		rw_exit(rwp);
 		rp->r_pc = npc;
 
 	} else if (rp->r_trapno == T_DTRACE_PROBE) {
 		rwp = &CPU->cpu_ft_lock;
 		rw_enter(rwp, RW_READER);
 		if (dtrace_fasttrap_probe_ptr != NULL)
 			(void) (*dtrace_fasttrap_probe_ptr)(rp);
 		rw_exit(rwp);
 
 	} else if (rp->r_trapno == T_BPTFLT) {
 		uint8_t instr;
 		rwp = &CPU->cpu_ft_lock;
 
 		/*
 		 * The DTrace fasttrap provider uses the breakpoint trap
 		 * (int 3). We let DTrace take the first crack at handling
 		 * this trap; if it's not a probe that DTrace knowns about,
 		 * we call into the trap() routine to handle it like a
 		 * breakpoint placed by a conventional debugger.
 		 */
 		rw_enter(rwp, RW_READER);
 		if (dtrace_pid_probe_ptr != NULL &&
 		    (*dtrace_pid_probe_ptr)(rp) == 0) {
 			rw_exit(rwp);
 			return;
 		}
 		rw_exit(rwp);
 
 		/*
 		 * If the instruction that caused the breakpoint trap doesn't
 		 * look like an int 3 anymore, it may be that this tracepoint
 		 * was removed just after the user thread executed it. In
 		 * that case, return to user land to retry the instuction.
 		 */
 		if (fuword8((void *)(rp->r_pc - 1), &instr) == 0 &&
 		    instr != FASTTRAP_INSTR) {
 			rp->r_pc--;
 			return;
 		}
 
 		trap(rp, addr, cpuid);
 
 	} else {
 		trap(rp, addr, cpuid);
 	}
 }
 
 void
 dtrace_safe_synchronous_signal(void)
 {
 	kthread_t *t = curthread;
 	struct regs *rp = lwptoregs(ttolwp(t));
 	size_t isz = t->t_dtrace_npc - t->t_dtrace_pc;
 
 	ASSERT(t->t_dtrace_on);
 
 	/*
 	 * If we're not in the range of scratch addresses, we're not actually
 	 * tracing user instructions so turn off the flags. If the instruction
 	 * we copied out caused a synchonous trap, reset the pc back to its
 	 * original value and turn off the flags.
 	 */
 	if (rp->r_pc < t->t_dtrace_scrpc ||
 	    rp->r_pc > t->t_dtrace_astpc + isz) {
 		t->t_dtrace_ft = 0;
 	} else if (rp->r_pc == t->t_dtrace_scrpc ||
 	    rp->r_pc == t->t_dtrace_astpc) {
 		rp->r_pc = t->t_dtrace_pc;
 		t->t_dtrace_ft = 0;
 	}
 }
 
 int
 dtrace_safe_defer_signal(void)
 {
 	kthread_t *t = curthread;
 	struct regs *rp = lwptoregs(ttolwp(t));
 	size_t isz = t->t_dtrace_npc - t->t_dtrace_pc;
 
 	ASSERT(t->t_dtrace_on);
 
 	/*
 	 * If we're not in the range of scratch addresses, we're not actually
 	 * tracing user instructions so turn off the flags.
 	 */
 	if (rp->r_pc < t->t_dtrace_scrpc ||
 	    rp->r_pc > t->t_dtrace_astpc + isz) {
 		t->t_dtrace_ft = 0;
 		return (0);
 	}
 
 	/*
 	 * If we've executed the original instruction, but haven't performed
 	 * the jmp back to t->t_dtrace_npc or the clean up of any registers
 	 * used to emulate %rip-relative instructions in 64-bit mode, do that
 	 * here and take the signal right away. We detect this condition by
 	 * seeing if the program counter is the range [scrpc + isz, astpc).
 	 */
 	if (t->t_dtrace_astpc - rp->r_pc <
 	    t->t_dtrace_astpc - t->t_dtrace_scrpc - isz) {
 #ifdef __amd64
 		/*
 		 * If there is a scratch register and we're on the
 		 * instruction immediately after the modified instruction,
 		 * restore the value of that scratch register.
 		 */
 		if (t->t_dtrace_reg != 0 &&
 		    rp->r_pc == t->t_dtrace_scrpc + isz) {
 			switch (t->t_dtrace_reg) {
 			case REG_RAX:
 				rp->r_rax = t->t_dtrace_regv;
 				break;
 			case REG_RCX:
 				rp->r_rcx = t->t_dtrace_regv;
 				break;
 			case REG_R8:
 				rp->r_r8 = t->t_dtrace_regv;
 				break;
 			case REG_R9:
 				rp->r_r9 = t->t_dtrace_regv;
 				break;
 			}
 		}
 #endif
 		rp->r_pc = t->t_dtrace_npc;
 		t->t_dtrace_ft = 0;
 		return (0);
 	}
 
 	/*
 	 * Otherwise, make sure we'll return to the kernel after executing
 	 * the copied out instruction and defer the signal.
 	 */
 	if (!t->t_dtrace_step) {
 		ASSERT(rp->r_pc < t->t_dtrace_astpc);
 		rp->r_pc += t->t_dtrace_astpc - t->t_dtrace_scrpc;
 		t->t_dtrace_step = 1;
 	}
 
 	t->t_dtrace_ast = 1;
 
 	return (1);
 }
 #endif
 
 static int64_t	tgt_cpu_tsc;
 static int64_t	hst_cpu_tsc;
 static int64_t	tsc_skew[MAXCPU];
 static uint64_t	nsec_scale;
 
 /* See below for the explanation of this macro. */
 #define SCALE_SHIFT	28
 
 static void
 dtrace_gethrtime_init_cpu(void *arg)
 {
 	uintptr_t cpu = (uintptr_t) arg;
 
 	if (cpu == curcpu)
 		tgt_cpu_tsc = rdtsc();
 	else
 		hst_cpu_tsc = rdtsc();
 }
 
 static void
 dtrace_gethrtime_init(void *arg)
 {
+	cpuset_t map;
 	struct pcpu *pc;
 	uint64_t tsc_f;
-	cpumask_t map;
 	int i;
 
 	/*
 	 * Get TSC frequency known at this moment.
 	 * This should be constant if TSC is invariant.
 	 * Otherwise tick->time conversion will be inaccurate, but
 	 * will preserve monotonic property of TSC.
 	 */
 	tsc_f = atomic_load_acq_64(&tsc_freq);
 
 	/*
 	 * The following line checks that nsec_scale calculated below
 	 * doesn't overflow 32-bit unsigned integer, so that it can multiply
 	 * another 32-bit integer without overflowing 64-bit.
 	 * Thus minimum supported TSC frequency is 62.5MHz.
 	 */
 	KASSERT(tsc_f > (NANOSEC >> (32 - SCALE_SHIFT)), ("TSC frequency is too low"));
 
 	/*
 	 * We scale up NANOSEC/tsc_f ratio to preserve as much precision
 	 * as possible.
 	 * 2^28 factor was chosen quite arbitrarily from practical
 	 * considerations:
 	 * - it supports TSC frequencies as low as 62.5MHz (see above);
 	 * - it provides quite good precision (e < 0.01%) up to THz
 	 *   (terahertz) values;
 	 */
 	nsec_scale = ((uint64_t)NANOSEC << SCALE_SHIFT) / tsc_f;
 
 	/* The current CPU is the reference one. */
 	sched_pin();
 	tsc_skew[curcpu] = 0;
 	CPU_FOREACH(i) {
 		if (i == curcpu)
 			continue;
 
 		pc = pcpu_find(i);
-		map = PCPU_GET(cpumask) | pc->pc_cpumask;
+		map = PCPU_GET(cpumask);
+		CPU_OR(&map, &pc->pc_cpumask);
 
 		smp_rendezvous_cpus(map, NULL,
 		    dtrace_gethrtime_init_cpu,
 		    smp_no_rendevous_barrier, (void *)(uintptr_t) i);
 
 		tsc_skew[i] = tgt_cpu_tsc - hst_cpu_tsc;
 	}
 	sched_unpin();
 }
 
 SYSINIT(dtrace_gethrtime_init, SI_SUB_SMP, SI_ORDER_ANY, dtrace_gethrtime_init, NULL);
 
 /*
  * DTrace needs a high resolution time function which can
  * be called from a probe context and guaranteed not to have
  * instrumented with probes itself.
  *
  * Returns nanoseconds since boot.
  */
 uint64_t
 dtrace_gethrtime()
 {
 	uint64_t tsc;
 	uint32_t lo;
 	uint32_t hi;
 
 	/*
 	 * We split TSC value into lower and higher 32-bit halves and separately
 	 * scale them with nsec_scale, then we scale them down by 2^28
 	 * (see nsec_scale calculations) taking into account 32-bit shift of
 	 * the higher half and finally add.
 	 */
 	tsc = rdtsc() + tsc_skew[curcpu];
 	lo = tsc;
 	hi = tsc >> 32;
 	return (((lo * nsec_scale) >> SCALE_SHIFT) +
 	    ((hi * nsec_scale) << (32 - SCALE_SHIFT)));
 }
 
 uint64_t
 dtrace_gethrestime(void)
 {
 	printf("%s(%d): XXX\n",__func__,__LINE__);
 	return (0);
 }
 
 /* Function to handle DTrace traps during probes. See i386/i386/trap.c */
 int
 dtrace_trap(struct trapframe *frame, u_int type)
 {
 	/*
 	 * A trap can occur while DTrace executes a probe. Before
 	 * executing the probe, DTrace blocks re-scheduling and sets
 	 * a flag in it's per-cpu flags to indicate that it doesn't
 	 * want to fault. On returning from the probe, the no-fault
 	 * flag is cleared and finally re-scheduling is enabled.
 	 *
 	 * Check if DTrace has enabled 'no-fault' mode:
 	 *
 	 */
 	if ((cpu_core[curcpu].cpuc_dtrace_flags & CPU_DTRACE_NOFAULT) != 0) {
 		/*
 		 * There are only a couple of trap types that are expected.
 		 * All the rest will be handled in the usual way.
 		 */
 		switch (type) {
 		/* General protection fault. */
 		case T_PROTFLT:
 			/* Flag an illegal operation. */
 			cpu_core[curcpu].cpuc_dtrace_flags |= CPU_DTRACE_ILLOP;
 
 			/*
 			 * Offset the instruction pointer to the instruction
 			 * following the one causing the fault.
 			 */
 			frame->tf_eip += dtrace_instr_size((u_char *) frame->tf_eip);
 			return (1);
 		/* Page fault. */
 		case T_PAGEFLT:
 			/* Flag a bad address. */
 			cpu_core[curcpu].cpuc_dtrace_flags |= CPU_DTRACE_BADADDR;
 			cpu_core[curcpu].cpuc_dtrace_illval = rcr2();
 
 			/*
 			 * Offset the instruction pointer to the instruction
 			 * following the one causing the fault.
 			 */
 			frame->tf_eip += dtrace_instr_size((u_char *) frame->tf_eip);
 			return (1);
 		default:
 			/* Handle all other traps in the usual way. */
 			break;
 		}
 	}
 
 	/* Handle the trap in the usual way. */
 	return (0);
 }
Index: head/sys/conf/NOTES
===================================================================
--- head/sys/conf/NOTES	(revision 222812)
+++ head/sys/conf/NOTES	(revision 222813)
@@ -1,2960 +1,2963 @@
 # $FreeBSD$
 #
 # NOTES -- Lines that can be cut/pasted into kernel and hints configs.
 #
 # Lines that begin with 'device', 'options', 'machine', 'ident', 'maxusers',
 # 'makeoptions', 'hints', etc. go into the kernel configuration that you
 # run config(8) with.
 #
 # Lines that begin with 'hint.' are NOT for config(8), they go into your
 # hints file.  See /boot/device.hints and/or the 'hints' config(8) directive.
 #
 # Please use ``make LINT'' to create an old-style LINT file if you want to
 # do kernel test-builds.
 #
 # This file contains machine independent kernel configuration notes.  For
 # machine dependent notes, look in /sys/<arch>/conf/NOTES.
 #
 
 #
 # NOTES conventions and style guide:
 #
 # Large block comments should begin and end with a line containing only a
 # comment character.
 #
 # To describe a particular object, a block comment (if it exists) should
 # come first.  Next should come device, options, and hints lines in that
 # order.  All device and option lines must be described by a comment that
 # doesn't just expand the device or option name.  Use only a concise
 # comment on the same line if possible.  Very detailed descriptions of
 # devices and subsystems belong in man pages.
 #
 # A space followed by a tab separates 'options' from an option name.  Two
 # spaces followed by a tab separate 'device' from a device name.  Comments
 # after an option or device should use one space after the comment character.
 # To comment out a negative option that disables code and thus should not be
 # enabled for LINT builds, precede 'options' with "#!".
 #
 
 #
 # This is the ``identification'' of the kernel.  Usually this should
 # be the same as the name of your kernel.
 #
 ident		LINT
 
 #
 # The `maxusers' parameter controls the static sizing of a number of
 # internal system tables by a formula defined in subr_param.c.
 # Omitting this parameter or setting it to 0 will cause the system to
 # auto-size based on physical memory.
 #
 maxusers	10
 
 # To statically compile in device wiring instead of /boot/device.hints
 #hints		"LINT.hints"		# Default places to look for devices.
 
 # Use the following to compile in values accessible to the kernel
 # through getenv() (or kenv(1) in userland). The format of the file
 # is 'variable=value', see kenv(1)
 #
 #env		"LINT.env"
 
 #
 # The `makeoptions' parameter allows variables to be passed to the
 # generated Makefile in the build area.
 #
 # CONF_CFLAGS gives some extra compiler flags that are added to ${CFLAGS}
 # after most other flags.  Here we use it to inhibit use of non-optimal
 # gcc built-in functions (e.g., memcmp).
 #
 # DEBUG happens to be magic.
 # The following is equivalent to 'config -g KERNELNAME' and creates
 # 'kernel.debug' compiled with -g debugging as well as a normal
 # 'kernel'.  Use 'make install.debug' to install the debug kernel
 # but that isn't normally necessary as the debug symbols are not loaded
 # by the kernel and are not useful there anyway.
 #
 # KERNEL can be overridden so that you can change the default name of your
 # kernel.
 #
 # MODULES_OVERRIDE can be used to limit modules built to a specific list.
 #
 makeoptions	CONF_CFLAGS=-fno-builtin  #Don't allow use of memcmp, etc.
 #makeoptions	DEBUG=-g		#Build kernel with gdb(1) debug symbols
 #makeoptions	KERNEL=foo		#Build kernel "foo" and install "/foo"
 # Only build ext2fs module plus those parts of the sound system I need.
 #makeoptions	MODULES_OVERRIDE="ext2fs sound/sound sound/driver/maestro3"
 makeoptions	DESTDIR=/tmp
 
 #
 # FreeBSD processes are subject to certain limits to their consumption
 # of system resources.  See getrlimit(2) for more details.  Each
 # resource limit has two values, a "soft" limit and a "hard" limit.
 # The soft limits can be modified during normal system operation, but
 # the hard limits are set at boot time.  Their default values are
 # in sys/<arch>/include/vmparam.h.  There are two ways to change them:
 # 
 # 1.  Set the values at kernel build time.  The options below are one
 #     way to allow that limit to grow to 1GB.  They can be increased
 #     further by changing the parameters:
 #	
 # 2.  In /boot/loader.conf, set the tunables kern.maxswzone,
 #     kern.maxbcache, kern.maxtsiz, kern.dfldsiz, kern.maxdsiz,
 #     kern.dflssiz, kern.maxssiz and kern.sgrowsiz.
 #
 # The options in /boot/loader.conf override anything in the kernel
 # configuration file.  See the function init_param1 in
 # sys/kern/subr_param.c for more details.
 #
 
 options 	MAXDSIZ=(1024UL*1024*1024)
 options 	MAXSSIZ=(128UL*1024*1024)
 options 	DFLDSIZ=(1024UL*1024*1024)
 
 #
 # BLKDEV_IOSIZE sets the default block size used in user block
 # device I/O.  Note that this value will be overridden by the label
 # when specifying a block device from a label with a non-0
 # partition blocksize.  The default is PAGE_SIZE.
 #
 options 	BLKDEV_IOSIZE=8192
 
 #
 # MAXPHYS and DFLTPHYS
 #
 # These are the maximal and safe 'raw' I/O block device access sizes.
 # Reads and writes will be split into MAXPHYS chunks for known good
 # devices and DFLTPHYS for the rest. Some applications have better
 # performance with larger raw I/O access sizes. Note that certain VM
 # parameters are derived from these values and making them too large
 # can make an an unbootable kernel.
 #
 # The defaults are 64K and 128K respectively.
 options 	DFLTPHYS=(64*1024)
 options 	MAXPHYS=(128*1024)
 
 
 # This allows you to actually store this configuration file into
 # the kernel binary itself. See config(8) for more details.
 #
 options 	INCLUDE_CONFIG_FILE     # Include this file in kernel
 
 options 	GEOM_AES		# Don't use, use GEOM_BDE
 options 	GEOM_BDE		# Disk encryption.
 options 	GEOM_BSD		# BSD disklabels
 options 	GEOM_CACHE		# Disk cache.
 options 	GEOM_CONCAT		# Disk concatenation.
 options 	GEOM_ELI		# Disk encryption.
 options 	GEOM_FOX		# Redundant path mitigation
 options 	GEOM_GATE		# Userland services.
 options 	GEOM_JOURNAL		# Journaling.
 options 	GEOM_LABEL		# Providers labelization.
 options 	GEOM_LINUX_LVM		# Linux LVM2 volumes
 options 	GEOM_MBR		# DOS/MBR partitioning
 options 	GEOM_MIRROR		# Disk mirroring.
 options 	GEOM_MULTIPATH		# Disk multipath
 options 	GEOM_NOP		# Test class.
 options 	GEOM_PART_APM		# Apple partitioning
 options 	GEOM_PART_BSD		# BSD disklabel
 options 	GEOM_PART_EBR		# Extended Boot Records
 options 	GEOM_PART_EBR_COMPAT	# Backward compatible partition names
 options 	GEOM_PART_GPT		# GPT partitioning
 options 	GEOM_PART_MBR		# MBR partitioning
 options 	GEOM_PART_PC98		# PC-9800 disk partitioning
 options 	GEOM_PART_VTOC8		# SMI VTOC8 disk label
 options 	GEOM_PC98		# NEC PC9800 partitioning
 options 	GEOM_RAID		# Soft RAID functionality.
 options 	GEOM_RAID3		# RAID3 functionality.
 options 	GEOM_SHSEC		# Shared secret.
 options 	GEOM_STRIPE		# Disk striping.
 options 	GEOM_SUNLABEL		# Sun/Solaris partitioning
 options 	GEOM_UZIP		# Read-only compressed disks
 options 	GEOM_VIRSTOR		# Virtual storage.
 options 	GEOM_VOL		# Volume names from UFS superblock
 options 	GEOM_ZERO		# Performance testing helper.
 
 #
 # The root device and filesystem type can be compiled in;
 # this provides a fallback option if the root device cannot
 # be correctly guessed by the bootstrap code, or an override if
 # the RB_DFLTROOT flag (-r) is specified when booting the kernel.
 #
 options 	ROOTDEVNAME=\"ufs:da0s2e\"
 
 
 #####################################################################
 # Scheduler options:
 #
 # Specifying one of SCHED_4BSD or SCHED_ULE is mandatory.  These options
 # select which scheduler is compiled in.
 #
 # SCHED_4BSD is the historical, proven, BSD scheduler.  It has a global run
 # queue and no CPU affinity which makes it suboptimal for SMP.  It has very
 # good interactivity and priority selection.
 #
 # SCHED_ULE provides significant performance advantages over 4BSD on many
 # workloads on SMP machines.  It supports cpu-affinity, per-cpu runqueues
 # and scheduler locks.  It also has a stronger notion of interactivity 
 # which leads to better responsiveness even on uniprocessor machines.  This
 # is the default scheduler.
 #
 # SCHED_STATS is a debugging option which keeps some stats in the sysctl
 # tree at 'kern.sched.stats' and is useful for debugging scheduling decisions.
 #
 options 	SCHED_4BSD
 options 	SCHED_STATS
 #options 	SCHED_ULE
 
 #####################################################################
 # SMP OPTIONS:
 #
 # SMP enables building of a Symmetric MultiProcessor Kernel.
 
 # Mandatory:
 options 	SMP			# Symmetric MultiProcessor Kernel
 
 # ADAPTIVE_MUTEXES changes the behavior of blocking mutexes to spin
 # if the thread that currently owns the mutex is executing on another
 # CPU.  This behaviour is enabled by default, so this option can be used
 # to disable it.
 options 	NO_ADAPTIVE_MUTEXES
 
 # ADAPTIVE_RWLOCKS changes the behavior of reader/writer locks to spin
 # if the thread that currently owns the rwlock is executing on another
 # CPU.  This behaviour is enabled by default, so this option can be used
 # to disable it.
 options 	NO_ADAPTIVE_RWLOCKS
 
 # ADAPTIVE_SX changes the behavior of sx locks to spin if the thread that
 # currently owns the sx lock is executing on another CPU.
 # This behaviour is enabled by default, so this option can be used to
 # disable it.
 options 	NO_ADAPTIVE_SX
 
 # MUTEX_NOINLINE forces mutex operations to call functions to perform each
 # operation rather than inlining the simple cases.  This can be used to
 # shrink the size of the kernel text segment.  Note that this behavior is
 # already implied by the INVARIANT_SUPPORT, INVARIANTS, KTR, LOCK_PROFILING,
 # and WITNESS options.
 options 	MUTEX_NOINLINE
 
 # RWLOCK_NOINLINE forces rwlock operations to call functions to perform each
 # operation rather than inlining the simple cases.  This can be used to
 # shrink the size of the kernel text segment.  Note that this behavior is
 # already implied by the INVARIANT_SUPPORT, INVARIANTS, KTR, LOCK_PROFILING,
 # and WITNESS options.
 options 	RWLOCK_NOINLINE
 
 # SX_NOINLINE forces sx lock operations to call functions to perform each
 # operation rather than inlining the simple cases.  This can be used to
 # shrink the size of the kernel text segment.  Note that this behavior is
 # already implied by the INVARIANT_SUPPORT, INVARIANTS, KTR, LOCK_PROFILING,
 # and WITNESS options.
 options 	SX_NOINLINE
 
 # SMP Debugging Options:
 #
 # PREEMPTION allows the threads that are in the kernel to be preempted by
 #	  higher priority [interrupt] threads.  It helps with interactivity
 #	  and allows interrupt threads to run sooner rather than waiting.
 #	  WARNING! Only tested on amd64 and i386.
 # FULL_PREEMPTION instructs the kernel to preempt non-realtime kernel
 #	  threads.  Its sole use is to expose race conditions and other
 #	  bugs during development.  Enabling this option will reduce
 #	  performance and increase the frequency of kernel panics by
 #	  design.  If you aren't sure that you need it then you don't.
 #	  Relies on the PREEMPTION option.  DON'T TURN THIS ON.
 # MUTEX_DEBUG enables various extra assertions in the mutex code.
 # SLEEPQUEUE_PROFILING enables rudimentary profiling of the hash table
 #	  used to hold active sleep queues as well as sleep wait message
 #	  frequency.
 # TURNSTILE_PROFILING enables rudimentary profiling of the hash table
 #	  used to hold active lock queues.
 # WITNESS enables the witness code which detects deadlocks and cycles
 #         during locking operations.
 # WITNESS_KDB causes the witness code to drop into the kernel debugger if
 #	  a lock hierarchy violation occurs or if locks are held when going to
 #	  sleep.
 # WITNESS_SKIPSPIN disables the witness checks on spin mutexes.
 options 	PREEMPTION
 options 	FULL_PREEMPTION
 options 	MUTEX_DEBUG
 options 	WITNESS
 options 	WITNESS_KDB
 options 	WITNESS_SKIPSPIN
 
 # LOCK_PROFILING - Profiling locks.  See LOCK_PROFILING(9) for details.
 options 	LOCK_PROFILING
 # Set the number of buffers and the hash size.  The hash size MUST be larger
 # than the number of buffers.  Hash size should be prime.
 options 	MPROF_BUFFERS="1536"
 options 	MPROF_HASH_SIZE="1543"
 
 # Profiling for internal hash tables.
 options 	SLEEPQUEUE_PROFILING
 options 	TURNSTILE_PROFILING
 
 
 #####################################################################
 # COMPATIBILITY OPTIONS
 
 #
 # Implement system calls compatible with 4.3BSD and older versions of
 # FreeBSD.  You probably do NOT want to remove this as much current code
 # still relies on the 4.3 emulation.  Note that some architectures that
 # are supported by FreeBSD do not include support for certain important
 # aspects of this compatibility option, namely those related to the
 # signal delivery mechanism.
 #
 options 	COMPAT_43
 
 # Old tty interface.
 options 	COMPAT_43TTY
 
 # Note that as a general rule, COMPAT_FREEBSD<n> depends on
 # COMPAT_FREEBSD<n+1>, COMPAT_FREEBSD<n+2>, etc.
 
 # Enable FreeBSD4 compatibility syscalls
 options 	COMPAT_FREEBSD4
 
 # Enable FreeBSD5 compatibility syscalls
 options 	COMPAT_FREEBSD5
 
 # Enable FreeBSD6 compatibility syscalls
 options 	COMPAT_FREEBSD6
 
 # Enable FreeBSD7 compatibility syscalls
 options 	COMPAT_FREEBSD7
 
 #
 # These three options provide support for System V Interface
 # Definition-style interprocess communication, in the form of shared
 # memory, semaphores, and message queues, respectively.
 #
 options 	SYSVSHM
 options 	SYSVSEM
 options 	SYSVMSG
 
 
 #####################################################################
 # DEBUGGING OPTIONS
 
 #
 # Compile with kernel debugger related code.
 #
 options 	KDB
 
 #
 # Print a stack trace of the current thread on the console for a panic.
 #
 options 	KDB_TRACE
 
 #
 # Don't enter the debugger for a panic. Intended for unattended operation
 # where you may want to enter the debugger from the console, but still want
 # the machine to recover from a panic.
 #
 options 	KDB_UNATTENDED
 
 #
 # Enable the ddb debugger backend.
 #
 options 	DDB
 
 #
 # Print the numerical value of symbols in addition to the symbolic
 # representation.
 #
 options 	DDB_NUMSYM
 
 #
 # Enable the remote gdb debugger backend.
 #
 options 	GDB
 
 #
 # SYSCTL_DEBUG enables a 'sysctl' debug tree that can be used to dump the
 # contents of the registered sysctl nodes on the console.  It is disabled by
 # default because it generates excessively verbose console output that can
 # interfere with serial console operation.
 #
 options 	SYSCTL_DEBUG
 
 #
 # NO_SYSCTL_DESCR omits the sysctl node descriptions to save space in the
 # resulting kernel.
 options		NO_SYSCTL_DESCR
 
 #
 # MALLOC_DEBUG_MAXZONES enables multiple uma zones for malloc(9)
 # allocations that are smaller than a page.  The purpose is to isolate
 # different malloc types into hash classes, so that any buffer
 # overruns or use-after-free will usually only affect memory from
 # malloc types in that hash class.  This is purely a debugging tool;
 # by varying the hash function and tracking which hash class was
 # corrupted, the intersection of the hash classes from each instance
 # will point to a single malloc type that is being misused.  At this
 # point inspection or memguard(9) can be used to catch the offending
 # code.
 #
 options 	MALLOC_DEBUG_MAXZONES=8
 
 #
 # DEBUG_MEMGUARD builds and enables memguard(9), a replacement allocator
 # for the kernel used to detect modify-after-free scenarios.  See the
 # memguard(9) man page for more information on usage.
 #
 options 	DEBUG_MEMGUARD
 
 #
 # DEBUG_REDZONE enables buffer underflows and buffer overflows detection for
 # malloc(9).
 #
 options 	DEBUG_REDZONE
 
 #
 # KTRACE enables the system-call tracing facility ktrace(2).  To be more
 # SMP-friendly, KTRACE uses a worker thread to process most trace events
 # asynchronously to the thread generating the event.  This requires a
 # pre-allocated store of objects representing trace events.  The
 # KTRACE_REQUEST_POOL option specifies the initial size of this store.
 # The size of the pool can be adjusted both at boottime and runtime via
 # the kern.ktrace_request_pool tunable and sysctl.
 #
 options 	KTRACE			#kernel tracing
 options 	KTRACE_REQUEST_POOL=101
 
 #
 # KTR is a kernel tracing facility imported from BSD/OS.  It is
 # enabled with the KTR option.  KTR_ENTRIES defines the number of
 # entries in the circular trace buffer; it must be a power of two.
 # KTR_COMPILE defines the mask of events to compile into the kernel as
 # defined by the KTR_* constants in <sys/ktr.h>.  KTR_MASK defines the
 # initial value of the ktr_mask variable which determines at runtime
 # what events to trace.  KTR_CPUMASK determines which CPU's log
-# events, with bit X corresponding to CPU X.  KTR_VERBOSE enables
+# events, with bit X corresponding to CPU X.  The layout of the string
+# passed as KTR_CPUMASK must match a series of bitmasks, each of them
+# separated by the ", " characters (e.g.
+# KTR_CPUMASK=("0xAF, 0xFFFFFFFFFFFFFFFF")).  KTR_VERBOSE enables
 # dumping of KTR events to the console by default.  This functionality
 # can be toggled via the debug.ktr_verbose sysctl and defaults to off
 # if KTR_VERBOSE is not defined.  See ktr(4) and ktrdump(8) for details.
 #
 options 	KTR
 options 	KTR_ENTRIES=1024
 options 	KTR_COMPILE=(KTR_INTR|KTR_PROC)
 options 	KTR_MASK=KTR_INTR
-options 	KTR_CPUMASK=0x3
+options 	KTR_CPUMASK=("0x3")
 options 	KTR_VERBOSE
 
 #
 # ALQ(9) is a facility for the asynchronous queuing of records from the kernel
 # to a vnode, and is employed by services such as ktr(4) to produce trace
 # files based on a kernel event stream.  Records are written asynchronously
 # in a worker thread.
 #
 options 	ALQ
 options 	KTR_ALQ
 
 #
 # The INVARIANTS option is used in a number of source files to enable
 # extra sanity checking of internal structures.  This support is not
 # enabled by default because of the extra time it would take to check
 # for these conditions, which can only occur as a result of
 # programming errors.
 #
 options 	INVARIANTS
 
 #
 # The INVARIANT_SUPPORT option makes us compile in support for
 # verifying some of the internal structures.  It is a prerequisite for
 # 'INVARIANTS', as enabling 'INVARIANTS' will make these functions be
 # called.  The intent is that you can set 'INVARIANTS' for single
 # source files (by changing the source file or specifying it on the
 # command line) if you have 'INVARIANT_SUPPORT' enabled.  Also, if you
 # wish to build a kernel module with 'INVARIANTS', then adding
 # 'INVARIANT_SUPPORT' to your kernel will provide all the necessary
 # infrastructure without the added overhead.
 #
 options 	INVARIANT_SUPPORT
 
 #
 # The DIAGNOSTIC option is used to enable extra debugging information
 # from some parts of the kernel.  As this makes everything more noisy,
 # it is disabled by default.
 #
 options 	DIAGNOSTIC
 
 #
 # REGRESSION causes optional kernel interfaces necessary only for regression
 # testing to be enabled.  These interfaces may constitute security risks
 # when enabled, as they permit processes to easily modify aspects of the
 # run-time environment to reproduce unlikely or unusual (possibly normally
 # impossible) scenarios.
 #
 options 	REGRESSION
 
 #
 # RESTARTABLE_PANICS allows one to continue from a panic as if it were
 # a call to the debugger to continue from a panic as instead.  It is only
 # useful if a kernel debugger is present.  To restart from a panic, reset
 # the panicstr variable to NULL and continue execution.  This option is
 # for development use only and should NOT be used in production systems
 # to "workaround" a panic.
 #
 #options 	RESTARTABLE_PANICS
 
 #
 # This option lets some drivers co-exist that can't co-exist in a running
 # system.  This is used to be able to compile all kernel code in one go for
 # quality assurance purposes (like this file, which the option takes it name
 # from.)
 #
 options 	COMPILING_LINT
 
 #
 # STACK enables the stack(9) facility, allowing the capture of kernel stack
 # for the purpose of procinfo(1), etc.  stack(9) will also be compiled in
 # automatically if DDB(4) is compiled into the kernel.
 #
 options 	STACK
 
 
 #####################################################################
 # PERFORMANCE MONITORING OPTIONS
 
 #
 # The hwpmc driver that allows the use of in-CPU performance monitoring
 # counters for performance monitoring.  The base kernel needs to be configured
 # with the 'options' line, while the hwpmc device can be either compiled
 # in or loaded as a loadable kernel module.
 #
 # Additional configuration options may be required on specific architectures,
 # please see hwpmc(4).
 
 device		hwpmc			# Driver (also a loadable module)
 options 	HWPMC_HOOKS		# Other necessary kernel hooks
 
 
 #####################################################################
 # NETWORKING OPTIONS
 
 #
 # Protocol families
 #
 options 	INET			#Internet communications protocols
 options 	INET6			#IPv6 communications protocols
 
 options 	ROUTETABLES=2		# max 16. 1 is back compatible.
 
 # In order to enable IPSEC you MUST also add device crypto to 
 # your kernel configuration
 options 	IPSEC			#IP security (requires device crypto)
 #options 	IPSEC_DEBUG		#debug for IP security
 #
 # #DEPRECATED#
 # Set IPSEC_FILTERTUNNEL to change the default of the sysctl to force packets
 # coming through a tunnel to be processed by any configured packet filtering
 # twice. The default is that packets coming out of a tunnel are _not_ processed;
 # they are assumed trusted.
 #
 # IPSEC history is preserved for such packets, and can be filtered
 # using ipfw(8)'s 'ipsec' keyword, when this option is enabled.
 #
 #options 	IPSEC_FILTERTUNNEL	#filter ipsec packets from a tunnel
 #
 # Set IPSEC_NAT_T to enable NAT-Traversal support.  This enables
 # optional UDP encapsulation of ESP packets.
 #
 options		IPSEC_NAT_T		#NAT-T support, UDP encap of ESP
 
 options 	IPX			#IPX/SPX communications protocols
 
 options 	NCP			#NetWare Core protocol
 
 options 	NETATALK		#Appletalk communications protocols
 options 	NETATALKDEBUG		#Appletalk debugging
 
 #
 # SMB/CIFS requester
 # NETSMB enables support for SMB protocol, it requires LIBMCHAIN and LIBICONV
 # options.
 options 	NETSMB			#SMB/CIFS requester
 
 # mchain library. It can be either loaded as KLD or compiled into kernel
 options 	LIBMCHAIN
 
 # libalias library, performing NAT
 options 	LIBALIAS
 
 # flowtable cache
 options 	FLOWTABLE
 
 #
 # SCTP is a NEW transport protocol defined by
 # RFC2960 updated by RFC3309 and RFC3758.. and
 # soon to have a new base RFC and many many more
 # extensions. This release supports all the extensions
 # including many drafts (most about to become RFC's).
 # It is the reference implementation of SCTP
 # and is quite well tested.
 #
 # Note YOU MUST have both INET and INET6 defined.
 # You don't have to enable V6, but SCTP is 
 # dual stacked and so far we have not torn apart
 # the V6 and V4.. since an association can span
 # both a V6 and V4 address at the SAME time :-)
 #
 options 	SCTP
 # There are bunches of options:
 # this one turns on all sorts of
 # nastly printing that you can
 # do. It's all controlled by a
 # bit mask (settable by socket opt and
 # by sysctl). Including will not cause
 # logging until you set the bits.. but it
 # can be quite verbose.. so without this
 # option we don't do any of the tests for
 # bits and prints.. which makes the code run
 # faster.. if you are not debugging don't use.
 options 	SCTP_DEBUG
 #
 # This option turns off the CRC32c checksum. Basically,
 # you will not be able to talk to anyone else who
 # has not done this. Its more for experimentation to
 # see how much CPU the CRC32c really takes. Most new
 # cards for TCP support checksum offload.. so this 
 # option gives you a "view" into what SCTP would be
 # like with such an offload (which only exists in
 # high in iSCSI boards so far). With the new
 # splitting 8's algorithm its not as bad as it used
 # to be.. but it does speed things up try only
 # for in a captured lab environment :-)
 options 	SCTP_WITH_NO_CSUM
 #
 
 #
 # All that options after that turn on specific types of
 # logging. You can monitor CWND growth, flight size
 # and all sorts of things. Go look at the code and
 # see. I have used this to produce interesting 
 # charts and graphs as well :->
 # 
 # I have not yet committed the tools to get and print
 # the logs, I will do that eventually .. before then
 # if you want them send me an email rrs@freebsd.org
 # You basically must have ktr(4) enabled for these
 # and you then set the sysctl to turn on/off various
 # logging bits. Use ktrdump(8) to pull the log and run
 # it through a display program.. and graphs and other
 # things too.
 #
 options 	SCTP_LOCK_LOGGING
 options 	SCTP_MBUF_LOGGING
 options 	SCTP_MBCNT_LOGGING
 options 	SCTP_PACKET_LOGGING
 options 	SCTP_LTRACE_CHUNKS
 options 	SCTP_LTRACE_ERRORS
 
 
 # altq(9). Enable the base part of the hooks with the ALTQ option.
 # Individual disciplines must be built into the base system and can not be
 # loaded as modules at this point. ALTQ requires a stable TSC so if yours is
 # broken or changes with CPU throttling then you must also have the ALTQ_NOPCC
 # option.
 options 	ALTQ
 options 	ALTQ_CBQ	# Class Based Queueing
 options 	ALTQ_RED	# Random Early Detection
 options 	ALTQ_RIO	# RED In/Out
 options 	ALTQ_HFSC	# Hierarchical Packet Scheduler
 options 	ALTQ_CDNR	# Traffic conditioner
 options 	ALTQ_PRIQ	# Priority Queueing
 options 	ALTQ_NOPCC	# Required if the TSC is unusable
 options 	ALTQ_DEBUG
 
 # netgraph(4). Enable the base netgraph code with the NETGRAPH option.
 # Individual node types can be enabled with the corresponding option
 # listed below; however, this is not strictly necessary as netgraph
 # will automatically load the corresponding KLD module if the node type
 # is not already compiled into the kernel. Each type below has a
 # corresponding man page, e.g., ng_async(8).
 options 	NETGRAPH		# netgraph(4) system
 options 	NETGRAPH_DEBUG		# enable extra debugging, this
 					# affects netgraph(4) and nodes
 # Node types
 options 	NETGRAPH_ASYNC
 options 	NETGRAPH_ATMLLC
 options 	NETGRAPH_ATM_ATMPIF
 options 	NETGRAPH_BLUETOOTH		# ng_bluetooth(4)
 options 	NETGRAPH_BLUETOOTH_BT3C		# ng_bt3c(4)
 options 	NETGRAPH_BLUETOOTH_HCI		# ng_hci(4)
 options 	NETGRAPH_BLUETOOTH_L2CAP	# ng_l2cap(4)
 options 	NETGRAPH_BLUETOOTH_SOCKET	# ng_btsocket(4)
 options 	NETGRAPH_BLUETOOTH_UBT		# ng_ubt(4)
 options 	NETGRAPH_BLUETOOTH_UBTBCMFW	# ubtbcmfw(4)
 options 	NETGRAPH_BPF
 options 	NETGRAPH_BRIDGE
 options 	NETGRAPH_CAR
 options 	NETGRAPH_CISCO
 options 	NETGRAPH_DEFLATE
 options 	NETGRAPH_DEVICE
 options 	NETGRAPH_ECHO
 options 	NETGRAPH_EIFACE
 options 	NETGRAPH_ETHER
 options 	NETGRAPH_FEC
 options 	NETGRAPH_FRAME_RELAY
 options 	NETGRAPH_GIF
 options 	NETGRAPH_GIF_DEMUX
 options 	NETGRAPH_HOLE
 options 	NETGRAPH_IFACE
 options 	NETGRAPH_IP_INPUT
 options 	NETGRAPH_IPFW
 options 	NETGRAPH_KSOCKET
 options 	NETGRAPH_L2TP
 options 	NETGRAPH_LMI
 # MPPC compression requires proprietary files (not included)
 #options 	NETGRAPH_MPPC_COMPRESSION
 options 	NETGRAPH_MPPC_ENCRYPTION
 options 	NETGRAPH_NETFLOW
 options 	NETGRAPH_NAT
 options 	NETGRAPH_ONE2MANY
 options 	NETGRAPH_PATCH
 options 	NETGRAPH_PIPE
 options 	NETGRAPH_PPP
 options 	NETGRAPH_PPPOE
 options 	NETGRAPH_PPTPGRE
 options 	NETGRAPH_PRED1
 options 	NETGRAPH_RFC1490
 options 	NETGRAPH_SOCKET
 options 	NETGRAPH_SPLIT
 options 	NETGRAPH_SPPP
 options 	NETGRAPH_TAG
 options 	NETGRAPH_TCPMSS
 options 	NETGRAPH_TEE
 options 	NETGRAPH_UI
 options 	NETGRAPH_VJC
 options 	NETGRAPH_VLAN
 
 # NgATM - Netgraph ATM
 options 	NGATM_ATM
 options 	NGATM_ATMBASE
 options 	NGATM_SSCOP
 options 	NGATM_SSCFU
 options 	NGATM_UNI
 options 	NGATM_CCATM
 
 device		mn	# Munich32x/Falc54 Nx64kbit/sec cards.
 
 #
 # Network interfaces:
 #  The `loop' device is MANDATORY when networking is enabled.
 device		loop
 
 #  The `ether' device provides generic code to handle
 #  Ethernets; it is MANDATORY when an Ethernet device driver is
 #  configured or token-ring is enabled.
 device		ether
 
 #  The `vlan' device implements the VLAN tagging of Ethernet frames
 #  according to IEEE 802.1Q.
 device		vlan
 
 #  The `wlan' device provides generic code to support 802.11
 #  drivers, including host AP mode; it is MANDATORY for the wi,
 #  and ath drivers and will eventually be required by all 802.11 drivers.
 device		wlan
 options 	IEEE80211_DEBUG		#enable debugging msgs
 options 	IEEE80211_AMPDU_AGE	#age frames in AMPDU reorder q's
 options 	IEEE80211_SUPPORT_MESH	#enable 802.11s D3.0 support
 options 	IEEE80211_SUPPORT_TDMA	#enable TDMA support
 
 #  The `wlan_wep', `wlan_tkip', and `wlan_ccmp' devices provide
 #  support for WEP, TKIP, and AES-CCMP crypto protocols optionally
 #  used with 802.11 devices that depend on the `wlan' module.
 device		wlan_wep
 device		wlan_ccmp
 device		wlan_tkip
 
 #  The `wlan_xauth' device provides support for external (i.e. user-mode)
 #  authenticators for use with 802.11 drivers that use the `wlan'
 #  module and support 802.1x and/or WPA security protocols.
 device		wlan_xauth
 
 #  The `wlan_acl' device provides a MAC-based access control mechanism
 #  for use with 802.11 drivers operating in ap mode and using the
 #  `wlan' module.
 #  The 'wlan_amrr' device provides AMRR transmit rate control algorithm
 device		wlan_acl
 device		wlan_amrr
 
 # Generic TokenRing
 device		token
 
 #  The `fddi' device provides generic code to support FDDI.
 device		fddi
 
 #  The `arcnet' device provides generic code to support Arcnet.
 device		arcnet
 
 #  The `sppp' device serves a similar role for certain types
 #  of synchronous PPP links (like `cx', `ar').
 device		sppp
 
 #  The `bpf' device enables the Berkeley Packet Filter.  Be
 #  aware of the legal and administrative consequences of enabling this
 #  option.  DHCP requires bpf.
 device		bpf
 
 #  The `disc' device implements a minimal network interface,
 #  which throws away all packets sent and never receives any.  It is
 #  included for testing and benchmarking purposes.
 device		disc
 
 # The `epair' device implements a virtual back-to-back connected Ethernet
 # like interface pair.
 device		epair
 
 #  The `edsc' device implements a minimal Ethernet interface,
 #  which discards all packets sent and receives none.
 device		edsc
 
 #  The `tap' device is a pty-like virtual Ethernet interface
 device		tap
 
 #  The `tun' device implements (user-)ppp and nos-tun(8)
 device		tun
 
 #  The `gif' device implements IPv6 over IP4 tunneling,
 #  IPv4 over IPv6 tunneling, IPv4 over IPv4 tunneling and
 #  IPv6 over IPv6 tunneling.
 #  The `gre' device implements two types of IP4 over IP4 tunneling:
 #  GRE and MOBILE, as specified in the RFC1701 and RFC2004.
 #  The XBONEHACK option allows the same pair of addresses to be configured on
 #  multiple gif interfaces.
 device		gif
 device		gre
 options 	XBONEHACK
 
 #  The `faith' device captures packets sent to it and diverts them
 #  to the IPv4/IPv6 translation daemon.
 #  The `stf' device implements 6to4 encapsulation.
 device		faith
 device		stf
 
 #  The `ef' device provides support for multiple ethernet frame types
 #  specified via ETHER_* options. See ef(4) for details.
 device		ef
 options 	ETHER_II		# enable Ethernet_II frame
 options 	ETHER_8023		# enable Ethernet_802.3 (Novell) frame
 options 	ETHER_8022		# enable Ethernet_802.2 frame
 options 	ETHER_SNAP		# enable Ethernet_802.2/SNAP frame
 
 # The pf packet filter consists of three devices:
 #  The `pf' device provides /dev/pf and the firewall code itself.
 #  The `pflog' device provides the pflog0 interface which logs packets.
 #  The `pfsync' device provides the pfsync0 interface used for
 #   synchronization of firewall state tables (over the net).
 device		pf
 device		pflog
 device		pfsync
 
 # Bridge interface.
 device		if_bridge
 
 # Common Address Redundancy Protocol. See carp(4) for more details.
 device		carp
 
 # IPsec interface.
 device		enc
 
 # Link aggregation interface.
 device		lagg
 
 #
 # Internet family options:
 #
 # MROUTING enables the kernel multicast packet forwarder, which works
 # with mrouted and XORP.
 #
 # IPFIREWALL enables support for IP firewall construction, in
 # conjunction with the `ipfw' program.  IPFIREWALL_VERBOSE sends
 # logged packets to the system logger.  IPFIREWALL_VERBOSE_LIMIT
 # limits the number of times a matching entry can be logged.
 #
 # WARNING:  IPFIREWALL defaults to a policy of "deny ip from any to any"
 # and if you do not add other rules during startup to allow access,
 # YOU WILL LOCK YOURSELF OUT.  It is suggested that you set firewall_type=open
 # in /etc/rc.conf when first enabling this feature, then refining the
 # firewall rules in /etc/rc.firewall after you've tested that the new kernel
 # feature works properly.
 #
 # IPFIREWALL_DEFAULT_TO_ACCEPT causes the default rule (at boot) to
 # allow everything.  Use with care, if a cracker can crash your
 # firewall machine, they can get to your protected machines.  However,
 # if you are using it as an as-needed filter for specific problems as
 # they arise, then this may be for you.  Changing the default to 'allow'
 # means that you won't get stuck if the kernel and /sbin/ipfw binary get
 # out of sync.
 #
 # IPDIVERT enables the divert IP sockets, used by ``ipfw divert''.  It
 # depends on IPFIREWALL if compiled into the kernel.
 #
 # IPFIREWALL_FORWARD enables changing of the packet destination either
 # to do some sort of policy routing or transparent proxying.  Used by
 # ``ipfw forward''. All  redirections apply to locally generated
 # packets too.  Because of this great care is required when
 # crafting the ruleset.
 #
 # IPFIREWALL_NAT adds support for in kernel nat in ipfw, and it requires
 # LIBALIAS.
 #
 # IPSTEALTH enables code to support stealth forwarding (i.e., forwarding
 # packets without touching the TTL).  This can be useful to hide firewalls
 # from traceroute and similar tools.
 #
 # TCPDEBUG enables code which keeps traces of the TCP state machine
 # for sockets with the SO_DEBUG option set, which can then be examined
 # using the trpt(8) utility.
 #
 options 	MROUTING		# Multicast routing
 options 	IPFIREWALL		#firewall
 options 	IPFIREWALL_VERBOSE	#enable logging to syslogd(8)
 options 	IPFIREWALL_VERBOSE_LIMIT=100	#limit verbosity
 options 	IPFIREWALL_DEFAULT_TO_ACCEPT	#allow everything by default
 options 	IPFIREWALL_FORWARD	#packet destination changes
 options 	IPFIREWALL_NAT		#ipfw kernel nat support
 options 	IPDIVERT		#divert sockets
 options 	IPFILTER		#ipfilter support
 options 	IPFILTER_LOG		#ipfilter logging
 options 	IPFILTER_LOOKUP		#ipfilter pools
 options 	IPFILTER_DEFAULT_BLOCK	#block all packets by default
 options 	IPSTEALTH		#support for stealth forwarding
 options 	TCPDEBUG
 
 # The MBUF_STRESS_TEST option enables options which create
 # various random failures / extreme cases related to mbuf
 # functions.  See mbuf(9) for a list of available test cases.
 # MBUF_PROFILING enables code to profile the mbuf chains
 # exiting the system (via participating interfaces) and
 # return a logarithmic histogram of monitored parameters
 # (e.g. packet size, wasted space, number of mbufs in chain).
 options 	MBUF_STRESS_TEST
 options 	MBUF_PROFILING
 
 # Statically link in accept filters
 options 	ACCEPT_FILTER_DATA
 options 	ACCEPT_FILTER_DNS
 options 	ACCEPT_FILTER_HTTP
 
 # TCP_SIGNATURE adds support for RFC 2385 (TCP-MD5) digests. These are
 # carried in TCP option 19. This option is commonly used to protect
 # TCP sessions (e.g. BGP) where IPSEC is not available nor desirable.
 # This is enabled on a per-socket basis using the TCP_MD5SIG socket option.
 # This requires the use of 'device crypto', 'options IPSEC'
 # or 'device cryptodev'.
 options 	TCP_SIGNATURE		#include support for RFC 2385
 
 # DUMMYNET enables the "dummynet" bandwidth limiter.  You need IPFIREWALL
 # as well.  See dummynet(4) and ipfw(8) for more info.  When you run
 # DUMMYNET it is advisable to also have at least "options HZ=1000" to achieve
 # a smooth scheduling of the traffic.
 options 	DUMMYNET
 
 # Zero copy sockets support.  This enables "zero copy" for sending and
 # receiving data via a socket.  The send side works for any type of NIC,
 # the receive side only works for NICs that support MTUs greater than the
 # page size of your architecture and that support header splitting.  See
 # zero_copy(9) for more details.
 options 	ZERO_COPY_SOCKETS
 
 #####################################################################
 # FILESYSTEM OPTIONS
 
 #
 # Only the root, /usr, and /tmp filesystems need be statically
 # compiled; everything else will be automatically loaded at mount
 # time.  (Exception: the UFS family--- FFS --- cannot
 # currently be demand-loaded.)  Some people still prefer to statically
 # compile other filesystems as well.
 #
 # NB: The PORTAL filesystem is known to be buggy, and WILL panic your
 # system if you attempt to do anything with it.  It is included here
 # as an incentive for some enterprising soul to sit down and fix it.
 # The UNION filesystem was known to be buggy in the past.  It is now
 # being actively maintained, although there are still some issues being
 # resolved.
 #
 
 # One of these is mandatory:
 options 	FFS			#Fast filesystem
 options 	NFSCLIENT		#Network File System client
 
 # The rest are optional:
 options 	CD9660			#ISO 9660 filesystem
 options 	FDESCFS			#File descriptor filesystem
 options 	HPFS			#OS/2 File system
 options 	MSDOSFS			#MS DOS File System (FAT, FAT32)
 options 	NFSSERVER		#Network File System server
 options 	NFSLOCKD		#Network Lock Manager
 options 	NFSCL			#experimental NFS client with NFSv4
 options 	NFSD			#experimental NFS server with NFSv4
 options 	KGSSAPI			#Kernel GSSAPI implementation
 
 # NT File System. Read-mostly, see mount_ntfs(8) for details.
 # For a full read-write NTFS support consider sysutils/fusefs-ntfs
 # port/package.
 options 	NTFS
 
 options 	NULLFS			#NULL filesystem
 # Broken (depends on NCP):
 #options 	NWFS			#NetWare filesystem
 options 	PORTALFS		#Portal filesystem
 options 	PROCFS			#Process filesystem (requires PSEUDOFS)
 options 	PSEUDOFS		#Pseudo-filesystem framework
 options 	PSEUDOFS_TRACE		#Debugging support for PSEUDOFS
 options 	SMBFS			#SMB/CIFS filesystem
 options 	TMPFS			#Efficient memory filesystem
 options 	UDF			#Universal Disk Format
 options 	UNIONFS			#Union filesystem
 # The xFS_ROOT options REQUIRE the associated ``options xFS''
 options 	NFS_ROOT		#NFS usable as root device
 
 # Soft updates is a technique for improving filesystem speed and
 # making abrupt shutdown less risky.
 #
 options 	SOFTUPDATES
 
 # Extended attributes allow additional data to be associated with files,
 # and is used for ACLs, Capabilities, and MAC labels.
 # See src/sys/ufs/ufs/README.extattr for more information.
 options 	UFS_EXTATTR
 options 	UFS_EXTATTR_AUTOSTART
 
 # Access Control List support for UFS filesystems.  The current ACL
 # implementation requires extended attribute support, UFS_EXTATTR,
 # for the underlying filesystem.
 # See src/sys/ufs/ufs/README.acls for more information.
 options 	UFS_ACL
 
 # Directory hashing improves the speed of operations on very large
 # directories at the expense of some memory.
 options 	UFS_DIRHASH
 
 # Gjournal-based UFS journaling support.
 options 	UFS_GJOURNAL
 
 # Make space in the kernel for a root filesystem on a md device.
 # Define to the number of kilobytes to reserve for the filesystem.
 options 	MD_ROOT_SIZE=10
 
 # Make the md device a potential root device, either with preloaded
 # images of type mfs_root or md_root.
 options 	MD_ROOT
 
 # Disk quotas are supported when this option is enabled.
 options 	QUOTA			#enable disk quotas
 
 # If you are running a machine just as a fileserver for PC and MAC
 # users, using SAMBA or Netatalk, you may consider setting this option
 # and keeping all those users' directories on a filesystem that is
 # mounted with the suiddir option. This gives new files the same
 # ownership as the directory (similar to group). It's a security hole
 # if you let these users run programs, so confine it to file-servers
 # (but it'll save you lots of headaches in those cases). Root owned
 # directories are exempt and X bits are cleared. The suid bit must be
 # set on the directory as well; see chmod(1). PC owners can't see/set
 # ownerships so they keep getting their toes trodden on. This saves
 # you all the support calls as the filesystem it's used on will act as
 # they expect: "It's my dir so it must be my file".
 #
 options 	SUIDDIR
 
 # NFS options:
 options 	NFS_MINATTRTIMO=3	# VREG attrib cache timeout in sec
 options 	NFS_MAXATTRTIMO=60
 options 	NFS_MINDIRATTRTIMO=30	# VDIR attrib cache timeout in sec
 options 	NFS_MAXDIRATTRTIMO=60
 options 	NFS_GATHERDELAY=10	# Default write gather delay (msec)
 options 	NFS_WDELAYHASHSIZ=16	# and with this
 options 	NFS_DEBUG		# Enable NFS Debugging
 
 # Coda stuff:
 options 	CODA			#CODA filesystem.
 device		vcoda			#coda minicache <-> venus comm.
 # Use the old Coda 5.x venus<->kernel interface instead of the new
 # realms-aware 6.x protocol.
 #options 	CODA_COMPAT_5
 
 #
 # Add support for the EXT2FS filesystem of Linux fame.  Be a bit
 # careful with this - the ext2fs code has a tendency to lag behind
 # changes and not be exercised very much, so mounting read/write could
 # be dangerous (and even mounting read only could result in panics.)
 #
 options 	EXT2FS
 
 #
 # Add support for the ReiserFS filesystem (used in Linux). Currently,
 # this is limited to read-only access.
 #
 options 	REISERFS
 
 #
 # Add support for the SGI XFS filesystem. Currently,
 # this is limited to read-only access.
 #
 options 	XFS
 
 # Use real implementations of the aio_* system calls.  There are numerous
 # stability and security issues in the current aio code that make it
 # unsuitable for inclusion on machines with untrusted local users.
 options 	VFS_AIO
 
 # Cryptographically secure random number generator; /dev/random
 device		random
 
 # The system memory devices; /dev/mem, /dev/kmem
 device		mem
 
 # The kernel symbol table device; /dev/ksyms
 device		ksyms
 
 # Optional character code conversion support with LIBICONV.
 # Each option requires their base file system and LIBICONV.
 options 	CD9660_ICONV
 options 	MSDOSFS_ICONV
 options 	NTFS_ICONV
 options 	UDF_ICONV
 
 
 #####################################################################
 # POSIX P1003.1B
 
 # Real time extensions added in the 1993 POSIX
 # _KPOSIX_PRIORITY_SCHEDULING: Build in _POSIX_PRIORITY_SCHEDULING
 
 options 	_KPOSIX_PRIORITY_SCHEDULING
 # p1003_1b_semaphores are very experimental,
 # user should be ready to assist in debugging if problems arise.
 options 	P1003_1B_SEMAPHORES
 
 # POSIX message queue
 options 	P1003_1B_MQUEUE
 
 #####################################################################
 # SECURITY POLICY PARAMETERS
 
 # Support for BSM audit
 options 	AUDIT
 
 # Support for Mandatory Access Control (MAC):
 options 	MAC
 options 	MAC_BIBA
 options 	MAC_BSDEXTENDED
 options 	MAC_IFOFF
 options 	MAC_LOMAC
 options 	MAC_MLS
 options 	MAC_NONE
 options 	MAC_PARTITION
 options 	MAC_PORTACL
 options 	MAC_SEEOTHERUIDS
 options 	MAC_STUB
 options 	MAC_TEST
 
 # Support for Capsicum
 options 	CAPABILITIES
 
 
 #####################################################################
 # CLOCK OPTIONS
 
 # The granularity of operation is controlled by the kernel option HZ whose
 # default value (1000 on most architectures) means a granularity of 1ms
 # (1s/HZ).  Historically, the default was 100, but finer granularity is
 # required for DUMMYNET and other systems on modern hardware.  There are
 # reasonable arguments that HZ should, in fact, be 100 still; consider,
 # that reducing the granularity too much might cause excessive overhead in
 # clock interrupt processing, potentially causing ticks to be missed and thus
 # actually reducing the accuracy of operation.
 
 options 	HZ=100
 
 # Enable support for the kernel PLL to use an external PPS signal,
 # under supervision of [x]ntpd(8)
 # More info in ntpd documentation: http://www.eecis.udel.edu/~ntp
 
 options 	PPS_SYNC
 
 
 #####################################################################
 # SCSI DEVICES
 
 # SCSI DEVICE CONFIGURATION
 
 # The SCSI subsystem consists of the `base' SCSI code, a number of
 # high-level SCSI device `type' drivers, and the low-level host-adapter
 # device drivers.  The host adapters are listed in the ISA and PCI
 # device configuration sections below.
 #
 # It is possible to wire down your SCSI devices so that a given bus,
 # target, and LUN always come on line as the same device unit.  In
 # earlier versions the unit numbers were assigned in the order that
 # the devices were probed on the SCSI bus.  This means that if you
 # removed a disk drive, you may have had to rewrite your /etc/fstab
 # file, and also that you had to be careful when adding a new disk
 # as it may have been probed earlier and moved your device configuration
 # around.  (See also option GEOM_VOL for a different solution to this
 # problem.)
 
 # This old behavior is maintained as the default behavior.  The unit
 # assignment begins with the first non-wired down unit for a device
 # type.  For example, if you wire a disk as "da3" then the first
 # non-wired disk will be assigned da4.
 
 # The syntax for wiring down devices is:
 
 hint.scbus.0.at="ahc0"
 hint.scbus.1.at="ahc1"
 hint.scbus.1.bus="0"
 hint.scbus.3.at="ahc2"
 hint.scbus.3.bus="0"
 hint.scbus.2.at="ahc2"
 hint.scbus.2.bus="1"
 hint.da.0.at="scbus0"
 hint.da.0.target="0"
 hint.da.0.unit="0"
 hint.da.1.at="scbus3"
 hint.da.1.target="1"
 hint.da.2.at="scbus2"
 hint.da.2.target="3"
 hint.sa.1.at="scbus1"
 hint.sa.1.target="6"
 
 # "units" (SCSI logical unit number) that are not specified are
 # treated as if specified as LUN 0.
 
 # All SCSI devices allocate as many units as are required.
 
 # The ch driver drives SCSI Media Changer ("jukebox") devices.
 #
 # The da driver drives SCSI Direct Access ("disk") and Optical Media
 # ("WORM") devices.
 #
 # The sa driver drives SCSI Sequential Access ("tape") devices.
 #
 # The cd driver drives SCSI Read Only Direct Access ("cd") devices.
 #
 # The ses driver drives SCSI Environment Services ("ses") and
 # SAF-TE ("SCSI Accessible Fault-Tolerant Enclosure") devices.
 #
 # The pt driver drives SCSI Processor devices.
 #
 # The sg driver provides a passthrough API that is compatible with the
 # Linux SG driver.  It will work in conjunction with the COMPAT_LINUX
 # option to run linux SG apps.  It can also stand on its own and provide
 # source level API compatiblity for porting apps to FreeBSD.
 #
 # Target Mode support is provided here but also requires that a SIM
 # (SCSI Host Adapter Driver) provide support as well.
 #
 # The targ driver provides target mode support as a Processor type device.
 # It exists to give the minimal context necessary to respond to Inquiry
 # commands. There is a sample user application that shows how the rest
 # of the command support might be done in /usr/share/examples/scsi_target.
 #
 # The targbh driver provides target mode support and exists to respond
 # to incoming commands that do not otherwise have a logical unit assigned
 # to them.
 #
 # The "unknown" device (uk? in pre-2.0.5) is now part of the base SCSI
 # configuration as the "pass" driver.
 
 device		scbus		#base SCSI code
 device		ch		#SCSI media changers
 device		da		#SCSI direct access devices (aka disks)
 device		sa		#SCSI tapes
 device		cd		#SCSI CD-ROMs
 device		ses		#SCSI Environmental Services (and SAF-TE)
 device		pt		#SCSI processor
 device		targ		#SCSI Target Mode Code
 device		targbh		#SCSI Target Mode Blackhole Device
 device		pass		#CAM passthrough driver
 device		sg		#Linux SCSI passthrough
 
 # CAM OPTIONS:
 # debugging options:
 # -- NOTE --  If you specify one of the bus/target/lun options, you must
 #             specify them all!
 # CAMDEBUG: When defined enables debugging macros
 # CAM_DEBUG_BUS:  Debug the given bus.  Use -1 to debug all busses.
 # CAM_DEBUG_TARGET:  Debug the given target.  Use -1 to debug all targets.
 # CAM_DEBUG_LUN:  Debug the given lun.  Use -1 to debug all luns.
 # CAM_DEBUG_FLAGS:  OR together CAM_DEBUG_INFO, CAM_DEBUG_TRACE,
 #                   CAM_DEBUG_SUBTRACE, and CAM_DEBUG_CDB
 #
 # CAM_MAX_HIGHPOWER: Maximum number of concurrent high power (start unit) cmds
 # SCSI_NO_SENSE_STRINGS: When defined disables sense descriptions
 # SCSI_NO_OP_STRINGS: When defined disables opcode descriptions
 # SCSI_DELAY: The number of MILLISECONDS to freeze the SIM (scsi adapter)
 #             queue after a bus reset, and the number of milliseconds to
 #             freeze the device queue after a bus device reset.  This
 #             can be changed at boot and runtime with the
 #             kern.cam.scsi_delay tunable/sysctl.
 options 	CAMDEBUG
 options 	CAM_DEBUG_BUS=-1
 options 	CAM_DEBUG_TARGET=-1
 options 	CAM_DEBUG_LUN=-1
 options 	CAM_DEBUG_FLAGS=(CAM_DEBUG_INFO|CAM_DEBUG_TRACE|CAM_DEBUG_CDB)
 options 	CAM_MAX_HIGHPOWER=4
 options 	SCSI_NO_SENSE_STRINGS
 options 	SCSI_NO_OP_STRINGS
 options 	SCSI_DELAY=5000	# Be pessimistic about Joe SCSI device
 
 # Options for the CAM CDROM driver:
 # CHANGER_MIN_BUSY_SECONDS: Guaranteed minimum time quantum for a changer LUN
 # CHANGER_MAX_BUSY_SECONDS: Maximum time quantum per changer LUN, only
 #                           enforced if there is I/O waiting for another LUN
 # The compiled in defaults for these variables are 2 and 10 seconds,
 # respectively.
 #
 # These can also be changed on the fly with the following sysctl variables:
 # kern.cam.cd.changer.min_busy_seconds
 # kern.cam.cd.changer.max_busy_seconds
 #
 options 	CHANGER_MIN_BUSY_SECONDS=2
 options 	CHANGER_MAX_BUSY_SECONDS=10
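 #
 # A sketch of adjusting the changer quanta on a running system through the
 # sysctl variables named above.  The values 5 and 20 are illustrative only:
 #	sysctl kern.cam.cd.changer.min_busy_seconds=5
 #	sysctl kern.cam.cd.changer.max_busy_seconds=20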
 
 # Options for the CAM sequential access driver:
 # SA_IO_TIMEOUT: Timeout for read/write/wfm (write filemark) operations,
 #                in minutes
 # SA_SPACE_TIMEOUT: Timeout for space operations, in minutes
 # SA_REWIND_TIMEOUT: Timeout for rewind operations, in minutes
 # SA_ERASE_TIMEOUT: Timeout for erase operations, in minutes
 # SA_1FM_AT_EOD: Default to the model that writes only one filemark at EOT.
 options 	SA_IO_TIMEOUT=4
 options 	SA_SPACE_TIMEOUT=60
 options 	SA_REWIND_TIMEOUT=(2*60)
 options 	SA_ERASE_TIMEOUT=(4*60)
 options 	SA_1FM_AT_EOD
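 #
 # Because these timeouts are expressed in minutes, the expressions above
 # work out as follows:
 #	SA_REWIND_TIMEOUT=(2*60)	-> 120 minutes (2 hours)
 #	SA_ERASE_TIMEOUT=(4*60)		-> 240 minutes (4 hours)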
 
 # Optional timeout for the CAM processor target (pt) device
 # This is specified in seconds.  The default is 60 seconds.
 options 	SCSI_PT_DEFAULT_TIMEOUT=60
 
 # Optional enable of doing SES passthrough on other devices (e.g., disks)
 #
 # Normally disabled because a lot of newer SCSI disks report themselves
 # as having SES capabilities, but this can then clot up attempts to build
 # a topology with the SES device that's on the box these drives are in....
 options 	SES_ENABLE_PASSTHROUGH
 
 
 #####################################################################
 # MISCELLANEOUS DEVICES AND OPTIONS
 
 device		pty		#BSD-style compatibility pseudo ttys
 device		nmdm		#back-to-back tty devices
 device		md		#Memory/malloc disk
 device		snp		#Snoop device - to look at pty/vty/etc..
 device		ccd		#Concatenated disk driver
 device		firmware	#firmware(9) support
 
 # Kernel side iconv library
 options 	LIBICONV
 
 # Size of the kernel message buffer.  Should be N * pagesize.
 options 	MSGBUF_SIZE=40960
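 #
 # As a worked example, assuming a 4096-byte page (the page size is
 # platform dependent and is an assumption here), the value above is
 # 10 * 4096 = 40960 bytes, i.e. ten pages.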
 
 
 #####################################################################
 # HARDWARE DEVICE CONFIGURATION
 
 # For ISA the required hints are listed.
 # EISA, MCA, PCI, CardBus, SD/MMC and pccard are self identifying buses, so
 # no hints are needed.
 
 #
 # Mandatory devices:
 #
 
 # These options are valid for other keyboard drivers as well.
 options 	KBD_DISABLE_KEYMAP_LOAD	# refuse to load a keymap
 options 	KBD_INSTALL_CDEV	# install a CDEV entry in /dev
 
 options 	FB_DEBUG		# Frame buffer debugging
 
 device		splash			# Splash screen and screen saver support
 
 # Various screen savers.
 device		blank_saver
 device		daemon_saver
 device		dragon_saver
 device		fade_saver
 device		fire_saver
 device		green_saver
 device		logo_saver
 device		rain_saver
 device		snake_saver
 device		star_saver
 device		warp_saver
 
 # The syscons console driver (SCO color console compatible).
 device		sc
 hint.sc.0.at="isa"
 options 	MAXCONS=16		# number of virtual consoles
 options 	SC_ALT_MOUSE_IMAGE	# simplified mouse cursor in text mode
 options 	SC_DFLT_FONT		# compile font in
 makeoptions	SC_DFLT_FONT=cp850
 options 	SC_DISABLE_KDBKEY	# disable `debug' key
 options 	SC_DISABLE_REBOOT	# disable reboot key sequence
 options 	SC_HISTORY_SIZE=200	# number of history buffer lines
 options 	SC_MOUSE_CHAR=0x3	# char code for text mode mouse cursor
 options 	SC_PIXEL_MODE		# add support for the raster text mode
 
 # The following options will let you change the default colors of syscons.
 options 	SC_NORM_ATTR=(FG_GREEN|BG_BLACK)
 options 	SC_NORM_REV_ATTR=(FG_YELLOW|BG_GREEN)
 options 	SC_KERNEL_CONS_ATTR=(FG_RED|BG_BLACK)
 options 	SC_KERNEL_CONS_REV_ATTR=(FG_BLACK|BG_RED)
 
 # The following options will let you change the default behaviour of the
 # cut-n-paste feature.
 options 	SC_CUT_SPACES2TABS	# convert leading spaces into tabs
 options 	SC_CUT_SEPCHARS=\"x09\"	# set of characters that delimit words
 					# (default is single space - \"x20\")
 
 # If you have a two button mouse, you may want to add the following option
 # to use the right button of the mouse to paste text.
 options 	SC_TWOBUTTON_MOUSE
 
 # You can selectively disable features in syscons.
 options 	SC_NO_CUTPASTE
 options 	SC_NO_FONT_LOADING
 options 	SC_NO_HISTORY
 options 	SC_NO_MODE_CHANGE
 options 	SC_NO_SYSMOUSE
 options 	SC_NO_SUSPEND_VTYSWITCH
 
 # `flags' for sc
 #	0x80	Put the video card in the VESA 800x600 dots, 16 color mode
 #	0x100	Probe for a keyboard device periodically if one is not present
 
 # Enable experimental features of the syscons terminal emulator (teken).
 options 	TEKEN_CONS25		# cons25-style terminal emulation
 options 	TEKEN_UTF8		# UTF-8 output handling
 
 #
 # Optional devices:
 #
 
 #
 # SCSI host adapters:
 #
 # adv: All Narrow SCSI bus AdvanSys controllers.
 # adw: Second Generation AdvanSys controllers including the ADV940UW.
 # aha: Adaptec 154x/1535/1640
 # ahb: Adaptec 174x EISA controllers
 # ahc: Adaptec 274x/284x/2910/293x/294x/394x/3950x/3960x/398X/4944/
 #      19160x/29160x, aic7770/aic78xx
 # ahd: Adaptec 29320/39320 Controllers.
 # aic: Adaptec 6260/6360, APA-1460 (PC Card), NEC PC9801-100 (C-BUS)
 # amd: Support for the AMD 53C974 SCSI host adapter chip as found on devices
 #      such as the Tekram DC-390(T).
 # bt:  Most Buslogic controllers: including BT-445, BT-54x, BT-64x, BT-74x,
 #      BT-75x, BT-946, BT-948, BT-956, BT-958, SDC3211B, SDC3211F, SDC3222F
 # esp: NCR53c9x.  Only for SBUS hardware right now.
 # isp: Qlogic ISP 1020, 1040 and 1040B PCI SCSI host adapters,
 #      ISP 1240 Dual Ultra SCSI, ISP 1080 and 1280 (Dual) Ultra2,
 #      ISP 12160 Ultra3 SCSI,
 #      Qlogic ISP 2100 and ISP 2200 1Gb Fibre Channel host adapters.
 #      Qlogic ISP 2300 and ISP 2312 2Gb Fibre Channel host adapters.
 #      Qlogic ISP 2322 and ISP 6322 2Gb Fibre Channel host adapters.
 # ispfw: Firmware module for Qlogic host adapters
 # mpt: LSI-Logic MPT/Fusion 53c1020 or 53c1030 Ultra4
 #      or FC9x9 Fibre Channel host adapters.
 # ncr: NCR 53C810, 53C825 self-contained SCSI host adapters.
 # sym: Symbios/Logic 53C8XX family of PCI-SCSI I/O processors:
 #      53C810, 53C810A, 53C815, 53C825,  53C825A, 53C860, 53C875,
 #      53C876, 53C885,  53C895, 53C895A, 53C896,  53C897, 53C1510D,
 #      53C1010-33, 53C1010-66.
 # trm: Tekram DC395U/UW/F DC315U adapters.
 # wds: WD7000
 
 #
 # Note that the order of the entries below matters for Buslogic ISA/EISA
 # cards to be probed correctly.
 #
 device		bt
 hint.bt.0.at="isa"
 hint.bt.0.port="0x330"
 device		adv
 hint.adv.0.at="isa"
 device		adw
 device		aha
 hint.aha.0.at="isa"
 device		aic
 hint.aic.0.at="isa"
 device		ahb
 device		ahc
 device		ahd
 device		amd
 device		esp
 device		iscsi_initiator
 device		isp
 hint.isp.0.disable="1"
 hint.isp.0.role="3"
 hint.isp.0.prefer_iomap="1"
 hint.isp.0.prefer_memmap="1"
 hint.isp.0.fwload_disable="1"
 hint.isp.0.ignore_nvram="1"
 hint.isp.0.fullduplex="1"
 hint.isp.0.topology="lport"
 hint.isp.0.topology="nport"
 hint.isp.0.topology="lport-only"
 hint.isp.0.topology="nport-only"
 # Hints cannot carry u_int64_t values, nor strings that begin with 0x,
 # hence the leading 'w' in the WWN hints below.
 hint.isp.0.portwnn="w50000000aaaa0000"
 hint.isp.0.nodewnn="w50000000aaaa0001"
 device		ispfw
 device		mpt
 device		ncr
 device		sym
 device		trm
 device		wds
 hint.wds.0.at="isa"
 hint.wds.0.port="0x350"
 hint.wds.0.irq="11"
 hint.wds.0.drq="6"
 
 # The aic7xxx driver will attempt to use memory mapped I/O for all PCI
 # controllers that have it configured only if this option is set. Unfortunately,
 # this doesn't work on some motherboards, which prevents it from being the
 # default.
 options 	AHC_ALLOW_MEMIO
 
 # Dump the contents of the ahc controller configuration PROM.
 options 	AHC_DUMP_EEPROM
 
 # Bitmap of units to enable targetmode operations.
 options 	AHC_TMODE_ENABLE
 
 # Compile in Aic7xxx Debugging code.
 options 	AHC_DEBUG
 
 # Aic7xxx driver debugging options. See sys/dev/aic7xxx/aic7xxx.h
 options 	AHC_DEBUG_OPTS
 
 # Print register bitfields in debug output.  Adds ~128k to the driver.
 # See ahc(4).
 options 	AHC_REG_PRETTY_PRINT
 
 # Compile in aic79xx debugging code.
 options 	AHD_DEBUG
 
 # Aic79xx driver debugging options.  Adds ~215k to driver.  See ahd(4).
 options 	AHD_DEBUG_OPTS=0xFFFFFFFF
 
 # Print human-readable register definitions when debugging
 options 	AHD_REG_PRETTY_PRINT
 
 # Bitmap of units to enable targetmode operations.
 options 	AHD_TMODE_ENABLE
 
 # The adw driver will attempt to use memory mapped I/O for all PCI
 # controllers that have it configured only if this option is set.
 options 	ADW_ALLOW_MEMIO
 
 # Options used in dev/iscsi (Software iSCSI stack)
 #
 options 	ISCSI_INITIATOR_DEBUG=9
 
 # Options used in dev/isp/ (Qlogic SCSI/FC driver).
 #
 #	ISP_TARGET_MODE		-	enable target mode operation
 #
 options 	ISP_TARGET_MODE=1
 #
 #	ISP_DEFAULT_ROLES	-	default role
 #		none=0
 #		target=1
 #		initiator=2
 #		both=3			(not supported currently)
 #
 #	ISP_INTERNAL_TARGET		(trivial internal disk target, for testing)
 #
 options 	ISP_DEFAULT_ROLES=2
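 #
 # As a sketch, assuming the per-unit "role" hint shown with the isp device
 # above uses the same encoding as ISP_DEFAULT_ROLES (this is an assumption,
 # not something stated here), an initiator-only unit 0 could be requested
 # with:
 #hint.isp.0.role="2"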
 
 # Options used in dev/sym/ (Symbios SCSI driver).
 #options 	SYM_SETUP_LP_PROBE_MAP	#-Low Priority Probe Map (bits)
 					# Allows the ncr to take precedence
 					# 1 (1<<0) -> 810a, 860
 					# 2 (1<<1) -> 825a, 875, 885, 895
 					# 4 (1<<2) -> 895a, 896, 1510d
 #options 	SYM_SETUP_SCSI_DIFF	#-HVD support for 825a, 875, 885
 					# disabled:0 (default), enabled:1
 #options 	SYM_SETUP_PCI_PARITY	#-PCI parity checking
 					# disabled:0, enabled:1 (default)
 #options 	SYM_SETUP_MAX_LUN	#-Number of LUNs supported
 					# default:8, range:[1..64]
 
 # The 'dpt' driver provides support for old DPT controllers (http://www.dpt.com/).
 # These have hardware RAID-{0,1,5} support, and do multi-initiator I/O.
 # The DPT controllers are commonly re-licensed under other brand-names -
 # some controllers by Olivetti, Dec, HP, AT&T, SNI, AST, Alphatronic, NEC and
 # Compaq are actually DPT controllers.
 #
 # See src/sys/dev/dpt for debugging and other subtle options.
 #   DPT_MEASURE_PERFORMANCE Enables a set of (semi)invasive metrics. Various
 #                           instruments are enabled.  The tools in
 #                           /usr/sbin/dpt_* assume these to be enabled.
 #   DPT_HANDLE_TIMEOUTS     Normally device timeouts are handled by the DPT.
 #                           If you want the driver to handle timeouts, enable
 #                           this option.  If your system is very busy, this
 #                           option will create more trouble than it solves.
 #   DPT_TIMEOUT_FACTOR      Used to compute the excessive amount of time to
 #                           wait when timing out with the above option.
 #  DPT_DEBUG_xxxx           These are controllable from sys/dev/dpt/dpt.h
 #  DPT_LOST_IRQ             When enabled, will try, once per second, to catch
 #                           any interrupt that got lost.  Seems to help in some
 #                           DPT-firmware/Motherboard combinations.  Minimal
 #                           cost, great benefit.
 #  DPT_RESET_HBA            Make "reset" actually reset the controller
 #                           instead of fudging it.  Only enable this if you
 #			    are 100% certain you need it.
 
 device		dpt
 
 # DPT options
 #!CAM# options 	DPT_MEASURE_PERFORMANCE
 #!CAM# options 	DPT_HANDLE_TIMEOUTS
 options 	DPT_TIMEOUT_FACTOR=4
 options 	DPT_LOST_IRQ
 options 	DPT_RESET_HBA
 
 #
 # Compaq "CISS" RAID controllers (SmartRAID 5* series)
 # These controllers have a SCSI-like interface, and require the
 # CAM infrastructure.
 #
 device		ciss
 
 #
 # Intel Integrated RAID controllers.
 # This driver was developed and is maintained by Intel.  Contacts
 # at Intel for this driver are
 # "Kannanthanam, Boji T" <boji.t.kannanthanam@intel.com> and
 # "Leubner, Achim" <achim.leubner@intel.com>.
 #
 device		iir
 
 #
 # Mylex AcceleRAID and eXtremeRAID controllers with v6 and later
 # firmware.  These controllers have a SCSI-like interface, and require
 # the CAM infrastructure.
 #
 device		mly
 
 #
 # Compaq Smart RAID, Mylex DAC960 and AMI MegaRAID controllers.  Only
 # one entry is needed; the code will find and configure all supported
 # controllers.
 #
 device		ida		# Compaq Smart RAID
 device		mlx		# Mylex DAC960
 device		amr		# AMI MegaRAID
 device		amrp		# SCSI Passthrough interface (optional, CAM req.)
 device		mfi		# LSI MegaRAID SAS
 device		mfip		# LSI MegaRAID SAS passthrough, requires CAM
 options 	MFI_DEBUG
 
 #
 # 3ware ATA RAID
 #
 device		twe		# 3ware ATA RAID
 
 #
 # Serial ATA host controllers:
 #
 # ahci: Advanced Host Controller Interface (AHCI) compatible
 # mvs:  Marvell 88SX50XX/88SX60XX/88SX70XX/SoC controllers
 # siis: SiliconImage SiI3124/SiI3132/SiI3531 controllers
 #
 # These drivers are part of the cam(4) subsystem.  They supersede the less
 # featureful ata(4) subsystem drivers that support the same hardware.
 
 device		ahci
 device		mvs
 device		siis
 
 #
 # The 'ATA' driver supports all ATA and ATAPI devices, including PC Card
 # devices. You only need one "device ata" for it to find all
 # PCI and PC Card ATA/ATAPI devices on modern machines.
 # Alternatively, individual bus and chipset drivers may be chosen by using
 # the 'atacore' driver then selecting the drivers on a per vendor basis.
 # For example, to build a system which only supports a VIA chipset,
 # omit 'ata' and include the 'atacore', 'atapci' and 'atavia' drivers.
 device		ata
 device		atadisk		# ATA disk drives
 device		ataraid		# ATA RAID drives
 device		atapicd		# ATAPI CDROM drives
 device		atapifd		# ATAPI floppy drives
 device		atapist		# ATAPI tape drives
 device		atapicam	# emulate ATAPI devices as SCSI ditto via CAM
 				# needs CAM to be present (scbus & pass)
 
 # Modular ATA
 #device		atacore		# Core ATA functionality
 #device		atacard		# CARDBUS support
 #device		atabus		# PC98 cbus support
 #device		ataisa		# ISA bus support
 #device		atapci		# PCI bus support; only generic chipset support
 
 # PCI ATA chipsets
 #device		ataahci		# AHCI SATA
 #device		ataacard	# ACARD
 #device		ataacerlabs	# Acer Labs Inc. (ALI)
 #device		ataadaptec	# Adaptec
 #device		ataamd		# American Micro Devices (AMD)
 #device		ataati		# ATI
 #device		atacenatek	# Cenatek
 #device		atacypress	# Cypress
 #device		atacyrix	# Cyrix
 #device		atahighpoint	# HighPoint
 #device		ataintel	# Intel
 #device		ataite		# Integrated Technology Inc. (ITE)
 #device		atajmicron	# JMicron
 #device		atamarvell	# Marvell
 #device		atamicron	# Micron
 #device		atanational	# National
 #device		atanetcell	# NetCell
 #device		atanvidia	# nVidia
 #device		atapromise	# Promise
 #device		ataserverworks	# ServerWorks
 #device		atasiliconimage	# Silicon Image Inc. (SiI) (formerly CMD)
 #device		atasis		# Silicon Integrated Systems Corp.(SiS)
 #device		atavia		# VIA Technologies Inc.
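 #
 # The VIA-only configuration described above would then consist of the
 # following lines, with 'device ata' left out.  This is a sketch; the
 # peripheral drivers (atadisk and so on) are chosen separately:
 #device		atacore
 #device		atapci
 #device		atavia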
 
 #
 # For older non-PCI, non-PnPBIOS systems, these are the hints lines to add:
 hint.ata.0.at="isa"
 hint.ata.0.port="0x1f0"
 hint.ata.0.irq="14"
 hint.ata.1.at="isa"
 hint.ata.1.port="0x170"
 hint.ata.1.irq="15"
 
 #
 # The following options are valid on the ATA driver:
 #
 # ATA_STATIC_ID:	controller numbering is static, i.e. it depends on
 #			location; otherwise the device numbers are
 #			dynamically allocated.
 # ATA_REQUEST_TIMEOUT:	the number of seconds to wait for an ATA request
 #			before timing out.
 # ATA_CAM:		Turn ata(4) subsystem controller drivers into cam(4)
 #			interface modules. This deprecates all ata(4)
 #			peripheral device drivers (atadisk, ataraid, atapicd,
 #			atapifd, atapist, atapicam) and all user-level APIs.
 #			cam(4) drivers and APIs will be connected instead.
 
 options 	ATA_STATIC_ID
 #options 	ATA_REQUEST_TIMEOUT=10
 options 	ATA_CAM
 
 #
 # Standard floppy disk controllers and floppy tapes, supports
 # the Y-E DATA External FDD (PC Card)
 #
 device		fdc
 hint.fdc.0.at="isa"
 hint.fdc.0.port="0x3F0"
 hint.fdc.0.irq="6"
 hint.fdc.0.drq="2"
 #
 # FDC_DEBUG enables floppy debugging.  Since the debug output is huge, you
 # must also turn it on explicitly by setting the variable fd_debug with
 # DDB.
 options 	FDC_DEBUG
 #
 # Activate this line if you happen to have an Insight floppy tape.
 # Probing them proved to be dangerous for people with floppy disks only,
 # so it's "hidden" behind a flag:
 #hint.fdc.0.flags="1"
 
 # Specify floppy devices
 hint.fd.0.at="fdc0"
 hint.fd.0.drive="0"
 hint.fd.1.at="fdc0"
 hint.fd.1.drive="1"
 
 #
 # uart: newbusified driver for serial interfaces.  It consolidates the sio(4),
 #	sab(4) and zs(4) drivers.
 #
 device		uart
 
 # Options for uart(4)
 options 	UART_PPS_ON_CTS		# Do time pulse capturing using CTS
 					# instead of DCD.
 
 # The following hint should only be used for pure ISA devices.  It is not
 # needed otherwise.  Use of hints is strongly discouraged.
 hint.uart.0.at="isa"
 
 # The following 3 hints are used when the UART is a system device (i.e., a
 # console or debug port), but only on platforms that don't have any other
 # means to pass the information to the kernel.  The unit number of the hint
 # is only used to bundle the hints together.  There is no relation to the
 # unit number of the probed UART.
 hint.uart.0.port="0x3f8"
 hint.uart.0.flags="0x10"
 hint.uart.0.baud="115200"
 
 # `flags' for serial drivers that support consoles like sio(4) and uart(4):
 #	0x10	enable console support for this unit.  Other console flags
 #		(if applicable) are ignored unless this is set.  Enabling
 #		console support does not make the unit the preferred console.
 #		Boot with -h or set boot_serial=YES in the loader.  For sio(4)
 #		specifically, the 0x20 flag can also be set (see above).
 #		Currently, at most one unit can have console support; the
 #		first one (in config file order) with this flag set is
 #		preferred.  Setting this flag for sio0 gives the old behaviour.
 #	0x80	use this port for serial line gdb support in ddb.  Also known
 #		as debug port.
 #
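 # For example, a unit flagged 0x10 as in the hints above could be made the
 # console by setting the loader variable mentioned above.  Placing it in
 # /boot/loader.conf is the conventional choice and is an assumption here:
 #	boot_serial="YES"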
 
 # Options for serial drivers that support consoles:
 options 	BREAK_TO_DEBUGGER	# A BREAK on a serial console goes to
 					# ddb, if available.
 
 # Solaris implements a new BREAK which is initiated by a character
 # sequence CR ~ ^b which is similar to a familiar pattern used on
 # Sun servers by the Remote Console.  There are FreeBSD extensions:
 # CR ~ ^p requests force panic and CR ~ ^r requests a clean reboot.
 options 	ALT_BREAK_TO_DEBUGGER
 
 # Serial Communications Controller
 # Supports the Siemens SAB 82532 and Zilog Z8530 multi-channel
 # communications controllers.
 device		scc
 
 # PCI Universal Communications driver
 # Supports various multi port PCI I/O cards.
 device		puc
 
 #
 # Network interfaces:
 #
 # MII bus support is required for many PCI Ethernet NICs,
 # namely those which use MII-compliant transceivers or implement
 # transceiver control interfaces that operate like an MII.  Adding
 # "device miibus" to the kernel config pulls in support for
 # the generic miibus API and all of the PHY drivers, including a
 # generic one for PHYs that aren't specifically handled by an
 # individual driver.  Support for specific PHYs may be built by adding
 # "device mii" then adding the appropriate PHY driver.
 device  	miibus		# MII support including all PHYs
 device  	mii		# Minimal MII support
 
 device  	acphy		# Altima Communications AC101
 device  	amphy		# AMD AM79c873 / Davicom DM910{1,2}
 device  	atphy		# Attansic/Atheros F1
 device  	axphy		# Asix Semiconductor AX88x9x
 device  	bmtphy		# Broadcom BCM5201/BCM5202 and 3Com 3c905C
 device  	brgphy		# Broadcom BCM54xx/57xx 1000baseTX
 device  	ciphy		# Cicada/Vitesse CS/VSC8xxx
 device  	e1000phy	# Marvell 88E1000 1000/100/10-BT
 device  	exphy		# 3Com internal PHY
 device  	gentbi		# Generic 10-bit 1000BASE-{LX,SX} fiber ifaces
 device  	icsphy		# ICS ICS1889-1893
 device  	inphy		# Intel 82553/82555
 device  	ip1000phy	# IC Plus IP1000A/IP1001
 device  	jmphy		# JMicron JMP211/JMP202
 device  	lxtphy		# Level One LXT-970
 device  	mlphy		# Micro Linear 6692
 device  	nsgphy		# NatSemi DP83861/DP83865/DP83891
 device  	nsphy		# NatSemi DP83840A
 device  	nsphyter	# NatSemi DP83843/DP83815
 device  	pnaphy		# HomePNA
 device  	qsphy		# Quality Semiconductor QS6612
 device  	rdcphy		# RDC Semiconductor R6040
 device  	rgephy		# RealTek 8169S/8110S/8211B/8211C
 device  	rlphy		# RealTek 8139
 device  	rlswitch	# RealTek 8305
 device  	ruephy		# RealTek RTL8150
 device  	smcphy		# SMSC LAN91C111
 device  	tdkphy		# TDK 89Q2120
 device  	tlphy		# Texas Instruments ThunderLAN
 device  	truephy		# LSI TruePHY
 device		xmphy		# XaQti XMAC II
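 #
 # As a sketch of the minimal approach described above, a kernel that only
 # needs the RealTek 8139 PHY (a hypothetical choice) would omit
 # 'device miibus' and list:
 #device  	mii
 #device  	rlphy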
 
 # an:   Aironet 4500/4800 802.11 wireless adapters. Supports the PCMCIA,
 #       PCI and ISA varieties.
 # ae:   Support for Fast Ethernet adapters based on the Attansic/Atheros
 #       L2 PCI Express controllers.
 # age:  Support for gigabit ethernet adapters based on the Attansic/Atheros
 #       L1 PCI express gigabit ethernet controllers.
 # alc:  Support for Atheros AR8131/AR8132 PCIe ethernet controllers.
 # ale:  Support for Atheros AR8121/AR8113/AR8114 PCIe ethernet controllers.
 # ath:  Atheros a/b/g WiFi adapters (requires ath_hal and wlan)
 # bce:	Broadcom NetXtreme II (BCM5706/BCM5708) PCI/PCIe Gigabit Ethernet
 #       adapters.
 # bfe:	Broadcom BCM4401 Ethernet adapter.
 # bge:	Support for gigabit ethernet adapters based on the Broadcom
 #	BCM570x family of controllers, including the 3Com 3c996-T,
 #	the Netgear GA302T, the SysKonnect SK-9D21 and SK-9D41, and
 #	the embedded gigE NICs on Dell PowerEdge 2550 servers.
 # bxe:	Broadcom NetXtreme II (BCM57710/57711/57711E) PCIe 10Gb Ethernet
 #       adapters.
 # bwi:	Broadcom BCM430* and BCM431* family of wireless adapters.
 # bwn:	Broadcom BCM43xx family of wireless adapters.
 # cas:	Sun Cassini/Cassini+ and National Semiconductor DP83065 Saturn
 # cm:	Arcnet SMC COM90c26 / SMC COM90c56
 #	(and SMC COM90c66 in '56 compatibility mode) adapters.
 # cxgbe: Support for PCI express 10Gb/1Gb adapters based on the Chelsio T4
 #       (Terminator 4) ASIC.
 # dc:   Support for PCI fast ethernet adapters based on the DEC/Intel 21143
 #       and various workalikes including:
 #       the ADMtek AL981 Comet and AN985 Centaur, the ASIX Electronics
 #       AX88140A and AX88141, the Davicom DM9100 and DM9102, the Lite-On
 #       82c168 and 82c169 PNIC, the Lite-On/Macronix LC82C115 PNIC II
 #       and the Macronix 98713/98713A/98715/98715A/98725 PMAC. This driver
 #       replaces the old al, ax, dm, pn and mx drivers.  List of brands:
 #       Digital DE500-BA, Kingston KNE100TX, D-Link DFE-570TX, SOHOware SFA110,
 #       SVEC PN102-TX, CNet Pro110B, 120A, and 120B, Compex RL100-TX,
 #       LinkSys LNE100TX, LNE100TX V2.0, Jaton XpressNet, Alfa Inc GFC2204,
 #       KNE110TX.
 # de:   Digital Equipment DC21040
 # em:   Intel Pro/1000 Gigabit Ethernet 82542, 82543, 82544 based adapters.
 # igb:  Intel Pro/1000 PCI Express Gigabit Ethernet: 82575 and later adapters.
 # ep:   3Com 3C509, 3C529, 3C556, 3C562D, 3C563D, 3C572, 3C574X, 3C579, 3C589
 #       and PC Card devices using these chipsets.
 # ex:   Intel EtherExpress Pro/10 and other i82595-based adapters,
 #       Olicom Ethernet PC Card devices.
 # fe:   Fujitsu MB86960A/MB86965A Ethernet
 # fea:  DEC DEFEA EISA FDDI adapter
 # fpa:  Support for the Digital DEFPA PCI FDDI. `device fddi' is also needed.
 # fxp:  Intel EtherExpress Pro/100B
 #	(set the prefer_iomap hint to prefer I/O port over memory mapping)
 # gem:  Apple GMAC/Sun ERI/Sun GEM
 # hme:  Sun HME (Happy Meal Ethernet)
 # jme:  JMicron JMC260 Fast Ethernet/JMC250 Gigabit Ethernet based adapters.
 # le:   AMD Am7900 LANCE and Am79C9xx PCnet
 # lge:	Support for PCI gigabit ethernet adapters based on the Level 1
 #	LXT1001 NetCellerator chipset. This includes the D-Link DGE-500SX,
 #	SMC TigerCard 1000 (SMC9462SX), and some Addtron cards.
 # malo: Marvell Libertas wireless NICs.
 # mwl:  Marvell 88W8363 802.11n wireless NICs.
 # msk:	Support for gigabit ethernet adapters based on the Marvell/SysKonnect
 #	Yukon II Gigabit controllers, including 88E8021, 88E8022, 88E8061,
 #	88E8062, 88E8035, 88E8036, 88E8038, 88E8050, 88E8052, 88E8053,
 #	88E8055, 88E8056 and D-Link 560T/550SX.
 # lmc:	Support for the LMC/SBE wide-area network interface cards.
 # my:	Myson Fast Ethernet (MTD80X, MTD89X)
 # nge:	Support for PCI gigabit ethernet adapters based on the National
 #	Semiconductor DP83820 and DP83821 chipset. This includes the
 #	SMC EZ Card 1000 (SMC9462TX), D-Link DGE-500T, Asante FriendlyNet
 #	GigaNIX 1000TA and 1000TPC, the Addtron AEG320T, the Surecom
 #	EP-320G-TX and the Netgear GA622T.
 # pcn:	Support for PCI fast ethernet adapters based on the AMD Am79c97x
 #	PCnet-FAST, PCnet-FAST+, PCnet-FAST III, PCnet-PRO and PCnet-Home
 #	chipsets. These can also be handled by the le(4) driver if the
 #	pcn(4) driver is left out of the kernel. The le(4) driver does not
 #	support the additional features like the MII bus and burst mode of
 #	the PCnet-FAST and greater chipsets though.
 # ral:	Ralink Technology IEEE 802.11 wireless adapter
 # re:   RealTek 8139C+/8169/816xS/811xS/8101E PCI/PCIe Ethernet adapter
 # rl:   Support for PCI fast ethernet adapters based on the RealTek 8129/8139
 #       chipset.  Note that the RealTek driver defaults to using programmed
 #       I/O to do register accesses because memory mapped mode seems to cause
 #       severe lockups on SMP hardware.  This driver also supports the
 #       Accton EN1207D `Cheetah' adapter, which uses a chip called
 #       the MPX 5030/5038, which is either a RealTek in disguise or a
 #       RealTek workalike.  Note that the D-Link DFE-530TX+ uses the RealTek
 #       chipset and is supported by this driver, not the 'vr' driver.
 # sf:   Support for Adaptec Duralink PCI fast ethernet adapters based on the
 #       Adaptec AIC-6915 "starfire" controller.
 #       This includes dual and quad port cards, as well as one 100baseFX card.
 #       Most of these are 64-bit PCI devices, except for one single port
 #       card which is 32-bit.
 # sge:  Silicon Integrated Systems SiS190/191 Fast/Gigabit Ethernet adapter
 # sis:  Support for NICs based on the Silicon Integrated Systems SiS 900,
 #       SiS 7016 and NS DP83815 PCI fast ethernet controller chips.
 # sk:   Support for the SysKonnect SK-984x series PCI gigabit ethernet NICs.
 #       This includes the SK-9841 and SK-9842 single port cards (single mode
 #       and multimode fiber) and the SK-9843 and SK-9844 dual port cards
 #       (also single mode and multimode).
 #       The driver will autodetect the number of ports on the card and
 #       attach each one as a separate network interface.
 # sn:   Support for ISA and PC Card Ethernet devices using the
 #       SMC91C90/92/94/95 chips.
 # ste:  Sundance Technologies ST201 PCI fast ethernet controller, includes
 #       the D-Link DFE-550TX.
 # stge: Support for gigabit ethernet adapters based on the Sundance/Tamarack
 #       TC9021 family of controllers, including the Sundance ST2021/ST2023,
 #       the Sundance/Tamarack TC9021, the D-Link DL-4000 and ASUS NX1101.
 # ti:   Support for PCI gigabit ethernet NICs based on the Alteon Networks
 #       Tigon 1 and Tigon 2 chipsets.  This includes the Alteon AceNIC, the
 #       3Com 3c985, the Netgear GA620 and various others.  Note that you will
 #       probably want to bump up kern.ipc.nmbclusters a lot to use this driver.
 # tl:   Support for the Texas Instruments TNETE100 series 'ThunderLAN'
 #       cards and integrated ethernet controllers.  This includes several
 #       Compaq Netelligent 10/100 cards and the built-in ethernet controllers
 #       in several Compaq Prosignia, Proliant and Deskpro systems.  It also
 #       supports several Olicom 10Mbps and 10/100 boards.
 # tx:   SMC 9432 TX, BTX and FTX cards. (SMC EtherPower II series)
 # txp:	Support for 3Com 3cR990 cards with the "Typhoon" chipset
 # vr:   Support for various fast ethernet adapters based on the VIA
 #       Technologies VT3043 `Rhine I' and VT86C100A `Rhine II' chips,
 #       including the D-Link DFE520TX and D-Link DFE530TX (see 'rl' for
 #       DFE530TX+), the Hawking Technologies PN102TX, and the AOpen/Acer ALN-320.
 # vte:  DM&P Vortex86 RDC R6040 Fast Ethernet
 # vx:   3Com 3C590 and 3C595
 # wb:   Support for fast ethernet adapters based on the Winbond W89C840F chip.
 #       Note: this is not the same as the Winbond W89C940F, which is a
 #       NE2000 clone.
 # wi:   Lucent WaveLAN/IEEE 802.11 PCMCIA adapters. Note: this supports both
 #       the PCMCIA and ISA cards: the ISA card is really a PCMCIA to ISA
 #       bridge with a PCMCIA adapter plugged into it.
 # xe:   Xircom/Intel EtherExpress Pro100/16 PC Card ethernet controller,
 #       Accton Fast EtherCard-16, Compaq Netelligent 10/100 PC Card,
 #       Toshiba 10/100 Ethernet PC Card, Xircom 16-bit Ethernet + Modem 56
 # xl:   Support for the 3Com 3c900, 3c905, 3c905B and 3c905C (Fast)
 #       Etherlink XL cards and integrated controllers.  This includes the
 #       integrated 3c905B-TX chips in certain Dell Optiplex and Dell
 #       Precision desktop machines and the integrated 3c905-TX chips
 #       in Dell Latitude laptop docking stations.
 #       Also supported: 3Com 3c980(C)-TX, 3Com 3cSOHO100-TX, 3Com 3c450-TX
 
 # Order for ISA/EISA devices is important here
 
 device		cm
 hint.cm.0.at="isa"
 hint.cm.0.port="0x2e0"
 hint.cm.0.irq="9"
 hint.cm.0.maddr="0xdc000"
 device		ep
 device		ex
 device		fe
 hint.fe.0.at="isa"
 hint.fe.0.port="0x300"
 device		fea
 device		sn
 hint.sn.0.at="isa"
 hint.sn.0.port="0x300"
 hint.sn.0.irq="10"
 device		an
 device		wi
 device		xe
 
 # PCI Ethernet NICs that use the common MII bus controller code.
 device		ae		# Attansic/Atheros L2 FastEthernet
 device		age		# Attansic/Atheros L1 Gigabit Ethernet
 device		alc		# Atheros AR8131/AR8132 Ethernet
 device		ale		# Atheros AR8121/AR8113/AR8114 Ethernet
 device		bce		# Broadcom BCM5706/BCM5708 Gigabit Ethernet
 device		bfe		# Broadcom BCM440x 10/100 Ethernet
 device		bge		# Broadcom BCM570xx Gigabit Ethernet
 device		cas		# Sun Cassini/Cassini+ and NS DP83065 Saturn
 device		cxgb		# Chelsio T3 10 Gigabit Ethernet
 device		cxgb_t3fw	# Chelsio T3 10 Gigabit Ethernet firmware
 device		dc		# DEC/Intel 21143 and various workalikes
 device		et		# Agere ET1310 10/100/Gigabit Ethernet
 device		fxp		# Intel EtherExpress PRO/100B (82557, 82558)
 hint.fxp.0.prefer_iomap="0"
 device		gem		# Apple GMAC/Sun ERI/Sun GEM
 device		hme		# Sun HME (Happy Meal Ethernet)
 device		jme		# JMicron JMC250 Gigabit/JMC260 Fast Ethernet
 device		lge		# Level 1 LXT1001 gigabit Ethernet
 device		msk		# Marvell/SysKonnect Yukon II Gigabit Ethernet
 device		my		# Myson Fast Ethernet (MTD80X, MTD89X)
 device		nge		# NatSemi DP83820 gigabit Ethernet
 device		re		# RealTek 8139C+/8169/8169S/8110S
 device		rl		# RealTek 8129/8139
 device		pcn		# AMD Am79C97x PCI 10/100 NICs
 device		sf		# Adaptec AIC-6915 (``Starfire'')
 device		sge		# Silicon Integrated Systems SiS190/191
 device		sis		# Silicon Integrated Systems SiS 900/SiS 7016
 device		sk		# SysKonnect SK-984x & SK-982x gigabit Ethernet
 device		ste		# Sundance ST201 (D-Link DFE-550TX)
 device		stge		# Sundance/Tamarack TC9021 gigabit Ethernet
 device		tl		# Texas Instruments ThunderLAN
 device		tx		# SMC EtherPower II (83c170 ``EPIC'')
 device		vr		# VIA Rhine, Rhine II
 device		vte		# DM&P Vortex86 RDC R6040 Fast Ethernet
 device		wb		# Winbond W89C840F
 device		xl		# 3Com 3c90x (``Boomerang'', ``Cyclone'')
 
 # PCI Ethernet NICs.
 device		bxe		# Broadcom BCM57710/BCM57711/BCM57711E 10Gb Ethernet
 device		cxgbe		# Chelsio T4 10GbE PCIe adapter
 device		de		# DEC/Intel DC21x4x (``Tulip'')
 device		em		# Intel Pro/1000 Gigabit Ethernet
 device		igb		# Intel Pro/1000 PCIE Gigabit Ethernet
 device		ixgb		# Intel Pro/10Gbe PCI-X Ethernet
 device		ixgbe		# Intel Pro/10Gbe PCIE Ethernet
 device		le		# AMD Am7900 LANCE and Am79C9xx PCnet
 device		mxge		# Myricom Myri-10G 10GbE NIC
 device		nxge		# Neterion Xframe 10GbE Server/Storage Adapter
 device		ti		# Alteon Networks Tigon I/II gigabit Ethernet
 device		txp		# 3Com 3cR990 (``Typhoon'')
 device		vx		# 3Com 3c590, 3c595 (``Vortex'')
 device		vxge		# Exar/Neterion XFrame 3100 10GbE
 
 # PCI FDDI NICs.
 device		fpa
 
 # PCI WAN adapters.
 device		lmc
 
 # PCI IEEE 802.11 Wireless NICs
 device		ath		# Atheros pci/cardbus NICs
 device		ath_hal		# pci/cardbus chip support
 #device		ath_ar5210	# AR5210 chips
 #device		ath_ar5211	# AR5211 chips
 #device		ath_ar5212	# AR5212 chips
 #device		ath_rf2413
 #device		ath_rf2417
 #device		ath_rf2425
 #device		ath_rf5111
 #device		ath_rf5112
 #device		ath_rf5413
 #device		ath_ar5416	# AR5416 chips
 options 	AH_SUPPORT_AR5416	# enable AR5416 tx/rx descriptors
 # All of the AR5212 parts have a problem when paired with the AR71xx
 # CPUs.  These parts have a bug that triggers a fatal bus error on the AR71xx
 # only.  Details of the exact nature of the bug are sketchy, but some can be
 # found at https://forum.openwrt.org/viewtopic.php?pid=70060 on pages 4, 5 and
 # 6.  This option enables a workaround.  There is a performance penalty
 # for this workaround, but without it things don't work at all.  The DMA
 # from the card usually bursts 128 bytes, but on the affected CPUs, only
 # 4 are safe.
 options	   	AH_RXCFG_SDMAMW_4BYTES
 #device		ath_ar9160	# AR9160 chips
 #device		ath_ar9280	# AR9280 chips
 #device		ath_ar9285	# AR9285 chips
 device		ath_rate_sample	# SampleRate tx rate control for ath
 device		bwi		# Broadcom BCM430* BCM431*
 device		bwn		# Broadcom BCM43xx
 device		malo		# Marvell Libertas wireless NICs.
 device		mwl		# Marvell 88W8363 802.11n wireless NICs.
 device		ral		# Ralink Technology RT2500 wireless NICs.
 
 # Use "private" jumbo buffers allocated exclusively for the ti(4) driver.
 # This option is incompatible with the TI_JUMBO_HDRSPLIT option below.
 #options 	TI_PRIVATE_JUMBOS
 # Turn on the header splitting option for the ti(4) driver firmware.  This
 # only works for Tigon II chips, and has no effect for Tigon I chips.
 options 	TI_JUMBO_HDRSPLIT
 
 #
 # Use header splitting feature on bce(4) adapters.
 # This may help to reduce the amount of jumbo-sized memory buffers used.
 #
 options		BCE_JUMBO_HDRSPLIT
 
 # These two options allow manipulating the mbuf cluster size and mbuf size,
 # respectively.  Be very careful with NIC driver modules when changing
 # these from their default values, because that can potentially cause a
 # mismatch between the mbuf size assumed by the kernel and the mbuf size
 # assumed by a module.  The only driver that currently has the ability to
 # detect a mismatch is ti(4).
 options 	MCLSHIFT=12	# mbuf cluster shift in bits, 12 == 4KB
 options 	MSIZE=512	# mbuf size in bytes
 
 #
 # ATM related options (Cranor version)
 # (note: this driver cannot be used with the HARP ATM stack)
 #
 # The `en' device provides support for Efficient Networks (ENI)
 # ENI-155 PCI midway cards, and the Adaptec 155Mbps PCI ATM cards (ANA-59x0).
 #
 # The `hatm' device provides support for Fore/Marconi HE155 and HE622
 # ATM PCI cards.
 #
 # The `fatm' device provides support for Fore PCA200E ATM PCI cards.
 #
 # The `patm' device provides support for IDT77252 based cards like
 # ProSum's ProATM-155 and ProATM-25 and IDT's evaluation boards.
 #
 # atm device provides generic atm functions and is required for
 # atm devices.
 # NATM enables the netnatm protocol family that can be used to
 # bypass TCP/IP.
 #
 # utopia provides the access to the ATM PHY chips and is required for en,
 # hatm and fatm.
 #
 # The current driver supports only PVC operations (no atm-arp, no multicast).
 # For more details, please read the original documents at
 # http://www.ccrc.wustl.edu/pub/chuck/tech/bsdatm/bsdatm.html
 #
 device		atm
 device		en
 device		fatm			#Fore PCA200E
 device		hatm			#Fore/Marconi HE155/622
 device		patm			#IDT77252 cards (ProATM and IDT)
 device		utopia			#ATM PHY driver
 options 	NATM			#native ATM
 
 options 	LIBMBPOOL		#needed by patm, iatm
 
 #
 # Sound drivers
 #
 # sound: The generic sound driver.
 #
 
 device		sound
 
 #
 # snd_*: Device-specific drivers.
 #
 # The flags of the device tell the device a bit more info about the
 # device that normally is obtained through the PnP interface.
 #	bit  2..0   secondary DMA channel;
 #	bit  4      set if the board uses two dma channels;
 #	bit 15..8   board type, overrides autodetection; leave it
 #		    zero if you don't know what to put in (and you don't,
 #		    since this is unsupported at the moment...).
 #
 # snd_ad1816:		Analog Devices AD1816 ISA PnP/non-PnP.
 # snd_als4000:		Avance Logic ALS4000 PCI.
 # snd_atiixp:		ATI IXP 200/300/400 PCI.
 # snd_audiocs:		Crystal Semiconductor CS4231 SBus/EBus. Only
 #			for sparc64.
 # snd_cmi:		CMedia CMI8338/CMI8738 PCI.
 # snd_cs4281:		Crystal Semiconductor CS4281 PCI.
 # snd_csa:		Crystal Semiconductor CS461x/428x PCI. (except
 #			4281)
 # snd_ds1:		Yamaha DS-1 PCI.
 # snd_emu10k1:		Creative EMU10K1 PCI and EMU10K2 (Audigy) PCI.
 # snd_emu10kx:		Creative SoundBlaster Live! and Audigy
 # snd_envy24:		VIA Envy24 and compatible, needs snd_spicds.
 # snd_envy24ht:		VIA Envy24HT and compatible, needs snd_spicds.
 # snd_es137x:		Ensoniq AudioPCI ES137x PCI.
 # snd_ess:		Ensoniq ESS ISA PnP/non-PnP, to be used in
 #			conjunction with snd_sbc.
 # snd_fm801:		Forte Media FM801 PCI.
 # snd_gusc:		Gravis UltraSound ISA PnP/non-PnP.
 # snd_hda:		Intel High Definition Audio (Controller) and
 #			compatible.
 # snd_ich:		Intel ICH AC'97 and some more audio controllers
 #			embedded in a chipset, for example nVidia
 #			nForce controllers.
 # snd_maestro:		ESS Technology Maestro-1/2x PCI.
 # snd_maestro3:		ESS Technology Maestro-3/Allegro PCI.
 # snd_mss:		Microsoft Sound System ISA PnP/non-PnP.
 # snd_neomagic:		Neomagic 256 AV/ZX PCI.
 # snd_sb16:		Creative SoundBlaster16, to be used in
 #			conjunction with snd_sbc.
 # snd_sb8:		Creative SoundBlaster (pre-16), to be used in
 #			conjunction with snd_sbc.
 # snd_sbc:		Creative SoundBlaster ISA PnP/non-PnP.
 #			Supports ESS and Avance ISA chips as well.
 # snd_spicds:		SPI codec driver, needed by Envy24/Envy24HT drivers.
 # snd_solo:		ESS Solo-1x PCI.
 # snd_t4dwave:		Trident 4DWave DX/NX PCI, SiS 7018 PCI and Acer Labs
 #			M5451 PCI.
 # snd_via8233:		VIA VT8233x PCI.
 # snd_via82c686:	VIA VT82C686A PCI.
 # snd_vibes:		S3 Sonicvibes PCI.
 # snd_uaudio:		USB audio.
 
 device		snd_ad1816
 device		snd_als4000
 device		snd_atiixp
 #device		snd_audiocs
 device		snd_cmi
 device		snd_cs4281
 device		snd_csa
 device		snd_ds1
 device		snd_emu10k1
 device		snd_emu10kx
 device		snd_envy24
 device		snd_envy24ht
 device		snd_es137x
 device		snd_ess
 device		snd_fm801
 device		snd_gusc
 device		snd_hda
 device		snd_ich
 device		snd_maestro
 device		snd_maestro3
 device		snd_mss
 device		snd_neomagic
 device		snd_sb16
 device		snd_sb8
 device		snd_sbc
 device		snd_solo
 device		snd_spicds
 device		snd_t4dwave
 device		snd_via8233
 device		snd_via82c686
 device		snd_vibes
 device		snd_uaudio
 
 # For non-PnP sound cards:
 hint.pcm.0.at="isa"
 hint.pcm.0.irq="10"
 hint.pcm.0.drq="1"
 hint.pcm.0.flags="0x0"
 hint.sbc.0.at="isa"
 hint.sbc.0.port="0x220"
 hint.sbc.0.irq="5"
 hint.sbc.0.drq="1"
 hint.sbc.0.flags="0x15"
 hint.gusc.0.at="isa"
 hint.gusc.0.port="0x220"
 hint.gusc.0.irq="5"
 hint.gusc.0.drq="1"
 hint.gusc.0.flags="0x13"
 
 #
 # Following options are intended for debugging/testing purposes:
 #
 # SND_DEBUG                    Enable extra debugging code that includes
 #                              sanity checking and possible increase of
 #                              verbosity.
 #
 # SND_DIAGNOSTIC               Similar in spirit to INVARIANTS/DIAGNOSTIC,
 #                              zero tolerance against inconsistencies.
 #
 # SND_FEEDER_MULTIFORMAT       By default, only 16/32 bit feeders are compiled
 #                              in. This option enables most feeder converters
 #                              except for 8bit. WARNING: May bloat the kernel.
 #
 # SND_FEEDER_FULL_MULTIFORMAT  Ditto, but includes 8bit feeders as well.
 #
 # SND_FEEDER_RATE_HP           (feeder_rate) High precision 64bit arithmetic
 #                              as much as possible (the default tries to
 #                              avoid it). Possible slowdown.
 #
 # SND_PCM_64                   (Only applicable for i386/32bit arch)
 #                              Process 32bit samples through 64bit
 #                              integer/arithmetic. Slight increase of dynamic
 #                              range at a cost of possible slowdown.
 #
 # SND_OLDSTEREO                Only 2 channels are allowed, effectively
 #                              disabling multichannel processing.
 #
 options		SND_DEBUG
 options		SND_DIAGNOSTIC
 options		SND_FEEDER_MULTIFORMAT
 options		SND_FEEDER_FULL_MULTIFORMAT
 options		SND_FEEDER_RATE_HP
 options		SND_PCM_64
 options		SND_OLDSTEREO
 
 #
 # IEEE-488 hardware:
 # pcii:		PCIIA cards (uPD7210 based isa cards)
 # tnt4882:	National Instruments PCI-GPIB card.
 
 device	pcii
 hint.pcii.0.at="isa"
 hint.pcii.0.port="0x2e1"
 hint.pcii.0.irq="5"
 hint.pcii.0.drq="1"
 
 device	tnt4882
 
 #
 # Miscellaneous hardware:
 #
 # scd: Sony CD-ROM using proprietary (non-ATAPI) interface
 # mcd: Mitsumi CD-ROM using proprietary (non-ATAPI) interface
 # bktr: Brooktree bt848/848a/849a/878/879 video capture and TV Tuner board
 # joy: joystick (including IO DATA PCJOY PC Card joystick)
 # cmx: OmniKey CardMan 4040 pccard smartcard reader
 
 # Mitsumi CD-ROM
 device		mcd
 hint.mcd.0.at="isa"
 hint.mcd.0.port="0x300"
 # for the Sony CDU31/33A CDROM
 device		scd
 hint.scd.0.at="isa"
 hint.scd.0.port="0x230"
 device		joy			# PnP aware, hints for non-PnP only
 hint.joy.0.at="isa"
 hint.joy.0.port="0x201"
 device		cmx
 
 #
 # The 'bktr' device is a PCI video capture device using the Brooktree
 # bt848/bt848a/bt849a/bt878/bt879 chipset. When used with a TV Tuner it forms a
 # TV card, e.g. Miro PC/TV, Hauppauge WinCast/TV WinTV, VideoLogic Captivator,
 # Intel Smart Video III, AverMedia, IMS Turbo, FlyVideo.
 #
 # options 	OVERRIDE_CARD=xxx
 # options 	OVERRIDE_TUNER=xxx
 # options 	OVERRIDE_MSP=1
 # options 	OVERRIDE_DBX=1
 # These options can be used to override the auto detection.
 # The current values for xxx are found in src/sys/dev/bktr/bktr_card.h.
 # Run-time overrides on a per-card basis can be made using sysctl(8).
 #
 # options 	BROOKTREE_SYSTEM_DEFAULT=BROOKTREE_PAL
 # or
 # options 	BROOKTREE_SYSTEM_DEFAULT=BROOKTREE_NTSC
 # Specifies the default video capture mode.
 # This is required for Dual Crystal (28&35MHz) boards where PAL is used
 # to prevent hangs during initialisation, e.g. VideoLogic Captivator PCI.
 #
 # options 	BKTR_USE_PLL
 # This is required for PAL or SECAM boards with a 28MHz crystal and no 35MHz
 # crystal, e.g. some new Bt878 cards.
 #
 # options 	BKTR_GPIO_ACCESS
 # This enables IOCTLs which give user level access to the GPIO port.
 #
 # options 	BKTR_NO_MSP_RESET
 # Prevents the MSP34xx reset. Good if you initialise the MSP in another OS first.
 #
 # options 	BKTR_430_FX_MODE
 # Switch Bt878/879 cards into Intel 430FX chipset compatibility mode.
 #
 # options 	BKTR_SIS_VIA_MODE
 # Switch Bt878/879 cards into SIS/VIA chipset compatibility mode which is
 # needed for some old SiS and VIA chipset motherboards.
 # This also allows Bt878/879 chips to work on old OPTi (<1997) chipset
 # motherboards and motherboards with bad or incomplete PCI 2.1 support.
 # As a rough guess, old = before 1998
 #
 # options 	BKTR_NEW_MSP34XX_DRIVER
 # Use new, more complete initialization scheme for the msp34* soundchip.
 # Should fix stereo autodetection if the old driver only outputs
 # mono sound.
 
 #
 # options 	BKTR_USE_FREEBSD_SMBUS
 # Compile with FreeBSD SMBus implementation
 #
 # Brooktree driver has been ported to the new I2C framework. Thus,
 # you'll need to have the following 4 lines in the kernel config.
 #     device smbus
 #     device iicbus
 #     device iicbb
 #     device iicsmb
 # The iic and smb devices are only needed if you want to control other
 # I2C slaves connected to the external connector of some cards.
 #
 device		bktr
  
 #
 # PC Card/PCMCIA and Cardbus
 #
 # cbb: pci/cardbus bridge implementing YENTA interface
 # pccard: pccard slots
 # cardbus: cardbus slots
 device		cbb
 device		pccard
 device		cardbus
 
 #
 # MMC/SD
 #
 # mmc 		MMC/SD bus
 # mmcsd		MMC/SD memory card
 # sdhci		Generic PCI SD Host Controller
 #
 device		mmc
 device		mmcsd
 device		sdhci
 
 #
 # SMB bus
 #
 # System Management Bus support is provided by the 'smbus' device.
 # Access to the SMBus device is via the 'smb' device (/dev/smb*),
 # which is a child of the 'smbus' device.
 #
 # Supported devices:
 # smb		standard I/O through /dev/smb*
 #
 # Supported SMB interfaces:
 # iicsmb	I2C to SMB bridge with any iicbus interface
 # bktr		brooktree848 I2C hardware interface
 # intpm		Intel PIIX4 (82371AB, 82443MX) Power Management Unit
 # alpm		Acer Aladdin-IV/V/Pro2 Power Management Unit
 # ichsmb	Intel ICH SMBus controller chips (82801AA, 82801AB, 82801BA)
 # viapm		VIA VT82C586B/596B/686A and VT8233 Power Management Unit
 # amdpm		AMD 756 Power Management Unit
 # amdsmb	AMD 8111 SMBus 2.0 Controller
 # nfpm		NVIDIA nForce Power Management Unit
 # nfsmb		NVIDIA nForce2/3/4 MCP SMBus 2.0 Controller
 #
 device		smbus		# Bus support, required for smb below.
 
 device		intpm
 device		alpm
 device		ichsmb
 device		viapm
 device		amdpm
 device		amdsmb
 device		nfpm
 device		nfsmb
 
 device		smb
 
 #
 # I2C Bus
 #
 # Philips i2c bus support is provided by the `iicbus' device.
 #
 # Supported devices:
 # ic	i2c network interface
 # iic	i2c standard io
 # iicsmb i2c to smb bridge. Allows i2c i/o with smb commands.
 #
 # Supported interfaces:
 # bktr	brooktree848 I2C software interface
 #
 # Other:
 # iicbb	generic I2C bit-banging code (needed by lpbb, bktr)
 #
 device		iicbus		# Bus support, required for ic/iic/iicsmb below.
 device		iicbb
 
 device		ic
 device		iic
 device		iicsmb		# smb over i2c bridge
 
 # I2C peripheral devices
 #
 # ds133x	Dallas Semiconductor DS1337, DS1338 and DS1339 RTC
 # ds1672	Dallas Semiconductor DS1672 RTC
 #
 device		ds133x
 device		ds1672
 
 # Parallel-Port Bus
 #
 # Parallel port bus support is provided by the `ppbus' device.
 # Multiple devices may be attached to the parallel port; devices
 # are automatically probed and attached when found.
 #
 # Supported devices:
 # vpo	Iomega Zip Drive
 #	Requires SCSI disk support ('scbus' and 'da'), best
 #	performance is achieved with ports in EPP 1.9 mode.
 # lpt	Parallel Printer
 # plip	Parallel network interface
 # ppi	General-purpose I/O ("Geek Port") + IEEE1284 I/O
 # pps	Pulse per second Timing Interface
 # lpbb	Philips official parallel port I2C bit-banging interface
 # pcfclock Parallel port clock driver.
 #
 # Supported interfaces:
 # ppc	ISA-bus parallel port interfaces.
 #
 
 options 	PPC_PROBE_CHIPSET # Enable chipset specific detection
 				  # (see flags in ppc(4))
 options 	DEBUG_1284	# IEEE1284 signaling protocol debug
 options 	PERIPH_1284	# Makes your computer act as an IEEE1284
 				# compliant peripheral
 options 	DONTPROBE_1284	# Avoid boot detection of PnP parallel devices
 options 	VP0_DEBUG	# ZIP/ZIP+ debug
 options 	LPT_DEBUG	# Printer driver debug
 options 	PPC_DEBUG	# Parallel chipset level debug
 options 	PLIP_DEBUG	# Parallel network IP interface debug
 options 	PCFCLOCK_VERBOSE         # Verbose pcfclock driver
 options 	PCFCLOCK_MAX_RETRIES=5   # Maximum read tries (default 10)
 
 device		ppc
 hint.ppc.0.at="isa"
 hint.ppc.0.irq="7"
 device		ppbus
 device		vpo
 device		lpt
 device		plip
 device		ppi
 device		pps
 device		lpbb
 device		pcfclock
 
 # Kernel BOOTP support
 
 options 	BOOTP		# Use BOOTP to obtain IP address/hostname
 				# Requires NFSCLIENT and NFS_ROOT
 options 	BOOTP_NFSROOT	# NFS mount root filesystem using BOOTP info
 options 	BOOTP_NFSV3	# Use NFS v3 to NFS mount root
 options 	BOOTP_COMPAT	# Workaround for broken bootp daemons.
 options 	BOOTP_WIRED_TO=fxp0 # Use interface fxp0 for BOOTP
 options 	BOOTP_BLOCKSIZE=8192 # Override NFS block size
 
 #
 # Add software watchdog routines.
 #
 options 	SW_WATCHDOG
 
 #
 # Add the software deadlock resolver thread.
 #
 options 	DEADLKRES
 
 #
 # Disable swapping of stack pages.  This option removes all
 # code which actually performs swapping, so it's not possible to turn
 # it back on at run-time.
 #
 # This is sometimes usable for systems which don't have any swap space
 # (see also sysctls "vm.defer_swapspace_pageouts" and
 # "vm.disable_swapspace_pageouts")
 #
 #options 	NO_SWAPPING
 
 # Set the number of sf_bufs to allocate. sf_bufs are virtual buffers
 # for sendfile(2) that are used to map file VM pages, and normally
 # default to a quantity that is roughly 16*MAXUSERS+512. You would
 # typically want about 4 of these for each simultaneous file send.
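 #
 # Worked example (a sketch only; MAXUSERS=64 is a value assumed here for
 # illustration, not taken from this file): the default would be roughly
 # 16*64+512 = 1536 sf_bufs, which at about 4 per send would cover
 # around 1536/4 = 384 simultaneous file sends.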
 #
 options 	NSFBUFS=1024
 
 #
 # Enable extra debugging code for locks.  This stores the filename and
 # line of whatever acquired the lock in the lock itself, and changes a
 # number of function calls to pass around the relevant data.  This is
 # not at all useful unless you are debugging lock code.  Also note
 # that it is likely to break e.g. fstat(1) unless you recompile your
 # userland with -DDEBUG_LOCKS as well.
 #
 options 	DEBUG_LOCKS
 
 
 #####################################################################
 # USB support
 # UHCI controller
 device		uhci
 # OHCI controller
 device		ohci
 # EHCI controller
 device		ehci
 # XHCI controller
 device		xhci
 # SL811 Controller
 #device		slhci
 # General USB code (mandatory for USB)
 device		usb
 #
 # USB Double Bulk Pipe devices
 device		udbp
 # USB Fm Radio
 device		ufm
 # Human Interface Device (anything with buttons and dials)
 device		uhid
 # USB keyboard
 device		ukbd
 # USB printer
 device		ulpt
 # USB mass storage driver (Requires scbus and da)
 device		umass
 # USB mass storage driver for device-side mode
 device		usfs
 # USB support for Belkin F5U109 and Magic Control Technology serial adapters
 device		umct
 # USB modem support
 device		umodem
 # USB mouse
 device		ums
 # eGalax USB touch screen
 device		uep
 # Diamond Rio 500 MP3 player
 device		urio
 #
 # USB serial support
 device		ucom
 # USB support for 3G modem cards by Option, Novatel, Huawei and Sierra
 device		u3g
 # USB support for Technologies ARK3116 based serial adapters
 device		uark
 # USB support for Belkin F5U103 and compatible serial adapters
 device		ubsa
 # USB support for serial adapters based on the FT8U100AX and FT8U232AM
 device		uftdi
 # USB support for some Windows CE based serial communication.
 device		uipaq
 # USB support for Prolific PL-2303 serial adapters
 device		uplcom
 # USB support for Silicon Laboratories CP2101/CP2102 based USB serial adapters
 device		uslcom
 # USB Visor and Palm devices
 device		uvisor
 # USB serial support for DDI pocket's PHS
 device		uvscom
 #
 # ADMtek USB ethernet. Supports the LinkSys USB100TX,
 # the Billionton USB100, the Melco LU-ATX, the D-Link DSB-650TX
 # and the SMC 2202USB. Also works with the ADMtek AN986 Pegasus
 # eval board.
 device		aue
 
 # ASIX Electronics AX88172 USB 2.0 ethernet driver. Used in the
 # LinkSys USB200M and various other adapters.
 device		axe
 
 #
 # Devices which communicate using Ethernet over USB, particularly
 # Communication Device Class (CDC) Ethernet specification. Supports
 # Sharp Zaurus PDAs, some DOCSIS cable modems and so on.
 device		cdce
 #
 # CATC USB-EL1201A USB ethernet. Supports the CATC Netmate
 # and Netmate II, and the Belkin F5U111.
 device		cue
 #
 # Kawasaki LSI ethernet. Supports the LinkSys USB10T,
 # Entrega USB-NET-E45, Peracom Ethernet Adapter, the
 # 3Com 3c19250, the ADS Technologies USB-10BT, the ATen UC10T,
 # the Netgear EA101, the D-Link DSB-650, the SMC 2102USB
 # and 2104USB, and the Corega USB-T.
 device		kue
 #
 # RealTek RTL8150 USB to fast ethernet. Supports the Melco LUA-KTX
 # and the GREEN HOUSE GH-USB100B.
 device		rue
 #
 # Davicom DM9601E USB to fast ethernet. Supports the Corega FEther USB-TXC.
 device		udav
 #
 # Moschip MCS7730/MCS7840 USB to fast ethernet. Supports the Sitecom LN030.
 device		mos
 #
 # HSxPA devices from Option N.V.
 device		uhso
 
 #
 # Ralink Technology RT2501USB/RT2601USB wireless driver
 device		rum
 # Ralink Technology RT2700U/RT2800U/RT3000U wireless driver
 device		run
 #
 # Atheros AR5523 wireless driver
 device		uath
 #
 # Conexant/Intersil PrismGT wireless driver
 device		upgt
 #
 # Ralink Technology RT2500USB wireless driver
 device		ural
 #
 # Realtek RTL8187B/L wireless driver
 device		urtw
 #
 # ZyDas ZD1211/ZD1211B wireless driver
 device		zyd
 
 # 
 # debugging options for the USB subsystem
 #
 options 	USB_DEBUG
 options 	U3G_DEBUG
 
 # options for ukbd:
 options 	UKBD_DFLT_KEYMAP	# specify the built-in keymap
 makeoptions	UKBD_DFLT_KEYMAP=it.iso
 
 # options for uplcom:
 options 	UPLCOM_INTR_INTERVAL=100	# interrupt pipe interval
 						# in milliseconds
 
 # options for uvscom:
 options 	UVSCOM_DEFAULT_OPKTSIZE=8	# default output packet size
 options 	UVSCOM_INTR_INTERVAL=100	# interrupt pipe interval
 						# in milliseconds
 
 #####################################################################
 # FireWire support
 
 device		firewire	# FireWire bus code
 device		sbp		# SCSI over Firewire (Requires scbus and da)
 device		sbp_targ	# SBP-2 Target mode  (Requires scbus and targ)
 device		fwe		# Ethernet over FireWire (non-standard!)
 device		fwip		# IP over FireWire (RFC2734 and RFC3146)
 
 #####################################################################
 # dcons support (Dumb Console Device)
 
 device		dcons			# dumb console driver
 device		dcons_crom		# FireWire attachment
 options 	DCONS_BUF_SIZE=16384	# buffer size
 options 	DCONS_POLL_HZ=100	# polling rate
 options 	DCONS_FORCE_CONSOLE=0	# force to be the primary console
 options 	DCONS_FORCE_GDB=1	# force to be the gdb device
 
 #####################################################################
 # crypto subsystem
 #
 # This is a port of the OpenBSD crypto framework.  Include this when
 # configuring IPSEC and when you have a h/w crypto device to accelerate
 # user applications that link to OpenSSL.
 #
 # Drivers are ports from OpenBSD with some simple enhancements that have
 # been fed back to OpenBSD.
 
 device		crypto		# core crypto support
 device		cryptodev	# /dev/crypto for access to h/w
 
 device		rndtest		# FIPS 140-2 entropy tester
 
 device		hifn		# Hifn 7951, 7781, etc.
 options 	HIFN_DEBUG	# enable debugging support: hw.hifn.debug
 options 	HIFN_RNDTEST	# enable rndtest support
 
 device		ubsec		# Broadcom 5501, 5601, 58xx
 options 	UBSEC_DEBUG	# enable debugging support: hw.ubsec.debug
 options 	UBSEC_RNDTEST	# enable rndtest support
 
 #####################################################################
 
 
 #
 # Embedded system options:
 #
 # An embedded system might want to run something other than init.
 options 	INIT_PATH=/sbin/init:/stand/sysinstall
 
 # Debug options
 options 	BUS_DEBUG	# enable newbus debugging
 options 	DEBUG_VFS_LOCKS	# enable VFS lock debugging
 options 	SOCKBUF_DEBUG	# enable sockbuf last record/mb tail checking
 
 #
 # Verbose SYSINIT
 #
 # Make the SYSINIT process performed by mi_startup() verbose.  This is very
 # useful when porting to a new architecture.  If DDB is also enabled, this
 # will print function names instead of addresses.
 options 	VERBOSE_SYSINIT
 
 #####################################################################
 # SYSV IPC KERNEL PARAMETERS
 #
 # Maximum number of entries in a semaphore map.
 options 	SEMMAP=31
 
 # Maximum number of System V semaphores that can be used on the system at
 # one time.
 options 	SEMMNI=11
 
 # Total number of semaphores system wide
 options 	SEMMNS=61
 
 # Total number of undo structures in system
 options 	SEMMNU=31
 
 # Maximum number of System V semaphores that can be used by a single process
 # at one time.
 options 	SEMMSL=61
 
 # Maximum number of operations that can be outstanding on a single System V
 # semaphore at one time.
 options 	SEMOPM=101
 
 # Maximum number of undo operations that can be outstanding on a single
 # System V semaphore at one time.
 options 	SEMUME=11
 
 # Maximum number of shared memory pages system wide.
 options 	SHMALL=1025
 
 # Maximum size, in bytes, of a single System V shared memory region.
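 # Worked example (a sketch only; a 4096-byte PAGE_SIZE is assumed here for
 # illustration and varies by architecture): with SHMMAXPGS=1025 as set
 # below, the SHMMAX expression evaluates to 1025*4096+1 = 4198401 bytes.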
 options 	SHMMAX=(SHMMAXPGS*PAGE_SIZE+1)
 options 	SHMMAXPGS=1025
 
 # Minimum size, in bytes, of a single System V shared memory region.
 options 	SHMMIN=2
 
 # Maximum number of shared memory regions that can be used on the system
 # at one time.
 options 	SHMMNI=33
 
 # Maximum number of System V shared memory regions that can be attached to
 # a single process at one time.
 options 	SHMSEG=9
 
 # Compress user core dumps.
 options		COMPRESS_USER_CORES
 # Required to compress file output from kernel for COMPRESS_USER_CORES.
 device		gzio	    
 
 # Set the amount of time (in seconds) the system will wait before
 # rebooting automatically when a kernel panic occurs.  If set to (-1),
 # the system will wait indefinitely until a key is pressed on the
 # console.
 options 	PANIC_REBOOT_WAIT_TIME=16
 
 # Attempt to bypass the buffer cache and put data directly into the
 # userland buffer for read operation when O_DIRECT flag is set on the
 # file.  Both offset and length of the read operation must be
 # multiples of the physical media sector size.
 #
 options 	DIRECTIO
 
 # Specify a lower limit for the number of swap I/O buffers.  They are
 # (among other things) used when bypassing the buffer cache due to
 # DIRECTIO kernel option enabled and O_DIRECT flag set on file.
 #
 options 	NSWBUF_MIN=120
 
 #####################################################################
 
 # More undocumented options for linting.
 # Note that documenting these is not considered an affront.
 
 options 	CAM_DEBUG_DELAY
 
 # VFS cluster debugging.
 options 	CLUSTERDEBUG
 
 options 	DEBUG
 
 # Kernel filelock debugging.
 options 	LOCKF_DEBUG
 
 # System V compatible message queues
 # Please note that the values provided here are used to test kernel
 # building.  The defaults in the sources provide almost the same numbers.
 # MSGSSZ must be a power of 2 between 8 and 1024.
 options 	MSGMNB=2049	# Max number of chars in queue
 options 	MSGMNI=41	# Max number of message queue identifiers
 options 	MSGSEG=2049	# Max number of message segments
 options 	MSGSSZ=16	# Size of a message segment
 options 	MSGTQL=41	# Max number of messages in system
 
 options 	NBUF=512	# Number of buffer headers
 
 options 	SCSI_NCR_DEBUG
 options 	SCSI_NCR_MAX_SYNC=10000
 options 	SCSI_NCR_MAX_WIDE=1
 options 	SCSI_NCR_MYADDR=7
 
 options 	SC_DEBUG_LEVEL=5	# Syscons debug level
 options 	SC_RENDER_DEBUG	# syscons rendering debugging
 
 options 	SHOW_BUSYBUFS	# List buffers that prevent root unmount
 options 	VFS_BIO_DEBUG	# VFS buffer I/O debugging
 
 options 	KSTACK_MAX_PAGES=32 # Maximum pages to give the kernel stack
 
 # Adaptec Array Controller driver options
 options 	AAC_DEBUG	# Debugging levels:
 				# 0 - quiet, only emit warnings
 				# 1 - noisy, emit major function
 				#     points and things done
 				# 2 - extremely noisy, emit trace
 				#     items in loops, etc.
 
 # Resource Accounting
 options 	RACCT
 
 # Resource Limits
 options 	RCTL
 
 # Yet more undocumented options for linting.
 # BKTR_ALLOC_PAGES has no effect except to cause warnings, and
 # BROOKTREE_ALLOC_PAGES hasn't actually been superseded by it, since the
 # driver still mostly spells this option BROOKTREE_ALLOC_PAGES.
 ##options 	BKTR_ALLOC_PAGES=(217*4+1)
 options 	BROOKTREE_ALLOC_PAGES=(217*4+1)
 options 	MAXFILES=999
 
Index: head/sys/conf
===================================================================
--- head/sys/conf	(revision 222812)
+++ head/sys/conf	(revision 222813)

Property changes on: head/sys/conf
___________________________________________________________________
Modified: svn:mergeinfo
## -0,0 +0,1 ##
   Merged /projects/largeSMP/sys/conf:r221273-222812
Index: head/sys/contrib/dev/acpica
===================================================================
--- head/sys/contrib/dev/acpica	(revision 222812)
+++ head/sys/contrib/dev/acpica	(revision 222813)

Property changes on: head/sys/contrib/dev/acpica
___________________________________________________________________
Modified: svn:mergeinfo
## -0,0 +0,1 ##
   Merged /projects/largeSMP/sys/contrib/dev/acpica:r221273-222812
Index: head/sys/contrib/octeon-sdk
===================================================================
--- head/sys/contrib/octeon-sdk	(revision 222812)
+++ head/sys/contrib/octeon-sdk	(revision 222813)

Property changes on: head/sys/contrib/octeon-sdk
___________________________________________________________________
Modified: svn:mergeinfo
## -0,0 +0,1 ##
   Merged /projects/largeSMP/sys/contrib/octeon-sdk:r221273-222812
Index: head/sys/contrib/pf
===================================================================
--- head/sys/contrib/pf	(revision 222812)
+++ head/sys/contrib/pf	(revision 222813)

Property changes on: head/sys/contrib/pf
___________________________________________________________________
Modified: svn:mergeinfo
## -0,0 +0,1 ##
   Merged /projects/largeSMP/sys/contrib/pf:r221273-222812
Index: head/sys/contrib/x86emu
===================================================================
--- head/sys/contrib/x86emu	(revision 222812)
+++ head/sys/contrib/x86emu	(revision 222813)

Property changes on: head/sys/contrib/x86emu
___________________________________________________________________
Modified: svn:mergeinfo
## -0,0 +0,1 ##
   Merged /projects/largeSMP/sys/contrib/x86emu:r221273-222812
Index: head/sys/dev/hwpmc/hwpmc_mod.c
===================================================================
--- head/sys/dev/hwpmc/hwpmc_mod.c	(revision 222812)
+++ head/sys/dev/hwpmc/hwpmc_mod.c	(revision 222813)
@@ -1,4949 +1,4949 @@
 /*-
  * Copyright (c) 2003-2008 Joseph Koshy
  * Copyright (c) 2007 The FreeBSD Foundation
  * All rights reserved.
  *
  * Portions of this software were developed by A. Joseph Koshy under
  * sponsorship from the FreeBSD Foundation and Google, Inc.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
  * 1. Redistributions of source code must retain the above copyright
  *    notice, this list of conditions and the following disclaimer.
  * 2. Redistributions in binary form must reproduce the above copyright
  *    notice, this list of conditions and the following disclaimer in the
  *    documentation and/or other materials provided with the distribution.
  *
  * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
  * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
  * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
  * ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
  * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
  * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
  * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
  * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
  * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
  * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
  * SUCH DAMAGE.
  *
  */
 
 #include <sys/cdefs.h>
 __FBSDID("$FreeBSD$");
 
 #include <sys/param.h>
 #include <sys/eventhandler.h>
 #include <sys/jail.h>
 #include <sys/kernel.h>
 #include <sys/kthread.h>
 #include <sys/limits.h>
 #include <sys/lock.h>
 #include <sys/malloc.h>
 #include <sys/module.h>
 #include <sys/mount.h>
 #include <sys/mutex.h>
 #include <sys/pmc.h>
 #include <sys/pmckern.h>
 #include <sys/pmclog.h>
 #include <sys/priv.h>
 #include <sys/proc.h>
 #include <sys/queue.h>
 #include <sys/resourcevar.h>
 #include <sys/sched.h>
 #include <sys/signalvar.h>
 #include <sys/smp.h>
 #include <sys/sx.h>
 #include <sys/sysctl.h>
 #include <sys/sysent.h>
 #include <sys/systm.h>
 #include <sys/vnode.h>
 
 #include <sys/linker.h>		/* needs to be after <sys/malloc.h> */
 
 #include <machine/atomic.h>
 #include <machine/md_var.h>
 
 #include <vm/vm.h>
 #include <vm/vm_extern.h>
 #include <vm/pmap.h>
 #include <vm/vm_map.h>
 #include <vm/vm_object.h>
 
 /*
  * Types
  */
 
 enum pmc_flags {
 	PMC_FLAG_NONE	  = 0x00, /* do nothing */
 	PMC_FLAG_REMOVE   = 0x01, /* atomically remove entry from hash */
 	PMC_FLAG_ALLOCATE = 0x02, /* add entry to hash if not found */
 };
 
 /*
  * The offset in sysent where the syscall is allocated.
  */
 
 static int pmc_syscall_num = NO_SYSCALL;
 struct pmc_cpu		**pmc_pcpu;	 /* per-cpu state */
 pmc_value_t		*pmc_pcpu_saved; /* saved PMC values: CSW handling */
 
 #define	PMC_PCPU_SAVED(C,R)	pmc_pcpu_saved[(R) + md->pmd_npmc*(C)]
 
 struct mtx_pool		*pmc_mtxpool;
 static int		*pmc_pmcdisp;	 /* PMC row dispositions */
 
 #define	PMC_ROW_DISP_IS_FREE(R)		(pmc_pmcdisp[(R)] == 0)
 #define	PMC_ROW_DISP_IS_THREAD(R)	(pmc_pmcdisp[(R)] > 0)
 #define	PMC_ROW_DISP_IS_STANDALONE(R)	(pmc_pmcdisp[(R)] < 0)
 
 #define	PMC_MARK_ROW_FREE(R) do {					  \
 	pmc_pmcdisp[(R)] = 0;						  \
 } while (0)
 
 #define	PMC_MARK_ROW_STANDALONE(R) do {					  \
 	KASSERT(pmc_pmcdisp[(R)] <= 0, ("[pmc,%d] row disposition error", \
 		    __LINE__));						  \
 	atomic_add_int(&pmc_pmcdisp[(R)], -1);				  \
 	KASSERT(pmc_pmcdisp[(R)] >= (-pmc_cpu_max_active()),		  \
 		("[pmc,%d] row disposition error", __LINE__));		  \
 } while (0)
 
 #define	PMC_UNMARK_ROW_STANDALONE(R) do { 				  \
 	atomic_add_int(&pmc_pmcdisp[(R)], 1);				  \
 	KASSERT(pmc_pmcdisp[(R)] <= 0, ("[pmc,%d] row disposition error", \
 		    __LINE__));						  \
 } while (0)
 
 #define	PMC_MARK_ROW_THREAD(R) do {					  \
 	KASSERT(pmc_pmcdisp[(R)] >= 0, ("[pmc,%d] row disposition error", \
 		    __LINE__));						  \
 	atomic_add_int(&pmc_pmcdisp[(R)], 1);				  \
 } while (0)
 
 #define	PMC_UNMARK_ROW_THREAD(R) do {					  \
 	atomic_add_int(&pmc_pmcdisp[(R)], -1);				  \
 	KASSERT(pmc_pmcdisp[(R)] >= 0, ("[pmc,%d] row disposition error", \
 		    __LINE__));						  \
 } while (0)
 
 
 /* various event handlers */
 static eventhandler_tag	pmc_exit_tag, pmc_fork_tag;
 
 /* Module statistics */
 struct pmc_op_getdriverstats pmc_stats;
 
 /* Machine/processor dependent operations */
 static struct pmc_mdep  *md;
 
 /*
  * Hash tables mapping owner processes and target threads to PMCs.
  */
 
 struct mtx pmc_processhash_mtx;		/* spin mutex */
 static u_long pmc_processhashmask;
 static LIST_HEAD(pmc_processhash, pmc_process)	*pmc_processhash;
 
 /*
  * Hash table of PMC owner descriptors.  This table is protected by
  * the shared PMC "sx" lock.
  */
 
 static u_long pmc_ownerhashmask;
 static LIST_HEAD(pmc_ownerhash, pmc_owner)	*pmc_ownerhash;
 
 /*
  * List of PMC owners with system-wide sampling PMCs.
  */
 
 static LIST_HEAD(, pmc_owner)			pmc_ss_owners;
 
 
 /*
  * A map of row indices to classdep structures.
  */
 static struct pmc_classdep **pmc_rowindex_to_classdep;
 
 /*
  * Prototypes
  */
 
 #ifdef	DEBUG
 static int	pmc_debugflags_sysctl_handler(SYSCTL_HANDLER_ARGS);
 static int	pmc_debugflags_parse(char *newstr, char *fence);
 #endif
 
 static int	load(struct module *module, int cmd, void *arg);
 static int	pmc_attach_process(struct proc *p, struct pmc *pm);
 static struct pmc *pmc_allocate_pmc_descriptor(void);
 static struct pmc_owner *pmc_allocate_owner_descriptor(struct proc *p);
 static int	pmc_attach_one_process(struct proc *p, struct pmc *pm);
 static int	pmc_can_allocate_rowindex(struct proc *p, unsigned int ri,
     int cpu);
 static int	pmc_can_attach(struct pmc *pm, struct proc *p);
 static void	pmc_capture_user_callchain(int cpu, struct trapframe *tf);
 static void	pmc_cleanup(void);
 static int	pmc_detach_process(struct proc *p, struct pmc *pm);
 static int	pmc_detach_one_process(struct proc *p, struct pmc *pm,
     int flags);
 static void	pmc_destroy_owner_descriptor(struct pmc_owner *po);
 static struct pmc_owner *pmc_find_owner_descriptor(struct proc *p);
 static int	pmc_find_pmc(pmc_id_t pmcid, struct pmc **pm);
 static struct pmc *pmc_find_pmc_descriptor_in_process(struct pmc_owner *po,
     pmc_id_t pmc);
 static struct pmc_process *pmc_find_process_descriptor(struct proc *p,
     uint32_t mode);
 static void	pmc_force_context_switch(void);
 static void	pmc_link_target_process(struct pmc *pm,
     struct pmc_process *pp);
 static void	pmc_log_all_process_mappings(struct pmc_owner *po);
 static void	pmc_log_kernel_mappings(struct pmc *pm);
 static void	pmc_log_process_mappings(struct pmc_owner *po, struct proc *p);
 static void	pmc_maybe_remove_owner(struct pmc_owner *po);
 static void	pmc_process_csw_in(struct thread *td);
 static void	pmc_process_csw_out(struct thread *td);
 static void	pmc_process_exit(void *arg, struct proc *p);
 static void	pmc_process_fork(void *arg, struct proc *p1,
     struct proc *p2, int n);
 static void	pmc_process_samples(int cpu);
 static void	pmc_release_pmc_descriptor(struct pmc *pmc);
 static void	pmc_remove_owner(struct pmc_owner *po);
 static void	pmc_remove_process_descriptor(struct pmc_process *pp);
 static void	pmc_restore_cpu_binding(struct pmc_binding *pb);
 static void	pmc_save_cpu_binding(struct pmc_binding *pb);
 static void	pmc_select_cpu(int cpu);
 static int	pmc_start(struct pmc *pm);
 static int	pmc_stop(struct pmc *pm);
 static int	pmc_syscall_handler(struct thread *td, void *syscall_args);
 static void	pmc_unlink_target_process(struct pmc *pmc,
     struct pmc_process *pp);
 
 /*
  * Kernel tunables and sysctl(8) interface.
  */
 
 SYSCTL_NODE(_kern, OID_AUTO, hwpmc, CTLFLAG_RW, 0, "HWPMC parameters");
 
 static int pmc_callchaindepth = PMC_CALLCHAIN_DEPTH;
 TUNABLE_INT(PMC_SYSCTL_NAME_PREFIX "callchaindepth", &pmc_callchaindepth);
 SYSCTL_INT(_kern_hwpmc, OID_AUTO, callchaindepth, CTLFLAG_TUN|CTLFLAG_RD,
     &pmc_callchaindepth, 0, "depth of call chain records");
 
 #ifdef	DEBUG
 struct pmc_debugflags pmc_debugflags = PMC_DEBUG_DEFAULT_FLAGS;
 char	pmc_debugstr[PMC_DEBUG_STRSIZE];
 TUNABLE_STR(PMC_SYSCTL_NAME_PREFIX "debugflags", pmc_debugstr,
     sizeof(pmc_debugstr));
 SYSCTL_PROC(_kern_hwpmc, OID_AUTO, debugflags,
     CTLTYPE_STRING|CTLFLAG_RW|CTLFLAG_TUN,
     0, 0, pmc_debugflags_sysctl_handler, "A", "debug flags");
 #endif
 
 /*
  * kern.hwpmc.hashsize -- determines the number of rows in the
  * hash table used to look up threads
  */
 
 static int pmc_hashsize = PMC_HASH_SIZE;
 TUNABLE_INT(PMC_SYSCTL_NAME_PREFIX "hashsize", &pmc_hashsize);
 SYSCTL_INT(_kern_hwpmc, OID_AUTO, hashsize, CTLFLAG_TUN|CTLFLAG_RD,
     &pmc_hashsize, 0, "rows in hash tables");
 
 /*
  * kern.hwpmc.nsamples --- number of PC samples/callchain stacks per CPU
  */
 
 static int pmc_nsamples = PMC_NSAMPLES;
 TUNABLE_INT(PMC_SYSCTL_NAME_PREFIX "nsamples", &pmc_nsamples);
 SYSCTL_INT(_kern_hwpmc, OID_AUTO, nsamples, CTLFLAG_TUN|CTLFLAG_RD,
     &pmc_nsamples, 0, "number of PC samples per CPU");
 
 
 /*
  * kern.hwpmc.mtxpoolsize -- number of mutexes in the mutex pool.
  */
 
 static int pmc_mtxpool_size = PMC_MTXPOOL_SIZE;
 TUNABLE_INT(PMC_SYSCTL_NAME_PREFIX "mtxpoolsize", &pmc_mtxpool_size);
 SYSCTL_INT(_kern_hwpmc, OID_AUTO, mtxpoolsize, CTLFLAG_TUN|CTLFLAG_RD,
     &pmc_mtxpool_size, 0, "size of spin mutex pool");
 
 
 /*
  * security.bsd.unprivileged_syspmcs -- allow non-root processes to
  * allocate system-wide PMCs.
  *
  * Allowing unprivileged processes to allocate system PMCs is convenient
  * if system-wide measurements need to be taken concurrently with other
  * per-process measurements.  This feature is turned off by default.
  */
 
 static int pmc_unprivileged_syspmcs = 0;
 TUNABLE_INT("security.bsd.unprivileged_syspmcs", &pmc_unprivileged_syspmcs);
 SYSCTL_INT(_security_bsd, OID_AUTO, unprivileged_syspmcs, CTLFLAG_RW,
     &pmc_unprivileged_syspmcs, 0,
     "allow unprivileged process to allocate system PMCs");
 
 /*
  * Hash function.  Discard the lower 2 bits of the pointer since
  * these are always zero for our uses.  The hash multiplier is
  * round((2^LONG_BIT) * ((sqrt(5)-1)/2)).
  */
 
 #if	LONG_BIT == 64
 #define	_PMC_HM		11400714819323198486u
 #elif	LONG_BIT == 32
 #define	_PMC_HM		2654435769u
 #else
 #error 	Must know the size of 'long' to compile
 #endif
 
 #define	PMC_HASH_PTR(P,M)	((((unsigned long) (P) >> 2) * _PMC_HM) & (M))
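 
 /*
  * Worked example (illustrative values, not taken from this file):
  * with LONG_BIT == 32, a pointer value of 0x1004 and a mask of 0xf,
  *
  *	0x1004 >> 2			== 0x401
  *	(0x401 * 2654435769u) mod 2^32	== 0x7c1e5db9
  *	0x7c1e5db9 & 0xf		== 0x9
  *
  * so the pointer hashes to row 9 of a 16-row table.
  */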
 
 /*
  * Syscall structures
  */
 
 /* The `sysent' for the new syscall */
 static struct sysent pmc_sysent = {
 	2,			/* sy_narg */
 	pmc_syscall_handler	/* sy_call */
 };
 
 static struct syscall_module_data pmc_syscall_mod = {
 	load,
 	NULL,
 	&pmc_syscall_num,
 	&pmc_sysent,
 	{ 0, NULL }
 };
 
 static moduledata_t pmc_mod = {
 	PMC_MODULE_NAME,
 	syscall_module_handler,
 	&pmc_syscall_mod
 };
 
 DECLARE_MODULE(pmc, pmc_mod, SI_SUB_SMP, SI_ORDER_ANY);
 MODULE_VERSION(pmc, PMC_VERSION);
 
 #ifdef	DEBUG
 enum pmc_dbgparse_state {
 	PMCDS_WS,		/* in whitespace */
 	PMCDS_MAJOR,		/* seen a major keyword */
 	PMCDS_MINOR
 };
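 
 /*
  * Parse a debug flag string and install the resulting flag set.
  *
  * As parsed below, the string is a whitespace-separated list of
  * groups of the form "major=minor[,minor...]", where a minor flag
  * of "*" selects every flag in that group.  For example (an
  * illustrative value, not one taken from this file):
  *
  *	cpu=bind,select pmc=*
  */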
 
 static int
 pmc_debugflags_parse(char *newstr, char *fence)
 {
 	char c, *p, *q;
 	struct pmc_debugflags *tmpflags;
 	int error, found, *newbits, tmp;
 	size_t kwlen;
 
 	tmpflags = malloc(sizeof(*tmpflags), M_PMC, M_WAITOK|M_ZERO);
 
 	p = newstr;
 	error = 0;
 
 	for (; p < fence && (c = *p); p++) {
 
 		/* skip white space */
 		if (c == ' ' || c == '\t')
 			continue;
 
 		/* look for a keyword followed by "=" */
 		for (q = p; p < fence && (c = *p) && c != '='; p++)
 			;
 		if (c != '=') {
 			error = EINVAL;
 			goto done;
 		}
 
 		kwlen = p - q;
 		newbits = NULL;
 
 		/* lookup flag group name */
 #define	DBG_SET_FLAG_MAJ(S,F)						\
 		if (kwlen == sizeof(S)-1 && strncmp(q, S, kwlen) == 0)	\
 			newbits = &tmpflags->pdb_ ## F;
 
 		DBG_SET_FLAG_MAJ("cpu",		CPU);
 		DBG_SET_FLAG_MAJ("csw",		CSW);
 		DBG_SET_FLAG_MAJ("logging",	LOG);
 		DBG_SET_FLAG_MAJ("module",	MOD);
 		DBG_SET_FLAG_MAJ("md", 		MDP);
 		DBG_SET_FLAG_MAJ("owner",	OWN);
 		DBG_SET_FLAG_MAJ("pmc",		PMC);
 		DBG_SET_FLAG_MAJ("process",	PRC);
 		DBG_SET_FLAG_MAJ("sampling", 	SAM);
 
 		if (newbits == NULL) {
 			error = EINVAL;
 			goto done;
 		}
 
 		p++;		/* skip the '=' */
 
 		/* Now parse the individual flags */
 		tmp = 0;
 	newflag:
 		for (q = p; p < fence && (c = *p); p++)
 			if (c == ' ' || c == '\t' || c == ',')
 				break;
 
 		/* p == fence or c == ws or c == "," or c == 0 */
 
 		if ((kwlen = p - q) == 0) {
 			*newbits = tmp;
 			continue;
 		}
 
 		found = 0;
 #define	DBG_SET_FLAG_MIN(S,F)						\
 		if (kwlen == sizeof(S)-1 && strncmp(q, S, kwlen) == 0)	\
 			tmp |= found = (1 << PMC_DEBUG_MIN_ ## F)
 
 		/* a '*' denotes all possible flags in the group */
 		if (kwlen == 1 && *q == '*')
 			tmp = found = ~0;
 		/* look for individual flag names */
 		DBG_SET_FLAG_MIN("allocaterow", ALR);
 		DBG_SET_FLAG_MIN("allocate",	ALL);
 		DBG_SET_FLAG_MIN("attach",	ATT);
 		DBG_SET_FLAG_MIN("bind",	BND);
 		DBG_SET_FLAG_MIN("config",	CFG);
 		DBG_SET_FLAG_MIN("exec",	EXC);
 		DBG_SET_FLAG_MIN("exit",	EXT);
 		DBG_SET_FLAG_MIN("find",	FND);
 		DBG_SET_FLAG_MIN("flush",	FLS);
 		DBG_SET_FLAG_MIN("fork",	FRK);
 		DBG_SET_FLAG_MIN("getbuf",	GTB);
 		DBG_SET_FLAG_MIN("hook",	PMH);
 		DBG_SET_FLAG_MIN("init",	INI);
 		DBG_SET_FLAG_MIN("intr",	INT);
 		DBG_SET_FLAG_MIN("linktarget",	TLK);
 		DBG_SET_FLAG_MIN("mayberemove", OMR);
 		DBG_SET_FLAG_MIN("ops",		OPS);
 		DBG_SET_FLAG_MIN("read",	REA);
 		DBG_SET_FLAG_MIN("register",	REG);
 		DBG_SET_FLAG_MIN("release",	REL);
 		DBG_SET_FLAG_MIN("remove",	ORM);
 		DBG_SET_FLAG_MIN("sample",	SAM);
 		DBG_SET_FLAG_MIN("scheduleio",	SIO);
 		DBG_SET_FLAG_MIN("select",	SEL);
 		DBG_SET_FLAG_MIN("signal",	SIG);
 		DBG_SET_FLAG_MIN("swi",		SWI);
 		DBG_SET_FLAG_MIN("swo",		SWO);
 		DBG_SET_FLAG_MIN("start",	STA);
 		DBG_SET_FLAG_MIN("stop",	STO);
 		DBG_SET_FLAG_MIN("syscall",	PMS);
 		DBG_SET_FLAG_MIN("unlinktarget", TUL);
 		DBG_SET_FLAG_MIN("write",	WRI);
 		if (found == 0) {
 			/* unrecognized flag name */
 			error = EINVAL;
 			goto done;
 		}
 
 		if (c == 0 || c == ' ' || c == '\t') {	/* end of flag group */
 			*newbits = tmp;
 			continue;
 		}
 
 		p++;
 		goto newflag;
 	}
 
 	/* save the new flag set */
 	bcopy(tmpflags, &pmc_debugflags, sizeof(pmc_debugflags));
 
  done:
 	free(tmpflags, M_PMC);
 	return error;
 }
 
 static int
 pmc_debugflags_sysctl_handler(SYSCTL_HANDLER_ARGS)
 {
 	char *fence, *newstr;
 	int error;
 	unsigned int n;
 
 	(void) arg1; (void) arg2; /* unused parameters */
 
 	n = sizeof(pmc_debugstr);
 	newstr = malloc(n, M_PMC, M_WAITOK|M_ZERO);
 	(void) strlcpy(newstr, pmc_debugstr, n);
 
 	error = sysctl_handle_string(oidp, newstr, n, req);
 
 	/* if there is a new string, parse and copy it */
 	if (error == 0 && req->newptr != NULL) {
 		fence = newstr + (n < req->newlen ? n : req->newlen + 1);
 		if ((error = pmc_debugflags_parse(newstr, fence)) == 0)
 			(void) strlcpy(pmc_debugstr, newstr,
 			    sizeof(pmc_debugstr));
 	}
 
 	free(newstr, M_PMC);
 
 	return error;
 }
 #endif
 
 /*
  * Map a row index to a classdep structure and return the adjusted row
  * index for the PMC class index.
  */
 static struct pmc_classdep *
 pmc_ri_to_classdep(struct pmc_mdep *md, int ri, int *adjri)
 {
 	struct pmc_classdep *pcd;
 
 	(void) md;
 
 	KASSERT(ri >= 0 && ri < md->pmd_npmc,
 	    ("[pmc,%d] illegal row-index %d", __LINE__, ri));
 
 	pcd = pmc_rowindex_to_classdep[ri];
 
 	KASSERT(pcd != NULL,
 	    ("[pmc,%d] ri %d null pcd", __LINE__, ri));
 
 	*adjri = ri - pcd->pcd_ri;
 
 	KASSERT(*adjri >= 0 && *adjri < pcd->pcd_num,
 	    ("[pmc,%d] adjusted row-index %d", __LINE__, *adjri));
 
 	return (pcd);
 }
 
 /*
  * Concurrency Control
  *
  * The driver manages the following data structures:
  *
  *   - target process descriptors, one per target process
  *   - owner process descriptors (and attached lists), one per owner process
  *   - lookup hash tables for owner and target processes
  *   - PMC descriptors (and attached lists)
  *   - per-cpu hardware state
  *   - the 'hook' variable through which the kernel calls into
  *     this module
  *   - the machine hardware state (managed by the MD layer)
  *
  * These data structures are accessed from:
  *
  * - thread context-switch code
  * - interrupt handlers (possibly on multiple cpus)
  * - kernel threads on multiple cpus running on behalf of user
  *   processes doing system calls
  * - this driver's private kernel threads
  *
  * = Locks and Locking strategy =
  *
  * The driver uses four locking strategies for its operation:
  *
  * - The global SX lock "pmc_sx" is used to protect internal
  *   data structures.
  *
  *   Calls into the module by syscall() start with this lock being
  *   held in exclusive mode.  Depending on the requested operation,
  *   the lock may be downgraded to 'shared' mode to allow more
  *   concurrent readers into the module.  Calls into the module from
  *   other parts of the kernel acquire the lock in shared mode.
  *
  *   This SX lock is held in exclusive mode for any operations that
  *   modify the linkages between the driver's internal data structures.
  *
  *   The 'pmc_hook' function pointer is also protected by this lock.
  *   It is only examined with the sx lock held in exclusive mode.  The
  *   kernel module is allowed to be unloaded only with the sx lock held
  *   in exclusive mode.  In normal syscall handling, after acquiring the
  *   pmc_sx lock we first check that 'pmc_hook' is non-null before
  *   proceeding.  This prevents races between the thread unloading the module
  *   and other threads seeking to use the module.
  *
  * - Lookups of target process structures and owner process structures
  *   cannot use the global "pmc_sx" SX lock because these lookups need
  *   to happen during context switches and in other critical sections
  *   where sleeping is not allowed.  We protect these lookup tables
  *   with their own private spin-mutexes, "pmc_processhash_mtx" and
  *   "pmc_ownerhash_mtx".
  *
  * - Interrupt handlers work in a lock free manner.  At interrupt
  *   time, handlers look at the PMC pointer (phw->phw_pmc) configured
  *   when the PMC was started.  If this pointer is NULL, the interrupt
  *   is ignored after updating driver statistics.  We ensure that this
  *   pointer is set (using an atomic operation if necessary) before the
  *   PMC hardware is started.  Conversely, this pointer is unset atomically
  *   only after the PMC hardware is stopped.
  *
  *   We ensure that everything needed for the operation of an
  *   interrupt handler is available without it needing to acquire any
  *   locks.  We also ensure that a PMC's software state is destroyed only
  *   after the PMC is taken off hardware (on all CPUs).
  *
  * - Context-switch handling with process-private PMCs needs more
  *   care.
  *
  *   A given process may be the target of multiple PMCs.  For example,
  *   PMCATTACH and PMCDETACH may be requested by a process on one CPU
  *   while the target process is running on another.  A PMC could also
  *   be getting released because its owner is exiting.  We tackle
  *   these situations in the following manner:
  *
  *   - each target process structure 'pmc_process' has an array
  *     of 'struct pmc *' pointers, one for each hardware PMC.
  *
  *   - At context switch IN time, each "target" PMC in RUNNING state
  *     gets started on hardware and a pointer to each PMC is copied into
  *     the per-cpu phw array.  The 'runcount' for the PMC is
  *     incremented.
  *
  *   - At context switch OUT time, all process-virtual PMCs are stopped
  *     on hardware.  The saved value is added to the PMC's value field
  *     only if the PMC is in a non-deleted state (the PMC's state could
  *     have changed during the current time slice).
  *
  *     Note that in between a switch IN on a processor and a switch
  *     OUT, the PMC could have been released on another CPU.  Therefore
  *     context switch OUT always looks at the hardware state to turn
  *     OFF PMCs and will update a PMC's saved value only if reachable
  *     from the target process record.
  *
  *   - OP PMCRELEASE could be called on a PMC at any time (the PMC could
  *     be attached to many processes at the time of the call and could
  *     be active on multiple CPUs).
  *
  *     We prevent further scheduling of the PMC by marking it as in
  *     state 'DELETED'.  If the runcount of the PMC is non-zero then
  *     this PMC is currently running on a CPU somewhere.  The thread
  *     doing the PMCRELEASE operation waits by repeatedly doing a
  *     pause() till the runcount comes to zero.
  *
  * The contents of a PMC descriptor (struct pmc) are protected using
  * a spin-mutex.  In order to save space, we use a mutex pool.
  *
  * In terms of lock types used by witness(4), we use:
  * - Type "pmc-sx", used by the global SX lock.
  * - Type "pmc-sleep", for sleep mutexes used by logger threads.
  * - Type "pmc-per-proc", for protecting PMC owner descriptors.
  * - Type "pmc-leaf", used for all other spin mutexes.
  */
 
 /*
  * save the cpu binding of the current kthread
  */
 
 static void
 pmc_save_cpu_binding(struct pmc_binding *pb)
 {
 	PMCDBG(CPU,BND,2, "%s", "save-cpu");
 	thread_lock(curthread);
 	pb->pb_bound = sched_is_bound(curthread);
 	pb->pb_cpu   = curthread->td_oncpu;
 	thread_unlock(curthread);
 	PMCDBG(CPU,BND,2, "save-cpu cpu=%d", pb->pb_cpu);
 }
 
 /*
  * restore the cpu binding of the current thread
  */
 
 static void
 pmc_restore_cpu_binding(struct pmc_binding *pb)
 {
 	PMCDBG(CPU,BND,2, "restore-cpu curcpu=%d restore=%d",
 	    curthread->td_oncpu, pb->pb_cpu);
 	thread_lock(curthread);
 	if (pb->pb_bound)
 		sched_bind(curthread, pb->pb_cpu);
 	else
 		sched_unbind(curthread);
 	thread_unlock(curthread);
 	PMCDBG(CPU,BND,2, "%s", "restore-cpu done");
 }
 
 /*
  * move execution over to the specified cpu and bind it there.
  */
 
 static void
 pmc_select_cpu(int cpu)
 {
 	KASSERT(cpu >= 0 && cpu < pmc_cpu_max(),
 	    ("[pmc,%d] bad cpu number %d", __LINE__, cpu));
 
 	/* Never move to an inactive CPU. */
 	KASSERT(pmc_cpu_is_active(cpu), ("[pmc,%d] selecting inactive "
 	    "CPU %d", __LINE__, cpu));
 
 	PMCDBG(CPU,SEL,2, "select-cpu cpu=%d", cpu);
 	thread_lock(curthread);
 	sched_bind(curthread, cpu);
 	thread_unlock(curthread);
 
 	KASSERT(curthread->td_oncpu == cpu,
 	    ("[pmc,%d] CPU not bound [cpu=%d, curr=%d]", __LINE__,
 		cpu, curthread->td_oncpu));
 
 	PMCDBG(CPU,SEL,2, "select-cpu cpu=%d ok", cpu);
 }
 
 /*
  * Force a context switch.
  *
  * We do this by pause'ing for 1 tick -- invoking mi_switch() is not
  * guaranteed to force a context switch.
  */
 
 static void
 pmc_force_context_switch(void)
 {
 
 	pause("pmcctx", 1);
 }
 
 /*
  * Get the file name for an executable.  This is a simple wrapper
  * around vn_fullpath(9).
  */
 
 static void
 pmc_getfilename(struct vnode *v, char **fullpath, char **freepath)
 {
 
 	*fullpath = "unknown";
 	*freepath = NULL;
 	vn_fullpath(curthread, v, fullpath, freepath);
 }
 
 /*
  * remove a process owning PMCs
  */
 
 static void
 pmc_remove_owner(struct pmc_owner *po)
 {
 	struct pmc *pm, *tmp;
 
 	sx_assert(&pmc_sx, SX_XLOCKED);
 
 	PMCDBG(OWN,ORM,1, "remove-owner po=%p", po);
 
 	/* Remove descriptor from the owner hash table */
 	LIST_REMOVE(po, po_next);
 
 	/* release all owned PMC descriptors */
 	LIST_FOREACH_SAFE(pm, &po->po_pmcs, pm_next, tmp) {
 		PMCDBG(OWN,ORM,2, "pmc=%p", pm);
 		KASSERT(pm->pm_owner == po,
 		    ("[pmc,%d] owner %p != po %p", __LINE__, pm->pm_owner, po));
 
 		pmc_release_pmc_descriptor(pm);	/* will unlink from the list */
 	}
 
 	KASSERT(po->po_sscount == 0,
 	    ("[pmc,%d] SS count not zero", __LINE__));
 	KASSERT(LIST_EMPTY(&po->po_pmcs),
 	    ("[pmc,%d] PMC list not empty", __LINE__));
 
 	/* de-configure the log file if present */
 	if (po->po_flags & PMC_PO_OWNS_LOGFILE)
 		pmclog_deconfigure_log(po);
 }
 
 /*
  * remove an owner process record if all conditions are met.
  */
 
 static void
 pmc_maybe_remove_owner(struct pmc_owner *po)
 {
 
 	PMCDBG(OWN,OMR,1, "maybe-remove-owner po=%p", po);
 
 	/*
 	 * Remove owner record if
 	 * - this process does not own any PMCs
 	 * - this process has not allocated a system-wide sampling buffer
 	 */
 
 	if (LIST_EMPTY(&po->po_pmcs) &&
 	    ((po->po_flags & PMC_PO_OWNS_LOGFILE) == 0)) {
 		pmc_remove_owner(po);
 		pmc_destroy_owner_descriptor(po);
 	}
 }
 
 /*
  * Add an association between a target process and a PMC.
  */
 
 static void
 pmc_link_target_process(struct pmc *pm, struct pmc_process *pp)
 {
 	int ri;
 	struct pmc_target *pt;
 
 	sx_assert(&pmc_sx, SX_XLOCKED);
 
 	KASSERT(pm != NULL && pp != NULL,
 	    ("[pmc,%d] Null pm %p or pp %p", __LINE__, pm, pp));
 	KASSERT(PMC_IS_VIRTUAL_MODE(PMC_TO_MODE(pm)),
 	    ("[pmc,%d] Attaching a non-process-virtual pmc=%p to pid=%d",
 		__LINE__, pm, pp->pp_proc->p_pid));
 	KASSERT(pp->pp_refcnt >= 0 && pp->pp_refcnt <= ((int) md->pmd_npmc - 1),
 	    ("[pmc,%d] Illegal reference count %d for process record %p",
 		__LINE__, pp->pp_refcnt, (void *) pp));
 
 	ri = PMC_TO_ROWINDEX(pm);
 
 	PMCDBG(PRC,TLK,1, "link-target pmc=%p ri=%d pmc-process=%p",
 	    pm, ri, pp);
 
 #ifdef	DEBUG
 	LIST_FOREACH(pt, &pm->pm_targets, pt_next)
 	    if (pt->pt_process == pp)
 		    KASSERT(0, ("[pmc,%d] pp %p already in pmc %p targets",
 				__LINE__, pp, pm));
 #endif
 
 	pt = malloc(sizeof(struct pmc_target), M_PMC, M_WAITOK|M_ZERO);
 	pt->pt_process = pp;
 
 	LIST_INSERT_HEAD(&pm->pm_targets, pt, pt_next);
 
 	atomic_store_rel_ptr((uintptr_t *)&pp->pp_pmcs[ri].pp_pmc,
 	    (uintptr_t)pm);
 
 	if (pm->pm_owner->po_owner == pp->pp_proc)
 		pm->pm_flags |= PMC_F_ATTACHED_TO_OWNER;
 
 	/*
 	 * Initialize the per-process values at this row index.
 	 */
 	pp->pp_pmcs[ri].pp_pmcval = PMC_TO_MODE(pm) == PMC_MODE_TS ?
 	    pm->pm_sc.pm_reloadcount : 0;
 
 	pp->pp_refcnt++;
 
 }
 
 /*
  * Removes the association between a target process and a PMC.
  */
 
 static void
 pmc_unlink_target_process(struct pmc *pm, struct pmc_process *pp)
 {
 	int ri;
 	struct proc *p;
 	struct pmc_target *ptgt;
 
 	sx_assert(&pmc_sx, SX_XLOCKED);
 
 	KASSERT(pm != NULL && pp != NULL,
 	    ("[pmc,%d] Null pm %p or pp %p", __LINE__, pm, pp));
 
 	KASSERT(pp->pp_refcnt >= 1 && pp->pp_refcnt <= (int) md->pmd_npmc,
 	    ("[pmc,%d] Illegal ref count %d on process record %p",
 		__LINE__, pp->pp_refcnt, (void *) pp));
 
 	ri = PMC_TO_ROWINDEX(pm);
 
 	PMCDBG(PRC,TUL,1, "unlink-target pmc=%p ri=%d pmc-process=%p",
 	    pm, ri, pp);
 
 	KASSERT(pp->pp_pmcs[ri].pp_pmc == pm,
 	    ("[pmc,%d] PMC ri %d mismatch pmc %p pp->[ri] %p", __LINE__,
 		ri, pm, pp->pp_pmcs[ri].pp_pmc));
 
 	pp->pp_pmcs[ri].pp_pmc = NULL;
 	pp->pp_pmcs[ri].pp_pmcval = (pmc_value_t) 0;
 
 	/* Remove owner-specific flags */
 	if (pm->pm_owner->po_owner == pp->pp_proc) {
 		pp->pp_flags &= ~PMC_PP_ENABLE_MSR_ACCESS;
 		pm->pm_flags &= ~PMC_F_ATTACHED_TO_OWNER;
 	}
 
 	pp->pp_refcnt--;
 
 	/* Remove the target process from the PMC structure */
 	LIST_FOREACH(ptgt, &pm->pm_targets, pt_next)
 		if (ptgt->pt_process == pp)
 			break;
 
 	KASSERT(ptgt != NULL, ("[pmc,%d] process %p (pp: %p) not found "
 		    "in pmc %p", __LINE__, pp->pp_proc, pp, pm));
 
 	LIST_REMOVE(ptgt, pt_next);
 	free(ptgt, M_PMC);
 
 	/* if the PMC now lacks targets, send the owner a SIGIO */
 	if (LIST_EMPTY(&pm->pm_targets)) {
 		p = pm->pm_owner->po_owner;
 		PROC_LOCK(p);
 		psignal(p, SIGIO);
 		PROC_UNLOCK(p);
 
 		PMCDBG(PRC,SIG,2, "signalling proc=%p signal=%d", p,
 		    SIGIO);
 	}
 }
 
 /*
  * Check if PMC 'pm' may be attached to target process 't'.
  */
 
 static int
 pmc_can_attach(struct pmc *pm, struct proc *t)
 {
 	struct proc *o;		/* pmc owner */
 	struct ucred *oc, *tc;	/* owner, target credentials */
 	int decline_attach, i;
 
 	/*
 	 * A PMC's owner can always attach that PMC to itself.
 	 */
 
 	if ((o = pm->pm_owner->po_owner) == t)
 		return 0;
 
 	PROC_LOCK(o);
 	oc = o->p_ucred;
 	crhold(oc);
 	PROC_UNLOCK(o);
 
 	PROC_LOCK(t);
 	tc = t->p_ucred;
 	crhold(tc);
 	PROC_UNLOCK(t);
 
 	/*
 	 * The effective uid of the PMC owner should match at least one
 	 * of the {effective,real,saved} uids of the target process.
 	 */
 
 	decline_attach = oc->cr_uid != tc->cr_uid &&
 	    oc->cr_uid != tc->cr_svuid &&
 	    oc->cr_uid != tc->cr_ruid;
 
 	/*
 	 * Every one of the target's group ids must be in the owner's
 	 * group list.
 	 */
 	for (i = 0; !decline_attach && i < tc->cr_ngroups; i++)
 		decline_attach = !groupmember(tc->cr_groups[i], oc);
 
 	/* check the real and saved gids too */
 	if (decline_attach == 0)
 		decline_attach = !groupmember(tc->cr_rgid, oc) ||
 		    !groupmember(tc->cr_svgid, oc);
 
 	crfree(tc);
 	crfree(oc);
 
 	return !decline_attach;
 }
 
 /*
  * Attach a process to a PMC.
  */
 
 static int
 pmc_attach_one_process(struct proc *p, struct pmc *pm)
 {
 	int ri;
 	char *fullpath, *freepath;
 	struct pmc_process	*pp;
 
 	sx_assert(&pmc_sx, SX_XLOCKED);
 
 	PMCDBG(PRC,ATT,2, "attach-one pm=%p ri=%d proc=%p (%d, %s)", pm,
 	    PMC_TO_ROWINDEX(pm), p, p->p_pid, p->p_comm);
 
 	/*
 	 * Locate the process descriptor corresponding to process 'p',
 	 * allocating space as needed.
 	 *
 	 * Verify that rowindex 'pm_rowindex' is free in the process
 	 * descriptor.
 	 *
 	 * If it is free, link the process descriptor and the PMC;
 	 * otherwise return EEXIST or EBUSY.
 	 */
 	ri = PMC_TO_ROWINDEX(pm);
 
 	if ((pp = pmc_find_process_descriptor(p, PMC_FLAG_ALLOCATE)) == NULL)
 		return ENOMEM;
 
 	if (pp->pp_pmcs[ri].pp_pmc == pm) /* already present at slot [ri] */
 		return EEXIST;
 
 	if (pp->pp_pmcs[ri].pp_pmc != NULL)
 		return EBUSY;
 
 	pmc_link_target_process(pm, pp);
 
 	if (PMC_IS_SAMPLING_MODE(PMC_TO_MODE(pm)) &&
 	    (pm->pm_flags & PMC_F_ATTACHED_TO_OWNER) == 0)
 		pm->pm_flags |= PMC_F_NEEDS_LOGFILE;
 
 	pm->pm_flags |= PMC_F_ATTACH_DONE; /* mark as attached */
 
 	/* issue an attach event to a configured log file */
 	if (pm->pm_owner->po_flags & PMC_PO_OWNS_LOGFILE) {
 		pmc_getfilename(p->p_textvp, &fullpath, &freepath);
 		if (p->p_flag & P_KTHREAD) {
 			fullpath = kernelname;
 			freepath = NULL;
 		} else
 			pmclog_process_pmcattach(pm, p->p_pid, fullpath);
 		if (freepath)
 			free(freepath, M_TEMP);
 		if (PMC_IS_SAMPLING_MODE(PMC_TO_MODE(pm)))
 			pmc_log_process_mappings(pm->pm_owner, p);
 	}
 	/* mark process as using HWPMCs */
 	PROC_LOCK(p);
 	p->p_flag |= P_HWPMC;
 	PROC_UNLOCK(p);
 
 	return 0;
 }
 
 /*
  * Attach a process and optionally its children
  */
 
 static int
 pmc_attach_process(struct proc *p, struct pmc *pm)
 {
 	int error;
 	struct proc *top;
 
 	sx_assert(&pmc_sx, SX_XLOCKED);
 
 	PMCDBG(PRC,ATT,1, "attach pm=%p ri=%d proc=%p (%d, %s)", pm,
 	    PMC_TO_ROWINDEX(pm), p, p->p_pid, p->p_comm);
 
 
 	/*
 	 * If this PMC successfully allowed a GETMSR operation
 	 * in the past, disallow further ATTACHes.
 	 */
 
 	if ((pm->pm_flags & PMC_PP_ENABLE_MSR_ACCESS) != 0)
 		return EPERM;
 
 	if ((pm->pm_flags & PMC_F_DESCENDANTS) == 0)
 		return pmc_attach_one_process(p, pm);
 
 	/*
 	 * Traverse all child processes, attaching them to
 	 * this PMC.
 	 */
 
 	sx_slock(&proctree_lock);
 
 	top = p;
 
 	for (;;) {
 		if ((error = pmc_attach_one_process(p, pm)) != 0)
 			break;
 		if (!LIST_EMPTY(&p->p_children))
 			p = LIST_FIRST(&p->p_children);
 		else for (;;) {
 			if (p == top)
 				goto done;
 			if (LIST_NEXT(p, p_sibling)) {
 				p = LIST_NEXT(p, p_sibling);
 				break;
 			}
 			p = p->p_pptr;
 		}
 	}
 
 	if (error)
 		(void) pmc_detach_process(top, pm);
 
  done:
 	sx_sunlock(&proctree_lock);
 	return error;
 }
 
 /*
  * Detach a process from a PMC.  If there are no other PMCs tracking
  * this process, remove the process structure from its hash table.  If
  * 'flags' contains PMC_FLAG_REMOVE, then free the process structure.
  */
 
 static int
 pmc_detach_one_process(struct proc *p, struct pmc *pm, int flags)
 {
 	int ri;
 	struct pmc_process *pp;
 
 	sx_assert(&pmc_sx, SX_XLOCKED);
 
 	KASSERT(pm != NULL,
 	    ("[pmc,%d] null pm pointer", __LINE__));
 
 	ri = PMC_TO_ROWINDEX(pm);
 
 	PMCDBG(PRC,ATT,2, "detach-one pm=%p ri=%d proc=%p (%d, %s) flags=0x%x",
 	    pm, ri, p, p->p_pid, p->p_comm, flags);
 
 	if ((pp = pmc_find_process_descriptor(p, 0)) == NULL)
 		return ESRCH;
 
 	if (pp->pp_pmcs[ri].pp_pmc != pm)
 		return EINVAL;
 
 	pmc_unlink_target_process(pm, pp);
 
 	/* Issue a detach entry if a log file is configured */
 	if (pm->pm_owner->po_flags & PMC_PO_OWNS_LOGFILE)
 		pmclog_process_pmcdetach(pm, p->p_pid);
 
 	/*
 	 * If there are no PMCs targeting this process, we remove its
 	 * descriptor from the target hash table and unset the P_HWPMC
 	 * flag in the struct proc.
 	 */
 	KASSERT(pp->pp_refcnt >= 0 && pp->pp_refcnt <= (int) md->pmd_npmc,
 	    ("[pmc,%d] Illegal refcnt %d for process struct %p",
 		__LINE__, pp->pp_refcnt, pp));
 
 	if (pp->pp_refcnt != 0)	/* still a target of some PMC */
 		return 0;
 
 	pmc_remove_process_descriptor(pp);
 
 	if (flags & PMC_FLAG_REMOVE)
 		free(pp, M_PMC);
 
 	PROC_LOCK(p);
 	p->p_flag &= ~P_HWPMC;
 	PROC_UNLOCK(p);
 
 	return 0;
 }
 
 /*
  * Detach a process and optionally its descendants from a PMC.
  */
 
 static int
 pmc_detach_process(struct proc *p, struct pmc *pm)
 {
 	struct proc *top;
 
 	sx_assert(&pmc_sx, SX_XLOCKED);
 
 	PMCDBG(PRC,ATT,1, "detach pm=%p ri=%d proc=%p (%d, %s)", pm,
 	    PMC_TO_ROWINDEX(pm), p, p->p_pid, p->p_comm);
 
 	if ((pm->pm_flags & PMC_F_DESCENDANTS) == 0)
 		return pmc_detach_one_process(p, pm, PMC_FLAG_REMOVE);
 
 	/*
 	 * Traverse all children, detaching them from this PMC.  We
 	 * ignore errors since we could be detaching a PMC from a
 	 * partially attached proc tree.
 	 */
 
 	sx_slock(&proctree_lock);
 
 	top = p;
 
 	for (;;) {
 		(void) pmc_detach_one_process(p, pm, PMC_FLAG_REMOVE);
 
 		if (!LIST_EMPTY(&p->p_children))
 			p = LIST_FIRST(&p->p_children);
 		else for (;;) {
 			if (p == top)
 				goto done;
 			if (LIST_NEXT(p, p_sibling)) {
 				p = LIST_NEXT(p, p_sibling);
 				break;
 			}
 			p = p->p_pptr;
 		}
 	}
 
  done:
 	sx_sunlock(&proctree_lock);
 
 	if (LIST_EMPTY(&pm->pm_targets))
 		pm->pm_flags &= ~PMC_F_ATTACH_DONE;
 
 	return 0;
 }
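
The loop in pmc_detach_process() walks a process and all of its descendants without recursion, using only first-child, next-sibling and parent links. A minimal userland sketch of that traversal follows; `struct node` and `visit_subtree()` are hypothetical stand-ins for `struct proc` and the detach call, and no locking is modelled.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct proc: only the tree links are modelled. */
struct node {
	struct node *parent;		/* plays the role of p_pptr */
	struct node *first_child;	/* plays the role of LIST_FIRST(&p->p_children) */
	struct node *next_sibling;	/* plays the role of LIST_NEXT(p, p_sibling) */
	int visited;
};

/* Visit 'top' and every descendant in pre-order; return the visit count. */
static int
visit_subtree(struct node *top)
{
	struct node *p = top;
	int count = 0;

	assert(top != NULL);
	for (;;) {
		p->visited = 1;
		count++;

		if (p->first_child != NULL) {
			p = p->first_child;
			continue;
		}
		/* Leaf: climb until a sibling exists or we are back at 'top'. */
		for (;;) {
			if (p == top)
				return (count);
			if (p->next_sibling != NULL) {
				p = p->next_sibling;
				break;
			}
			p = p->parent;
		}
	}
}
```

Checking `p == top` before looking for a sibling is what keeps the walk inside the subtree: siblings of the starting node are never visited.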
 
 
 /*
  * Thread context switch IN
  */
 
 static void
 pmc_process_csw_in(struct thread *td)
 {
 	int cpu;
 	unsigned int adjri, ri;
 	struct pmc *pm;
 	struct proc *p;
 	struct pmc_cpu *pc;
 	struct pmc_hw *phw;
 	pmc_value_t newvalue;
 	struct pmc_process *pp;
 	struct pmc_classdep *pcd;
 
 	p = td->td_proc;
 
 	if ((pp = pmc_find_process_descriptor(p, PMC_FLAG_NONE)) == NULL)
 		return;
 
 	KASSERT(pp->pp_proc == td->td_proc,
 	    ("[pmc,%d] not my thread state", __LINE__));
 
 	critical_enter(); /* no preemption from this point */
 
 	cpu = PCPU_GET(cpuid); /* td->td_oncpu is invalid */
 
 	PMCDBG(CSW,SWI,1, "cpu=%d proc=%p (%d, %s) pp=%p", cpu, p,
 	    p->p_pid, p->p_comm, pp);
 
 	KASSERT(cpu >= 0 && cpu < pmc_cpu_max(),
 	    ("[pmc,%d] weird CPU id %d", __LINE__, cpu));
 
 	pc = pmc_pcpu[cpu];
 
 	for (ri = 0; ri < md->pmd_npmc; ri++) {
 
 		if ((pm = pp->pp_pmcs[ri].pp_pmc) == NULL)
 			continue;
 
 		KASSERT(PMC_IS_VIRTUAL_MODE(PMC_TO_MODE(pm)),
 		    ("[pmc,%d] Target PMC in non-virtual mode (%d)",
 			__LINE__, PMC_TO_MODE(pm)));
 
 		KASSERT(PMC_TO_ROWINDEX(pm) == ri,
 		    ("[pmc,%d] Row index mismatch pmc %d != ri %d",
 			__LINE__, PMC_TO_ROWINDEX(pm), ri));
 
 		/*
 		 * Only PMCs that are marked as 'RUNNING' need
 		 * be placed on hardware.
 		 */
 
 		if (pm->pm_state != PMC_STATE_RUNNING)
 			continue;
 
 		/* increment PMC runcount */
 		atomic_add_rel_int(&pm->pm_runcount, 1);
 
 		/* configure the HWPMC we are going to use. */
 		pcd = pmc_ri_to_classdep(md, ri, &adjri);
 		pcd->pcd_config_pmc(cpu, adjri, pm);
 
 		phw = pc->pc_hwpmcs[ri];
 
 		KASSERT(phw != NULL,
 		    ("[pmc,%d] null hw pointer", __LINE__));
 
 		KASSERT(phw->phw_pmc == pm,
 		    ("[pmc,%d] hw->pmc %p != pmc %p", __LINE__,
 			phw->phw_pmc, pm));
 
 		/*
 		 * Write out saved value and start the PMC.
 		 *
 		 * Sampling PMCs use a per-process value, while
 		 * counting mode PMCs use a per-pmc value that is
 		 * inherited across descendants.
 		 */
 		if (PMC_TO_MODE(pm) == PMC_MODE_TS) {
 			mtx_pool_lock_spin(pmc_mtxpool, pm);
 			newvalue = PMC_PCPU_SAVED(cpu,ri) =
 			    pp->pp_pmcs[ri].pp_pmcval;
 			mtx_pool_unlock_spin(pmc_mtxpool, pm);
 		} else {
 			KASSERT(PMC_TO_MODE(pm) == PMC_MODE_TC,
 			    ("[pmc,%d] illegal mode=%d", __LINE__,
 			    PMC_TO_MODE(pm)));
 			mtx_pool_lock_spin(pmc_mtxpool, pm);
 			newvalue = PMC_PCPU_SAVED(cpu, ri) =
 			    pm->pm_gv.pm_savedvalue;
 			mtx_pool_unlock_spin(pmc_mtxpool, pm);
 		}
 
 		PMCDBG(CSW,SWI,1,"cpu=%d ri=%d new=%jd", cpu, ri, newvalue);
 
 		pcd->pcd_write_pmc(cpu, adjri, newvalue);
 		pcd->pcd_start_pmc(cpu, adjri);
 	}
 
 	/*
 	 * perform any other architecture/cpu dependent thread
 	 * switch-in actions.
 	 */
 
 	(void) (*md->pmd_switch_in)(pc, pp);
 
 	critical_exit();
 
 }
 
 /*
  * Thread context switch OUT.
  */
 
 static void
 pmc_process_csw_out(struct thread *td)
 {
 	int cpu;
 	int64_t tmp;
 	struct pmc *pm;
 	struct proc *p;
 	enum pmc_mode mode;
 	struct pmc_cpu *pc;
 	pmc_value_t newvalue;
 	unsigned int adjri, ri;
 	struct pmc_process *pp;
 	struct pmc_classdep *pcd;
 
 
 	/*
 	 * Locate our process descriptor; this may be NULL if
 	 * this process is exiting and we have already removed
 	 * the process from the target process table.
 	 *
 	 * Note that due to kernel preemption, multiple
 	 * context switches may happen while the process is
 	 * exiting.
 	 *
 	 * Note also that if the target process cannot be
 	 * found we still need to deconfigure any PMCs that
 	 * are currently running on hardware.
 	 */
 
 	p = td->td_proc;
 	pp = pmc_find_process_descriptor(p, PMC_FLAG_NONE);
 
 	/*
 	 * save PMCs
 	 */
 
 	critical_enter();
 
 	cpu = PCPU_GET(cpuid); /* td->td_oncpu is invalid */
 
 	PMCDBG(CSW,SWO,1, "cpu=%d proc=%p (%d, %s) pp=%p", cpu, p,
 	    p->p_pid, p->p_comm, pp);
 
 	KASSERT(cpu >= 0 && cpu < pmc_cpu_max(),
 	    ("[pmc,%d] weird CPU id %d", __LINE__, cpu));
 
 	pc = pmc_pcpu[cpu];
 
 	/*
 	 * When a PMC gets unlinked from a target process, it will
 	 * be removed from the target's pp_pmcs[] array.
 	 *
 	 * However, on a MP system, the target could have been
 	 * executing on another CPU at the time of the unlink.
 	 * So, at context switch OUT time, we need to look at
 	 * the hardware to determine if a PMC is scheduled on
 	 * it.
 	 */
 
 	for (ri = 0; ri < md->pmd_npmc; ri++) {
 
 		pcd = pmc_ri_to_classdep(md, ri, &adjri);
 		pm  = NULL;
 		(void) (*pcd->pcd_get_config)(cpu, adjri, &pm);
 
 		if (pm == NULL)	/* nothing at this row index */
 			continue;
 
 		mode = PMC_TO_MODE(pm);
 		if (!PMC_IS_VIRTUAL_MODE(mode))
 			continue; /* not a process virtual PMC */
 
 		KASSERT(PMC_TO_ROWINDEX(pm) == ri,
 		    ("[pmc,%d] ri mismatch pmc(%d) ri(%d)",
 			__LINE__, PMC_TO_ROWINDEX(pm), ri));
 
 		/* Stop hardware if not already stopped */
 		if (pm->pm_stalled == 0)
 			pcd->pcd_stop_pmc(cpu, adjri);
 
 		/* reduce this PMC's runcount */
 		atomic_subtract_rel_int(&pm->pm_runcount, 1);
 
 		/*
 		 * If this PMC is associated with this process,
 		 * save the reading.
 		 */
 
 		if (pp != NULL && pp->pp_pmcs[ri].pp_pmc != NULL) {
 
 			KASSERT(pm == pp->pp_pmcs[ri].pp_pmc,
 			    ("[pmc,%d] pm %p != pp_pmcs[%d] %p", __LINE__,
 				pm, ri, pp->pp_pmcs[ri].pp_pmc));
 
 			KASSERT(pp->pp_refcnt > 0,
 			    ("[pmc,%d] pp refcnt = %d", __LINE__,
 				pp->pp_refcnt));
 
 			pcd->pcd_read_pmc(cpu, adjri, &newvalue);
 
 			tmp = newvalue - PMC_PCPU_SAVED(cpu,ri);
 
 			PMCDBG(CSW,SWO,1,"cpu=%d ri=%d tmp=%jd", cpu, ri,
 			    tmp);
 
 			if (mode == PMC_MODE_TS) {
 
 				/*
 				 * For sampling process-virtual PMCs,
 				 * we expect the count to be
 				 * decreasing as the 'value'
 				 * programmed into the PMC is the
 				 * number of events to be seen till
 				 * the next sampling interrupt.
 				 */
 				if (tmp < 0)
 					tmp += pm->pm_sc.pm_reloadcount;
 				mtx_pool_lock_spin(pmc_mtxpool, pm);
 				pp->pp_pmcs[ri].pp_pmcval -= tmp;
 				if ((int64_t) pp->pp_pmcs[ri].pp_pmcval < 0)
 					pp->pp_pmcs[ri].pp_pmcval +=
 					    pm->pm_sc.pm_reloadcount;
 				mtx_pool_unlock_spin(pmc_mtxpool, pm);
 
 			} else {
 
 				/*
 				 * For counting process-virtual PMCs,
 				 * we expect the count to be
 				 * increasing monotonically, modulo a 64
 				 * bit wraparound.
 				 */
 				KASSERT((int64_t) tmp >= 0,
 				    ("[pmc,%d] negative increment cpu=%d "
 				     "ri=%d newvalue=%jx saved=%jx "
 				     "incr=%jx", __LINE__, cpu, ri,
 				     newvalue, PMC_PCPU_SAVED(cpu,ri), tmp));
 
 				mtx_pool_lock_spin(pmc_mtxpool, pm);
 				pm->pm_gv.pm_savedvalue += tmp;
 				pp->pp_pmcs[ri].pp_pmcval += tmp;
 				mtx_pool_unlock_spin(pmc_mtxpool, pm);
 
 				if (pm->pm_flags & PMC_F_LOG_PROCCSW)
 					pmclog_process_proccsw(pm, pp, tmp);
 			}
 		}
 
 		/* mark hardware as free */
 		pcd->pcd_config_pmc(cpu, adjri, NULL);
 	}
 
 	/*
 	 * perform any other architecture/cpu dependent thread
 	 * switch out functions.
 	 */
 
 	(void) (*md->pmd_switch_out)(pc, pp);
 
 	critical_exit();
 }
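
The sampling-mode branch of pmc_process_csw_out() adjusts the per-process value by the difference between the hardware reading and the value saved at switch-in, wrapping by the reload count whenever a result goes negative. The arithmetic can be sketched in isolation as below; the function and parameter names are hypothetical, and the sketch mirrors only the integer steps in the code above, not the locking or hardware access.

```c
#include <stdint.h>

/*
 * Hypothetical helper mirroring the PMC_MODE_TS arithmetic:
 *   pmcval   - per-process value before the update
 *   saved    - value written to hardware at switch-in
 *   readback - value read from hardware at switch-out
 *   reload   - the PMC's reload count
 */
static uint64_t
sketch_update_sampling_value(uint64_t pmcval, uint64_t saved,
    uint64_t readback, uint64_t reload)
{
	int64_t delta;

	delta = (int64_t)(readback - saved);
	if (delta < 0)			/* difference went negative: wrap once */
		delta += (int64_t)reload;

	pmcval -= (uint64_t)delta;
	if ((int64_t)pmcval < 0)	/* keep the stored value non-negative */
		pmcval += reload;

	return (pmcval);
}
```

With pmcval = 1000, saved = 1000, readback = 400 and reload = 5000, the difference of -600 wraps to 4400, and the stored value 1000 - 4400 = -3400 wraps to 1600.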
 
 /*
  * Log a KLD operation.
  */
 
 static void
 pmc_process_kld_load(struct pmckern_map_in *pkm)
 {
 	struct pmc_owner *po;
 
 	sx_assert(&pmc_sx, SX_LOCKED);
 
 	/*
 	 * Notify owners of system sampling PMCs about KLD operations.
 	 */
 
 	LIST_FOREACH(po, &pmc_ss_owners, po_ssnext)
 	    if (po->po_flags & PMC_PO_OWNS_LOGFILE)
 	    	pmclog_process_map_in(po, (pid_t) -1, pkm->pm_address,
 		    (char *) pkm->pm_file);
 
 	/*
 	 * TODO: Notify owners of (all) process-sampling PMCs too.
 	 */
 
 	return;
 }
 
 static void
 pmc_process_kld_unload(struct pmckern_map_out *pkm)
 {
 	struct pmc_owner *po;
 
 	sx_assert(&pmc_sx, SX_LOCKED);
 
 	LIST_FOREACH(po, &pmc_ss_owners, po_ssnext)
 	    if (po->po_flags & PMC_PO_OWNS_LOGFILE)
 		pmclog_process_map_out(po, (pid_t) -1,
 		    pkm->pm_address, pkm->pm_address + pkm->pm_size);
 
 	/*
 	 * TODO: Notify owners of process-sampling PMCs.
 	 */
 }
 
 /*
  * A mapping change for a process.
  */
 
 static void
 pmc_process_mmap(struct thread *td, struct pmckern_map_in *pkm)
 {
 	int ri;
 	pid_t pid;
 	char *fullpath, *freepath;
 	const struct pmc *pm;
 	struct pmc_owner *po;
 	const struct pmc_process *pp;
 
 	freepath = fullpath = NULL;
 	pmc_getfilename((struct vnode *) pkm->pm_file, &fullpath, &freepath);
 
 	pid = td->td_proc->p_pid;
 
 	/* Inform owners of all system-wide sampling PMCs. */
 	LIST_FOREACH(po, &pmc_ss_owners, po_ssnext)
 	    if (po->po_flags & PMC_PO_OWNS_LOGFILE)
 		pmclog_process_map_in(po, pid, pkm->pm_address, fullpath);
 
 	if ((pp = pmc_find_process_descriptor(td->td_proc, 0)) == NULL)
 		goto done;
 
 	/*
 	 * Inform sampling PMC owners tracking this process.
 	 */
 	for (ri = 0; ri < md->pmd_npmc; ri++)
 		if ((pm = pp->pp_pmcs[ri].pp_pmc) != NULL &&
 		    PMC_IS_SAMPLING_MODE(PMC_TO_MODE(pm)))
 			pmclog_process_map_in(pm->pm_owner,
 			    pid, pkm->pm_address, fullpath);
 
   done:
 	if (freepath)
 		free(freepath, M_TEMP);
 }
 
 
 /*
  * Log an munmap request.
  */
 
 static void
 pmc_process_munmap(struct thread *td, struct pmckern_map_out *pkm)
 {
 	int ri;
 	pid_t pid;
 	struct pmc_owner *po;
 	const struct pmc *pm;
 	const struct pmc_process *pp;
 
 	pid = td->td_proc->p_pid;
 
 	LIST_FOREACH(po, &pmc_ss_owners, po_ssnext)
 	    if (po->po_flags & PMC_PO_OWNS_LOGFILE)
 		pmclog_process_map_out(po, pid, pkm->pm_address,
 		    pkm->pm_address + pkm->pm_size);
 
 	if ((pp = pmc_find_process_descriptor(td->td_proc, 0)) == NULL)
 		return;
 
 	for (ri = 0; ri < md->pmd_npmc; ri++)
 		if ((pm = pp->pp_pmcs[ri].pp_pmc) != NULL &&
 		    PMC_IS_SAMPLING_MODE(PMC_TO_MODE(pm)))
 			pmclog_process_map_out(pm->pm_owner, pid,
 			    pkm->pm_address, pkm->pm_address + pkm->pm_size);
 }
 
 /*
  * Log mapping information about the kernel.
  */
 
 static void
 pmc_log_kernel_mappings(struct pmc *pm)
 {
 	struct pmc_owner *po;
 	struct pmckern_map_in *km, *kmbase;
 
 	sx_assert(&pmc_sx, SX_LOCKED);
 	KASSERT(PMC_IS_SAMPLING_MODE(PMC_TO_MODE(pm)),
 	    ("[pmc,%d] non-sampling PMC (%p) desires mapping information",
 		__LINE__, (void *) pm));
 
 	po = pm->pm_owner;
 
 	if (po->po_flags & PMC_PO_INITIAL_MAPPINGS_DONE)
 		return;
 
 	/*
 	 * Log the current set of kernel modules.
 	 */
 	kmbase = linker_hwpmc_list_objects();
 	for (km = kmbase; km->pm_file != NULL; km++) {
 		PMCDBG(LOG,REG,1,"%s %p", (char *) km->pm_file,
 		    (void *) km->pm_address);
 		pmclog_process_map_in(po, (pid_t) -1, km->pm_address,
 		    km->pm_file);
 	}
 	free(kmbase, M_LINKER);
 
 	po->po_flags |= PMC_PO_INITIAL_MAPPINGS_DONE;
 }
 
 /*
  * Log the mappings for a single process.
  */
 
 static void
 pmc_log_process_mappings(struct pmc_owner *po, struct proc *p)
 {
 	int locked;
 	vm_map_t map;
 	struct vnode *vp;
 	struct vmspace *vm;
 	vm_map_entry_t entry;
 	vm_offset_t last_end;
 	u_int last_timestamp;
 	struct vnode *last_vp;
 	vm_offset_t start_addr;
 	vm_object_t obj, lobj, tobj;
 	char *fullpath, *freepath;
 
 	last_vp = NULL;
 	last_end = (vm_offset_t) 0;
 	fullpath = freepath = NULL;
 
 	if ((vm = vmspace_acquire_ref(p)) == NULL)
 		return;
 
 	map = &vm->vm_map;
 	vm_map_lock_read(map);
 
 	for (entry = map->header.next; entry != &map->header; entry = entry->next) {
 
 		if (entry == NULL) {
 			PMCDBG(LOG,OPS,2, "hwpmc: vm_map entry unexpectedly "
 			    "NULL! pid=%d vm_map=%p\n", p->p_pid, map);
 			break;
 		}
 
 		/*
 		 * We only care about executable map entries.
 		 */
 		if ((entry->eflags & MAP_ENTRY_IS_SUB_MAP) ||
 		    !(entry->protection & VM_PROT_EXECUTE) ||
 		    (entry->object.vm_object == NULL)) {
 			continue;
 		}
 
 		obj = entry->object.vm_object;
 		VM_OBJECT_LOCK(obj);
 
 		/* 
 		 * Walk the backing_object list to find the base
 		 * (non-shadowed) vm_object.
 		 */
 		for (lobj = tobj = obj; tobj != NULL; tobj = tobj->backing_object) {
 			if (tobj != obj)
 				VM_OBJECT_LOCK(tobj);
 			if (lobj != obj)
 				VM_OBJECT_UNLOCK(lobj);
 			lobj = tobj;
 		}
 
 		/*
 		 * At this point lobj is the base vm_object and it is locked.
 		 */
 		if (lobj == NULL) {
 			PMCDBG(LOG,OPS,2, "hwpmc: lobj unexpectedly NULL! pid=%d "
 			    "vm_map=%p vm_obj=%p\n", p->p_pid, map, obj);
 			VM_OBJECT_UNLOCK(obj);
 			continue;
 		}
 
 		if (lobj->type != OBJT_VNODE || lobj->handle == NULL) {
 			if (lobj != obj)
 				VM_OBJECT_UNLOCK(lobj);
 			VM_OBJECT_UNLOCK(obj);
 			continue;
 		}
 
 		/*
 		 * Skip contiguous regions that point to the same
 		 * vnode, so we don't emit redundant MAP-IN
 		 * directives.
 		 */
 		if (entry->start == last_end && lobj->handle == last_vp) {
 			last_end = entry->end;
 			if (lobj != obj)
 				VM_OBJECT_UNLOCK(lobj);
 			VM_OBJECT_UNLOCK(obj);
 			continue;
 		}
 
 		/* 
 		 * We don't want to keep the proc's vm_map or this
 		 * vm_object locked while we walk the pathname, since
 		 * vn_fullpath() can sleep.  However, if we drop the
 		 * lock, it's possible for concurrent activity to
 		 * modify the vm_map list.  To protect against this,
 		 * we save the vm_map timestamp before we release the
 		 * lock, and check it after we reacquire the lock
 		 * below.
 		 */
 		start_addr = entry->start;
 		last_end = entry->end;
 		last_timestamp = map->timestamp;
 		vm_map_unlock_read(map);
 
 		vp = lobj->handle;
 		vref(vp);
 		if (lobj != obj)
 			VM_OBJECT_UNLOCK(lobj);
 
 		VM_OBJECT_UNLOCK(obj);
 
 		freepath = NULL;
 		pmc_getfilename(vp, &fullpath, &freepath);
 		last_vp = vp;
 
 		locked = VFS_LOCK_GIANT(vp->v_mount);
 		vrele(vp);
 		VFS_UNLOCK_GIANT(locked);
 
 		vp = NULL;
 		pmclog_process_map_in(po, p->p_pid, start_addr, fullpath);
 		if (freepath)
 			free(freepath, M_TEMP);
 
 		vm_map_lock_read(map);
 
 		/*
 		 * If our saved timestamp doesn't match, this means
 		 * that the vm_map was modified out from under us and
 		 * we can't trust our current "entry" pointer.  Do a
 		 * new lookup for this entry.  If there is no entry
 		 * for this address range, vm_map_lookup_entry() will
 		 * return the previous one, so we always want to go to
 		 * entry->next on the next loop iteration.
 		 * 
 		 * There is an edge condition here that can occur if
 		 * there is no entry at or before this address.  In
 		 * this situation, vm_map_lookup_entry returns
 		 * &map->header, which would cause our loop to abort
 		 * without processing the rest of the map.  However,
 		 * in practice this will never happen for process
 		 * vm_map.  This is because the executable's text
 		 * segment is the first mapping in the proc's address
 		 * space, and this mapping is never removed until the
 		 * process exits, so there will always be a non-header
 		 * entry at or before the requested address for
 		 * vm_map_lookup_entry to return.
 		 */
 		if (map->timestamp != last_timestamp)
 			vm_map_lookup_entry(map, last_end - 1, &entry);
 	}
 
 	vm_map_unlock_read(map);
 	vmspace_free(vm);
 	return;
 }
 
 /*
  * Log mappings for all processes in the system.
  */
 
 static void
 pmc_log_all_process_mappings(struct pmc_owner *po)
 {
 	struct proc *p, *top;
 
 	sx_assert(&pmc_sx, SX_XLOCKED);
 
 	if ((p = pfind(1)) == NULL)
 		panic("[pmc,%d] Cannot find init", __LINE__);
 
 	PROC_UNLOCK(p);
 
 	sx_slock(&proctree_lock);
 
 	top = p;
 
 	for (;;) {
 		pmc_log_process_mappings(po, p);
 		if (!LIST_EMPTY(&p->p_children))
 			p = LIST_FIRST(&p->p_children);
 		else for (;;) {
 			if (p == top)
 				goto done;
 			if (LIST_NEXT(p, p_sibling)) {
 				p = LIST_NEXT(p, p_sibling);
 				break;
 			}
 			p = p->p_pptr;
 		}
 	}
  done:
 	sx_sunlock(&proctree_lock);
 }
 
 /*
  * The 'hook' invoked from the kernel proper
  */
 
 
 #ifdef	DEBUG
 const char *pmc_hooknames[] = {
 	/* these strings correspond to PMC_FN_* in <sys/pmckern.h> */
 	"",
 	"EXEC",
 	"CSW-IN",
 	"CSW-OUT",
 	"SAMPLE",
 	"KLDLOAD",
 	"KLDUNLOAD",
 	"MMAP",
 	"MUNMAP",
 	"CALLCHAIN"
 };
 #endif
 
 static int
 pmc_hook_handler(struct thread *td, int function, void *arg)
 {
 
 	PMCDBG(MOD,PMH,1, "hook td=%p func=%d \"%s\" arg=%p", td, function,
 	    pmc_hooknames[function], arg);
 
 	switch (function)
 	{
 
 	/*
 	 * Process exec()
 	 */
 
 	case PMC_FN_PROCESS_EXEC:
 	{
 		char *fullpath, *freepath;
 		unsigned int ri;
 		int is_using_hwpmcs;
 		struct pmc *pm;
 		struct proc *p;
 		struct pmc_owner *po;
 		struct pmc_process *pp;
 		struct pmckern_procexec *pk;
 
 		sx_assert(&pmc_sx, SX_XLOCKED);
 
 		p = td->td_proc;
 		pmc_getfilename(p->p_textvp, &fullpath, &freepath);
 
 		pk = (struct pmckern_procexec *) arg;
 
 		/* Inform owners of SS mode PMCs of the exec event. */
 		LIST_FOREACH(po, &pmc_ss_owners, po_ssnext)
 		    if (po->po_flags & PMC_PO_OWNS_LOGFILE)
 			    pmclog_process_procexec(po, PMC_ID_INVALID,
 				p->p_pid, pk->pm_entryaddr, fullpath);
 
 		PROC_LOCK(p);
 		is_using_hwpmcs = p->p_flag & P_HWPMC;
 		PROC_UNLOCK(p);
 
 		if (!is_using_hwpmcs) {
 			if (freepath)
 				free(freepath, M_TEMP);
 			break;
 		}
 
 		/*
 		 * PMCs are not inherited across an exec():  remove any
 		 * PMCs that this process is the owner of.
 		 */
 
 		if ((po = pmc_find_owner_descriptor(p)) != NULL) {
 			pmc_remove_owner(po);
 			pmc_destroy_owner_descriptor(po);
 		}
 
 		/*
 		 * If the process being exec'ed is not the target of any
 		 * PMC, we are done.
 		 */
 		if ((pp = pmc_find_process_descriptor(p, 0)) == NULL) {
 			if (freepath)
 				free(freepath, M_TEMP);
 			break;
 		}
 
 		/*
 		 * Log the exec event to all monitoring owners.  Skip
 		 * owners who have already received the event because
 		 * they had system sampling PMCs active.
 		 */
 		for (ri = 0; ri < md->pmd_npmc; ri++)
 			if ((pm = pp->pp_pmcs[ri].pp_pmc) != NULL) {
 				po = pm->pm_owner;
 				if (po->po_sscount == 0 &&
 				    po->po_flags & PMC_PO_OWNS_LOGFILE)
 					pmclog_process_procexec(po, pm->pm_id,
 					    p->p_pid, pk->pm_entryaddr,
 					    fullpath);
 			}
 
 		if (freepath)
 			free(freepath, M_TEMP);
 
 
 		PMCDBG(PRC,EXC,1, "exec proc=%p (%d, %s) cred-changed=%d",
 		    p, p->p_pid, p->p_comm, pk->pm_credentialschanged);
 
 		if (pk->pm_credentialschanged == 0) /* no change */
 			break;
 
 		/*
 		 * If the newly exec()'ed process has a different credential
 		 * than before, allow it to be the target of a PMC only if
 		 * the PMC's owner has sufficient privilege.
 		 */
 
 		for (ri = 0; ri < md->pmd_npmc; ri++)
 			if ((pm = pp->pp_pmcs[ri].pp_pmc) != NULL)
 				if (pmc_can_attach(pm, td->td_proc) != 0)
 					pmc_detach_one_process(td->td_proc,
 					    pm, PMC_FLAG_NONE);
 
 		KASSERT(pp->pp_refcnt >= 0 && pp->pp_refcnt <= (int) md->pmd_npmc,
 		    ("[pmc,%d] Illegal ref count %d on pp %p", __LINE__,
 			pp->pp_refcnt, pp));
 
 		/*
 		 * If this process is no longer the target of any
 		 * PMCs, we can remove the process entry and free
 		 * up space.
 		 */
 
 		if (pp->pp_refcnt == 0) {
 			pmc_remove_process_descriptor(pp);
 			free(pp, M_PMC);
 			break;
 		}
 
 	}
 	break;
 
 	case PMC_FN_CSW_IN:
 		pmc_process_csw_in(td);
 		break;
 
 	case PMC_FN_CSW_OUT:
 		pmc_process_csw_out(td);
 		break;
 
 	/*
 	 * Process accumulated PC samples.
 	 *
 	 * This function is expected to be called by hardclock() for
 	 * each CPU that has accumulated PC samples.
 	 *
 	 * This function is to be executed on the CPU whose samples
 	 * are being processed.
 	 */
 	case PMC_FN_DO_SAMPLES:
 
 		/*
 		 * Clear the cpu specific bit in the CPU mask before
 		 * doing the rest of the processing.  If the NMI handler
 		 * gets invoked after the "atomic_clear_int()" call
 		 * below but before "pmc_process_samples()" gets
 		 * around to processing the interrupt, then we will
 		 * come back here at the next hardclock() tick (and
 		 * may find nothing to do if "pmc_process_samples()"
 		 * had already processed the interrupt).  We don't
 		 * lose the interrupt sample.
 		 */
-		atomic_clear_int(&pmc_cpumask, (1 << PCPU_GET(cpuid)));
+		CPU_CLR_ATOMIC(PCPU_GET(cpuid), &pmc_cpumask);
 		pmc_process_samples(PCPU_GET(cpuid));
 		break;
 
 
 	case PMC_FN_KLD_LOAD:
 		sx_assert(&pmc_sx, SX_LOCKED);
 		pmc_process_kld_load((struct pmckern_map_in *) arg);
 		break;
 
 	case PMC_FN_KLD_UNLOAD:
 		sx_assert(&pmc_sx, SX_LOCKED);
 		pmc_process_kld_unload((struct pmckern_map_out *) arg);
 		break;
 
 	case PMC_FN_MMAP:
 		sx_assert(&pmc_sx, SX_LOCKED);
 		pmc_process_mmap(td, (struct pmckern_map_in *) arg);
 		break;
 
 	case PMC_FN_MUNMAP:
 		sx_assert(&pmc_sx, SX_LOCKED);
 		pmc_process_munmap(td, (struct pmckern_map_out *) arg);
 		break;
 
 	case PMC_FN_USER_CALLCHAIN:
 		/*
 		 * Record a call chain.
 		 */
 		KASSERT(td == curthread, ("[pmc,%d] td != curthread",
 		    __LINE__));
 		pmc_capture_user_callchain(PCPU_GET(cpuid),
 		    (struct trapframe *) arg);
 		td->td_pflags &= ~TDP_CALLCHAIN;
 		break;
 
 	default:
 #ifdef	DEBUG
 		KASSERT(0, ("[pmc,%d] unknown hook %d\n", __LINE__, function));
 #endif
 		break;
 
 	}
 
 	return 0;
 }
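
The one changed line in this hunk replaces a shift-and-mask on an int with CPU_CLR_ATOMIC() on a cpuset_t. A single int mask built with `1 << cpuid` can only address as many CPUs as the int has bits, whereas a set stored as an array of words has no such limit. The following is a minimal userland sketch of that idea using C11 atomics; `SKETCH_MAXCPU` and the `sketch_*` names are hypothetical, and this is not the kernel's implementation.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>

#define	SKETCH_MAXCPU		256	/* assumed upper bound for the sketch */
#define	SKETCH_WORD_BITS	(sizeof(unsigned long) * CHAR_BIT)
#define	SKETCH_NWORDS	\
	((SKETCH_MAXCPU + SKETCH_WORD_BITS - 1) / SKETCH_WORD_BITS)

/* A CPU set stored as an array of atomically updated words. */
struct sketch_cpuset {
	atomic_ulong bits[SKETCH_NWORDS];
};

/* Atomically set the bit for 'cpu'. */
static void
sketch_cpu_set_atomic(unsigned int cpu, struct sketch_cpuset *set)
{
	assert(cpu < SKETCH_MAXCPU);
	atomic_fetch_or(&set->bits[cpu / SKETCH_WORD_BITS],
	    1UL << (cpu % SKETCH_WORD_BITS));
}

/* Atomically clear the bit for 'cpu'; other bits in the word are untouched. */
static void
sketch_cpu_clr_atomic(unsigned int cpu, struct sketch_cpuset *set)
{
	assert(cpu < SKETCH_MAXCPU);
	atomic_fetch_and(&set->bits[cpu / SKETCH_WORD_BITS],
	    ~(1UL << (cpu % SKETCH_WORD_BITS)));
}

/* Return non-zero if the bit for 'cpu' is currently set. */
static int
sketch_cpu_isset(unsigned int cpu, struct sketch_cpuset *set)
{
	assert(cpu < SKETCH_MAXCPU);
	return ((atomic_load(&set->bits[cpu / SKETCH_WORD_BITS]) >>
	    (cpu % SKETCH_WORD_BITS)) & 1UL);
}
```

Each operation touches only the word holding the requested bit, so a clear on one CPU's bit does not race with a set on a bit stored in another word.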
 
 /*
  * allocate a 'struct pmc_owner' descriptor in the owner hash table.
  */
 
 static struct pmc_owner *
 pmc_allocate_owner_descriptor(struct proc *p)
 {
 	uint32_t hindex;
 	struct pmc_owner *po;
 	struct pmc_ownerhash *poh;
 
 	hindex = PMC_HASH_PTR(p, pmc_ownerhashmask);
 	poh = &pmc_ownerhash[hindex];
 
 	/* allocate and zero a single owner descriptor struct */
 	po = malloc(sizeof(struct pmc_owner), M_PMC, M_WAITOK|M_ZERO);
 	po->po_sscount = po->po_error = po->po_flags = po->po_logprocmaps = 0;
 	po->po_file  = NULL;
 	po->po_owner = p;
 	po->po_kthread = NULL;
 	LIST_INIT(&po->po_pmcs);
 	LIST_INSERT_HEAD(poh, po, po_next); /* insert into hash table */
 
 	TAILQ_INIT(&po->po_logbuffers);
 	mtx_init(&po->po_mtx, "pmc-owner-mtx", "pmc-per-proc", MTX_SPIN);
 
 	PMCDBG(OWN,ALL,1, "allocate-owner proc=%p (%d, %s) pmc-owner=%p",
 	    p, p->p_pid, p->p_comm, po);
 
 	return po;
 }
 
 static void
 pmc_destroy_owner_descriptor(struct pmc_owner *po)
 {
 
 	PMCDBG(OWN,REL,1, "destroy-owner po=%p proc=%p (%d, %s)",
 	    po, po->po_owner, po->po_owner->p_pid, po->po_owner->p_comm);
 
 	mtx_destroy(&po->po_mtx);
 	free(po, M_PMC);
 }
 
 /*
  * find the descriptor corresponding to process 'p', adding or removing it
  * as specified by 'mode'.
  */
 
 static struct pmc_process *
 pmc_find_process_descriptor(struct proc *p, uint32_t mode)
 {
 	uint32_t hindex;
 	struct pmc_process *pp, *ppnew;
 	struct pmc_processhash *pph;
 
 	hindex = PMC_HASH_PTR(p, pmc_processhashmask);
 	pph = &pmc_processhash[hindex];
 
 	ppnew = NULL;
 
 	/*
 	 * Pre-allocate memory in the FIND_ALLOCATE case since we
 	 * cannot call malloc(9) once we hold a spin lock.
 	 */
 	if (mode & PMC_FLAG_ALLOCATE)
 		ppnew = malloc(sizeof(struct pmc_process) + md->pmd_npmc *
 		    sizeof(struct pmc_targetstate), M_PMC, M_WAITOK|M_ZERO);
 
 	mtx_lock_spin(&pmc_processhash_mtx);
 	LIST_FOREACH(pp, pph, pp_next)
 	    if (pp->pp_proc == p)
 		    break;
 
 	if ((mode & PMC_FLAG_REMOVE) && pp != NULL)
 		LIST_REMOVE(pp, pp_next);
 
 	if ((mode & PMC_FLAG_ALLOCATE) && pp == NULL &&
 	    ppnew != NULL) {
 		ppnew->pp_proc = p;
 		LIST_INSERT_HEAD(pph, ppnew, pp_next);
 		pp = ppnew;
 		ppnew = NULL;
 	}
 	mtx_unlock_spin(&pmc_processhash_mtx);
 
 	if (pp != NULL && ppnew != NULL)
 		free(ppnew, M_PMC);
 
 	return pp;
 }
 
 /*
  * remove a process descriptor from the process hash table.
  */
 
 static void
 pmc_remove_process_descriptor(struct pmc_process *pp)
 {
 	KASSERT(pp->pp_refcnt == 0,
 	    ("[pmc,%d] Removing process descriptor %p with count %d",
 		__LINE__, pp, pp->pp_refcnt));
 
 	mtx_lock_spin(&pmc_processhash_mtx);
 	LIST_REMOVE(pp, pp_next);
 	mtx_unlock_spin(&pmc_processhash_mtx);
 }
 
 
 /*
  * find an owner descriptor corresponding to proc 'p'
  */
 
 static struct pmc_owner *
 pmc_find_owner_descriptor(struct proc *p)
 {
 	uint32_t hindex;
 	struct pmc_owner *po;
 	struct pmc_ownerhash *poh;
 
 	hindex = PMC_HASH_PTR(p, pmc_ownerhashmask);
 	poh = &pmc_ownerhash[hindex];
 
 	po = NULL;
 	LIST_FOREACH(po, poh, po_next)
 	    if (po->po_owner == p)
 		    break;
 
 	PMCDBG(OWN,FND,1, "find-owner proc=%p (%d, %s) hindex=0x%x -> "
 	    "pmc-owner=%p", p, p->p_pid, p->p_comm, hindex, po);
 
 	return po;
 }
 
 /*
  * pmc_allocate_pmc_descriptor
  *
  * Allocate a pmc descriptor and initialize its
  * fields.
  */
 
 static struct pmc *
 pmc_allocate_pmc_descriptor(void)
 {
 	struct pmc *pmc;
 
 	pmc = malloc(sizeof(struct pmc), M_PMC, M_WAITOK|M_ZERO);
 
 	if (pmc != NULL) {
 		pmc->pm_owner = NULL;
 		LIST_INIT(&pmc->pm_targets);
 	}
 
 	PMCDBG(PMC,ALL,1, "allocate-pmc -> pmc=%p", pmc);
 
 	return pmc;
 }
 
 /*
  * Destroy a pmc descriptor.
  */
 
 static void
 pmc_destroy_pmc_descriptor(struct pmc *pm)
 {
 	(void) pm;
 
 #ifdef	DEBUG
 	KASSERT(pm->pm_state == PMC_STATE_DELETED ||
 	    pm->pm_state == PMC_STATE_FREE,
 	    ("[pmc,%d] destroying non-deleted PMC", __LINE__));
 	KASSERT(LIST_EMPTY(&pm->pm_targets),
 	    ("[pmc,%d] destroying pmc with targets", __LINE__));
 	KASSERT(pm->pm_owner == NULL,
 	    ("[pmc,%d] destroying pmc attached to an owner", __LINE__));
 	KASSERT(pm->pm_runcount == 0,
 	    ("[pmc,%d] pmc has non-zero run count %d", __LINE__,
 		pm->pm_runcount));
 #endif
 }
 
 static void
 pmc_wait_for_pmc_idle(struct pmc *pm)
 {
 #ifdef	DEBUG
 	volatile int maxloop;
 
 	maxloop = 100 * pmc_cpu_max();
 #endif
 
 	/*
 	 * Loop (with a forced context switch) till the PMC's runcount
 	 * comes down to zero.
 	 */
 	while (atomic_load_acq_32(&pm->pm_runcount) > 0) {
 #ifdef	DEBUG
 		maxloop--;
 		KASSERT(maxloop > 0,
 		    ("[pmc,%d] (ri%d, rc%d) waiting too long for "
 			"pmc to be free", __LINE__,
 			PMC_TO_ROWINDEX(pm), pm->pm_runcount));
 #endif
 		pmc_force_context_switch();
 	}
 }
 
 /*
  * This function does the following things:
  *
  *  - detaches the PMC from hardware
  *  - unlinks all target threads that were attached to it
  *  - removes the PMC from its owner's list
  *  - destroys the PMC private mutex
  *
  * Once this function completes, the given pmc pointer can be safely
  * FREE'd by the caller.
  */
 
 static void
 pmc_release_pmc_descriptor(struct pmc *pm)
 {
 	enum pmc_mode mode;
 	struct pmc_hw *phw;
 	u_int adjri, ri, cpu;
 	struct pmc_owner *po;
 	struct pmc_binding pb;
 	struct pmc_process *pp;
 	struct pmc_classdep *pcd;
 	struct pmc_target *ptgt, *tmp;
 
 	sx_assert(&pmc_sx, SX_XLOCKED);
 
 	KASSERT(pm, ("[pmc,%d] null pmc", __LINE__));
 
 	ri   = PMC_TO_ROWINDEX(pm);
 	pcd  = pmc_ri_to_classdep(md, ri, &adjri);
 	mode = PMC_TO_MODE(pm);
 
 	PMCDBG(PMC,REL,1, "release-pmc pmc=%p ri=%d mode=%d", pm, ri,
 	    mode);
 
 	/*
 	 * First, we take the PMC off hardware.
 	 */
 	cpu = 0;
 	if (PMC_IS_SYSTEM_MODE(mode)) {
 
 		/*
 		 * A system mode PMC runs on a specific CPU.  Switch
 		 * to this CPU and turn hardware off.
 		 */
 		pmc_save_cpu_binding(&pb);
 
 		cpu = PMC_TO_CPU(pm);
 
 		pmc_select_cpu(cpu);
 
 		/* switch off non-stalled CPUs */
 		if (pm->pm_state == PMC_STATE_RUNNING &&
 		    pm->pm_stalled == 0) {
 
 			phw = pmc_pcpu[cpu]->pc_hwpmcs[ri];
 
 			KASSERT(phw->phw_pmc == pm,
 			    ("[pmc, %d] pmc ptr ri(%d) hw(%p) pm(%p)",
 				__LINE__, ri, phw->phw_pmc, pm));
 			PMCDBG(PMC,REL,2, "stopping cpu=%d ri=%d", cpu, ri);
 
 			critical_enter();
 			pcd->pcd_stop_pmc(cpu, adjri);
 			critical_exit();
 		}
 
 		PMCDBG(PMC,REL,2, "decfg cpu=%d ri=%d", cpu, ri);
 
 		critical_enter();
 		pcd->pcd_config_pmc(cpu, adjri, NULL);
 		critical_exit();
 
 		/* adjust the global and process count of SS mode PMCs */
 		if (mode == PMC_MODE_SS && pm->pm_state == PMC_STATE_RUNNING) {
 			po = pm->pm_owner;
 			po->po_sscount--;
 			if (po->po_sscount == 0) {
 				atomic_subtract_rel_int(&pmc_ss_count, 1);
 				LIST_REMOVE(po, po_ssnext);
 			}
 		}
 
 		pm->pm_state = PMC_STATE_DELETED;
 
 		pmc_restore_cpu_binding(&pb);
 
 		/*
 		 * We could have references to this PMC structure in
 		 * the per-cpu sample queues.  Wait for the queue to
 		 * drain.
 		 */
 		pmc_wait_for_pmc_idle(pm);
 
 	} else if (PMC_IS_VIRTUAL_MODE(mode)) {
 
 		/*
 		 * A virtual PMC could be running on multiple CPUs at
 		 * a given instant.
 		 *
 		 * By marking its state as DELETED, we ensure that
 		 * this PMC is never further scheduled on hardware.
 		 *
 		 * Then we wait till all CPUs are done with this PMC.
 		 */
 		pm->pm_state = PMC_STATE_DELETED;
 
 
 		/* Wait for the PMC's runcount to come to zero. */
 		pmc_wait_for_pmc_idle(pm);
 
 		/*
 		 * At this point the PMC is off all CPUs and cannot be
 		 * freshly scheduled onto a CPU.  It is now safe to
 		 * unlink all targets from this PMC.  If a
 		 * process-record's refcount falls to zero, we remove
 		 * it from the hash table.  The module-wide SX lock
 		 * protects us from races.
 		 */
 		LIST_FOREACH_SAFE(ptgt, &pm->pm_targets, pt_next, tmp) {
 			pp = ptgt->pt_process;
 			pmc_unlink_target_process(pm, pp); /* frees 'ptgt' */
 
 			PMCDBG(PMC,REL,3, "pp->refcnt=%d", pp->pp_refcnt);
 
 			/*
 			 * If the target process record shows that no
 			 * PMCs are attached to it, reclaim its space.
 			 */
 
 			if (pp->pp_refcnt == 0) {
 				pmc_remove_process_descriptor(pp);
 				free(pp, M_PMC);
 			}
 		}
 
 		cpu = curthread->td_oncpu; /* setup cpu for pmd_release() */
 
 	}
 
 	/*
 	 * Release any MD resources
 	 */
 	(void) pcd->pcd_release_pmc(cpu, adjri, pm);
 
 	/*
 	 * Update row disposition
 	 */
 
 	if (PMC_IS_SYSTEM_MODE(PMC_TO_MODE(pm)))
 		PMC_UNMARK_ROW_STANDALONE(ri);
 	else
 		PMC_UNMARK_ROW_THREAD(ri);
 
 	/* unlink from the owner's list */
 	if (pm->pm_owner) {
 		LIST_REMOVE(pm, pm_next);
 		pm->pm_owner = NULL;
 	}
 
 	pmc_destroy_pmc_descriptor(pm);
 }
 
 /*
  * Register an owner and a pmc.
  */
 
 static int
 pmc_register_owner(struct proc *p, struct pmc *pmc)
 {
 	struct pmc_owner *po;
 
 	sx_assert(&pmc_sx, SX_XLOCKED);
 
 	if ((po = pmc_find_owner_descriptor(p)) == NULL)
 		if ((po = pmc_allocate_owner_descriptor(p)) == NULL)
 			return ENOMEM;
 
 	KASSERT(pmc->pm_owner == NULL,
 	    ("[pmc,%d] attempting to own an initialized PMC", __LINE__));
 	pmc->pm_owner  = po;
 
 	LIST_INSERT_HEAD(&po->po_pmcs, pmc, pm_next);
 
 	PROC_LOCK(p);
 	p->p_flag |= P_HWPMC;
 	PROC_UNLOCK(p);
 
 	if (po->po_flags & PMC_PO_OWNS_LOGFILE)
 		pmclog_process_pmcallocate(pmc);
 
 	PMCDBG(PMC,REG,1, "register-owner pmc-owner=%p pmc=%p",
 	    po, pmc);
 
 	return 0;
 }
 
 /*
  * Return the current row disposition:
  * == 0 => FREE
  *  > 0 => PROCESS MODE
  *  < 0 => SYSTEM MODE
  */
 
 int
 pmc_getrowdisp(int ri)
 {
 	return pmc_pmcdisp[ri];
 }
 
 /*
  * Check if a PMC at row index 'ri' can be allocated to the current
  * process.
  *
  * Allocation can fail if:
  *   - the current process is already being profiled by a PMC at index 'ri',
  *     attached to it via OP_PMCATTACH.
  *   - the current process has already allocated a PMC at index 'ri'
  *     via OP_ALLOCATE.
  */
 
 static int
 pmc_can_allocate_rowindex(struct proc *p, unsigned int ri, int cpu)
 {
 	enum pmc_mode mode;
 	struct pmc *pm;
 	struct pmc_owner *po;
 	struct pmc_process *pp;
 
 	PMCDBG(PMC,ALR,1, "can-allocate-rowindex proc=%p (%d, %s) ri=%d "
 	    "cpu=%d", p, p->p_pid, p->p_comm, ri, cpu);
 
 	/*
 	 * We shouldn't have already allocated a process-mode PMC at
 	 * row index 'ri'.
 	 *
 	 * We shouldn't have allocated a system-wide PMC on the same
 	 * CPU and same RI.
 	 */
 	if ((po = pmc_find_owner_descriptor(p)) != NULL)
 		LIST_FOREACH(pm, &po->po_pmcs, pm_next) {
 		    if (PMC_TO_ROWINDEX(pm) == ri) {
 			    mode = PMC_TO_MODE(pm);
 			    if (PMC_IS_VIRTUAL_MODE(mode))
 				    return EEXIST;
 			    if (PMC_IS_SYSTEM_MODE(mode) &&
 				(int) PMC_TO_CPU(pm) == cpu)
 				    return EEXIST;
 		    }
 	        }
 
 	/*
 	 * We also shouldn't be the target of any PMC at this index
 	 * since otherwise a PMC_ATTACH to ourselves will fail.
 	 */
 	if ((pp = pmc_find_process_descriptor(p, 0)) != NULL)
 		if (pp->pp_pmcs[ri].pp_pmc)
 			return EEXIST;
 
 	PMCDBG(PMC,ALR,2, "can-allocate-rowindex proc=%p (%d, %s) ri=%d ok",
 	    p, p->p_pid, p->p_comm, ri);
 
 	return 0;
 }
 
 /*
  * Check if a given PMC at row index 'ri' can be currently used in
  * mode 'mode'.
  */
 
 static int
 pmc_can_allocate_row(int ri, enum pmc_mode mode)
 {
 	enum pmc_disp	disp;
 
 	sx_assert(&pmc_sx, SX_XLOCKED);
 
 	PMCDBG(PMC,ALR,1, "can-allocate-row ri=%d mode=%d", ri, mode);
 
 	if (PMC_IS_SYSTEM_MODE(mode))
 		disp = PMC_DISP_STANDALONE;
 	else
 		disp = PMC_DISP_THREAD;
 
 	/*
 	 * check disposition for PMC row 'ri':
 	 *
 	 * Expected disposition		Row-disposition		Result
 	 *
 	 * STANDALONE			STANDALONE or FREE	proceed
 	 * STANDALONE			THREAD			fail
 	 * THREAD			THREAD or FREE		proceed
 	 * THREAD			STANDALONE		fail
 	 */
 
 	if (!PMC_ROW_DISP_IS_FREE(ri) &&
 	    !(disp == PMC_DISP_THREAD && PMC_ROW_DISP_IS_THREAD(ri)) &&
 	    !(disp == PMC_DISP_STANDALONE && PMC_ROW_DISP_IS_STANDALONE(ri)))
 		return EBUSY;
 
 	/*
 	 * All OK
 	 */
 
 	PMCDBG(PMC,ALR,2, "can-allocate-row ri=%d mode=%d ok", ri, mode);
 
 	return 0;
 
 }
 
 /*
  * Find a PMC descriptor with user handle 'pmcid' owned by 'po'.
  */
 
 static struct pmc *
 pmc_find_pmc_descriptor_in_process(struct pmc_owner *po, pmc_id_t pmcid)
 {
 	struct pmc *pm;
 
 	KASSERT(PMC_ID_TO_ROWINDEX(pmcid) < md->pmd_npmc,
 	    ("[pmc,%d] Illegal pmc index %d (max %d)", __LINE__,
 		PMC_ID_TO_ROWINDEX(pmcid), md->pmd_npmc));
 
 	LIST_FOREACH(pm, &po->po_pmcs, pm_next)
 	    if (pm->pm_id == pmcid)
 		    return pm;
 
 	return NULL;
 }
 
 static int
 pmc_find_pmc(pmc_id_t pmcid, struct pmc **pmc)
 {
 
 	struct pmc *pm;
 	struct pmc_owner *po;
 
 	PMCDBG(PMC,FND,1, "find-pmc id=%d", pmcid);
 
 	if ((po = pmc_find_owner_descriptor(curthread->td_proc)) == NULL)
 		return ESRCH;
 
 	if ((pm = pmc_find_pmc_descriptor_in_process(po, pmcid)) == NULL)
 		return EINVAL;
 
 	PMCDBG(PMC,FND,2, "find-pmc id=%d -> pmc=%p", pmcid, pm);
 
 	*pmc = pm;
 	return 0;
 }
 
 /*
  * Start a PMC.
  */
 
 static int
 pmc_start(struct pmc *pm)
 {
 	enum pmc_mode mode;
 	struct pmc_owner *po;
 	struct pmc_binding pb;
 	struct pmc_classdep *pcd;
 	int adjri, error, cpu, ri;
 
 	KASSERT(pm != NULL,
 	    ("[pmc,%d] null pm", __LINE__));
 
 	mode = PMC_TO_MODE(pm);
 	ri   = PMC_TO_ROWINDEX(pm);
 	pcd  = pmc_ri_to_classdep(md, ri, &adjri);
 
 	error = 0;
 
 	PMCDBG(PMC,OPS,1, "start pmc=%p mode=%d ri=%d", pm, mode, ri);
 
 	po = pm->pm_owner;
 
 	/*
 	 * Disallow PMCSTART if a logfile is required but has not been
 	 * configured yet.
 	 */
 	if ((pm->pm_flags & PMC_F_NEEDS_LOGFILE) &&
 	    (po->po_flags & PMC_PO_OWNS_LOGFILE) == 0)
 		return (EDOOFUS);	/* programming error */
 
 	/*
 	 * If this is a sampling mode PMC, log mapping information for
 	 * the kernel modules that are currently loaded.
 	 */
 	if (PMC_IS_SAMPLING_MODE(PMC_TO_MODE(pm)))
 	    pmc_log_kernel_mappings(pm);
 
 	if (PMC_IS_VIRTUAL_MODE(mode)) {
 
 		/*
 		 * If a PMCATTACH has never been done on this PMC,
 		 * attach it to its owner process.
 		 */
 
 		if (LIST_EMPTY(&pm->pm_targets))
 			error = (pm->pm_flags & PMC_F_ATTACH_DONE) ? ESRCH :
 			    pmc_attach_process(po->po_owner, pm);
 
 		/*
 		 * If the PMC is attached to its owner, then force a context
 		 * switch to ensure that the MD state gets set correctly.
 		 */
 
 		if (error == 0) {
 			pm->pm_state = PMC_STATE_RUNNING;
 			if (pm->pm_flags & PMC_F_ATTACHED_TO_OWNER)
 				pmc_force_context_switch();
 		}
 
 		return (error);
 	}
 
 
 	/*
 	 * A system-wide PMC.
 	 *
 	 * Add the owner to the global list if this is a system-wide
 	 * sampling PMC.
 	 */
 
 	if (mode == PMC_MODE_SS) {
 		if (po->po_sscount == 0) {
 			LIST_INSERT_HEAD(&pmc_ss_owners, po, po_ssnext);
 			atomic_add_rel_int(&pmc_ss_count, 1);
 			PMCDBG(PMC,OPS,1, "po=%p in global list", po);
 		}
 		po->po_sscount++;
 
 		/*
 		 * Log mapping information for all existing processes in the
 		 * system.  Subsequent mappings are logged as they happen;
 		 * see pmc_process_mmap().
 		 */
 		if (po->po_logprocmaps == 0) {
 			pmc_log_all_process_mappings(po);
 			po->po_logprocmaps = 1;
 		}
 	}
 
 	/*
 	 * Move to the CPU associated with this
 	 * PMC, and start the hardware.
 	 */
 
 	pmc_save_cpu_binding(&pb);
 
 	cpu = PMC_TO_CPU(pm);
 
 	if (!pmc_cpu_is_active(cpu))
 		return (ENXIO);
 
 	pmc_select_cpu(cpu);
 
 	/*
 	 * global PMCs are configured at allocation time
 	 * so write out the initial value and start the PMC.
 	 */
 
 	pm->pm_state = PMC_STATE_RUNNING;
 
 	critical_enter();
 	if ((error = pcd->pcd_write_pmc(cpu, adjri,
 		 PMC_IS_SAMPLING_MODE(mode) ?
 		 pm->pm_sc.pm_reloadcount :
 		 pm->pm_sc.pm_initial)) == 0)
 		error = pcd->pcd_start_pmc(cpu, adjri);
 	critical_exit();
 
 	pmc_restore_cpu_binding(&pb);
 
 	return (error);
 }
 
 /*
  * Stop a PMC.
  */
 
 static int
 pmc_stop(struct pmc *pm)
 {
 	struct pmc_owner *po;
 	struct pmc_binding pb;
 	struct pmc_classdep *pcd;
 	int adjri, cpu, error, ri;
 
 	KASSERT(pm != NULL, ("[pmc,%d] null pmc", __LINE__));
 
 	PMCDBG(PMC,OPS,1, "stop pmc=%p mode=%d ri=%d", pm,
 	    PMC_TO_MODE(pm), PMC_TO_ROWINDEX(pm));
 
 	pm->pm_state = PMC_STATE_STOPPED;
 
 	/*
 	 * If the PMC is a virtual mode one, changing the state to
 	 * non-RUNNING is enough to ensure that the PMC never gets
 	 * scheduled.
 	 *
 	 * If this PMC is currently running on a CPU, then it will be
 	 * handled correctly at the time its target process is context
 	 * switched out.
 	 */
 
 	if (PMC_IS_VIRTUAL_MODE(PMC_TO_MODE(pm)))
 		return 0;
 
 	/*
 	 * A system-mode PMC.  Move to the CPU associated with
 	 * this PMC, and stop the hardware.  We update the
 	 * 'initial count' so that a subsequent PMCSTART will
 	 * resume counting from the current hardware count.
 	 */
 
 	pmc_save_cpu_binding(&pb);
 
 	cpu = PMC_TO_CPU(pm);
 
 	KASSERT(cpu >= 0 && cpu < pmc_cpu_max(),
 	    ("[pmc,%d] illegal cpu=%d", __LINE__, cpu));
 
 	if (!pmc_cpu_is_active(cpu))
 		return ENXIO;
 
 	pmc_select_cpu(cpu);
 
 	ri = PMC_TO_ROWINDEX(pm);
 	pcd = pmc_ri_to_classdep(md, ri, &adjri);
 
 	critical_enter();
 	if ((error = pcd->pcd_stop_pmc(cpu, adjri)) == 0)
 		error = pcd->pcd_read_pmc(cpu, adjri, &pm->pm_sc.pm_initial);
 	critical_exit();
 
 	pmc_restore_cpu_binding(&pb);
 
 	po = pm->pm_owner;
 
 	/* remove this owner from the global list of SS PMC owners */
 	if (PMC_TO_MODE(pm) == PMC_MODE_SS) {
 		po->po_sscount--;
 		if (po->po_sscount == 0) {
 			atomic_subtract_rel_int(&pmc_ss_count, 1);
 			LIST_REMOVE(po, po_ssnext);
 			PMCDBG(PMC,OPS,2,"po=%p removed from global list", po);
 		}
 	}
 
 	return (error);
 }
 
 
 #ifdef	DEBUG
 static const char *pmc_op_to_name[] = {
 #undef	__PMC_OP
 #define	__PMC_OP(N, D)	#N ,
 	__PMC_OPS()
 	NULL
 };
 #endif
 
 /*
  * The syscall interface
  */
 
 #define	PMC_GET_SX_XLOCK(...) do {		\
 	sx_xlock(&pmc_sx);			\
 	if (pmc_hook == NULL) {			\
 		sx_xunlock(&pmc_sx);		\
 		return __VA_ARGS__;		\
 	}					\
 } while (0)
 
 #define	PMC_DOWNGRADE_SX() do {			\
 	sx_downgrade(&pmc_sx);			\
 	is_sx_downgraded = 1;			\
 } while (0)
 
 static int
 pmc_syscall_handler(struct thread *td, void *syscall_args)
 {
 	int error, is_sx_downgraded, is_sx_locked, op;
 	struct pmc_syscall_args *c;
 	void *arg;
 
 	PMC_GET_SX_XLOCK(ENOSYS);
 
 	DROP_GIANT();
 
 	is_sx_downgraded = 0;
 	is_sx_locked = 1;
 
 	c = (struct pmc_syscall_args *) syscall_args;
 
 	op = c->pmop_code;
 	arg = c->pmop_data;
 
 	PMCDBG(MOD,PMS,1, "syscall op=%d \"%s\" arg=%p", op,
 	    pmc_op_to_name[op], arg);
 
 	error = 0;
 	atomic_add_int(&pmc_stats.pm_syscalls, 1);
 
 	switch(op)
 	{
 
 
 	/*
 	 * Configure a log file.
 	 *
 	 * XXX This OP will be reworked.
 	 */
 
 	case PMC_OP_CONFIGURELOG:
 	{
 		struct proc *p;
 		struct pmc *pm;
 		struct pmc_owner *po;
 		struct pmc_op_configurelog cl;
 
 		sx_assert(&pmc_sx, SX_XLOCKED);
 
 		if ((error = copyin(arg, &cl, sizeof(cl))) != 0)
 			break;
 
 		/* mark this process as owning a log file */
 		p = td->td_proc;
 		if ((po = pmc_find_owner_descriptor(p)) == NULL)
 			if ((po = pmc_allocate_owner_descriptor(p)) == NULL) {
 				error = ENOMEM;
 				break;
 			}
 
 		/*
 		 * If a valid fd was passed in, try to configure that,
 		 * otherwise if 'fd' was less than zero and there was
 		 * a log file configured, flush its buffers and
 		 * de-configure it.
 		 */
 		if (cl.pm_logfd >= 0) {
 			sx_xunlock(&pmc_sx);
 			is_sx_locked = 0;
 			error = pmclog_configure_log(md, po, cl.pm_logfd);
 		} else if (po->po_flags & PMC_PO_OWNS_LOGFILE) {
 			pmclog_process_closelog(po);
 			error = pmclog_flush(po);
 			if (error == 0) {
 				LIST_FOREACH(pm, &po->po_pmcs, pm_next)
 				    if (pm->pm_flags & PMC_F_NEEDS_LOGFILE &&
 					pm->pm_state == PMC_STATE_RUNNING)
 					    pmc_stop(pm);
 				error = pmclog_deconfigure_log(po);
 			}
 		} else
 			error = EINVAL;
 
 		if (error)
 			break;
 	}
 	break;
 
 
 	/*
 	 * Flush a log file.
 	 */
 
 	case PMC_OP_FLUSHLOG:
 	{
 		struct pmc_owner *po;
 
 		sx_assert(&pmc_sx, SX_XLOCKED);
 
 		if ((po = pmc_find_owner_descriptor(td->td_proc)) == NULL) {
 			error = EINVAL;
 			break;
 		}
 
 		error = pmclog_flush(po);
 	}
 	break;
 
 	/*
 	 * Retrieve hardware configuration.
 	 */
 
 	case PMC_OP_GETCPUINFO:	/* CPU information */
 	{
 		struct pmc_op_getcpuinfo gci;
 		struct pmc_classinfo *pci;
 		struct pmc_classdep *pcd;
 		int cl;
 
 		gci.pm_cputype = md->pmd_cputype;
 		gci.pm_ncpu    = pmc_cpu_max();
 		gci.pm_npmc    = md->pmd_npmc;
 		gci.pm_nclass  = md->pmd_nclass;
 		pci = gci.pm_classes;
 		pcd = md->pmd_classdep;
 		for (cl = 0; cl < md->pmd_nclass; cl++, pci++, pcd++) {
 			pci->pm_caps  = pcd->pcd_caps;
 			pci->pm_class = pcd->pcd_class;
 			pci->pm_width = pcd->pcd_width;
 			pci->pm_num   = pcd->pcd_num;
 		}
 		error = copyout(&gci, arg, sizeof(gci));
 	}
 	break;
 
 
 	/*
 	 * Get module statistics
 	 */
 
 	case PMC_OP_GETDRIVERSTATS:
 	{
 		struct pmc_op_getdriverstats gms;
 
 		bcopy(&pmc_stats, &gms, sizeof(gms));
 		error = copyout(&gms, arg, sizeof(gms));
 	}
 	break;
 
 
 	/*
 	 * Retrieve module version number
 	 */
 
 	case PMC_OP_GETMODULEVERSION:
 	{
 		uint32_t cv, modv;
 
 		/* retrieve the client's idea of the ABI version */
 		if ((error = copyin(arg, &cv, sizeof(uint32_t))) != 0)
 			break;
 		/* don't service clients newer than our driver */
 		modv = PMC_VERSION;
 		if ((cv & 0xFFFF0000) > (modv & 0xFFFF0000)) {
 			error = EPROGMISMATCH;
 			break;
 		}
 		error = copyout(&modv, arg, sizeof(int));
 	}
 	break;
 
 
 	/*
 	 * Retrieve the state of all the PMCs on a given
 	 * CPU.
 	 */
 
 	case PMC_OP_GETPMCINFO:
 	{
 		int ari;
 		struct pmc *pm;
 		size_t pmcinfo_size;
 		uint32_t cpu, n, npmc;
 		struct pmc_owner *po;
 		struct pmc_binding pb;
 		struct pmc_classdep *pcd;
 		struct pmc_info *p, *pmcinfo;
 		struct pmc_op_getpmcinfo *gpi;
 
 		PMC_DOWNGRADE_SX();
 
 		gpi = (struct pmc_op_getpmcinfo *) arg;
 
 		if ((error = copyin(&gpi->pm_cpu, &cpu, sizeof(cpu))) != 0)
 			break;
 
 		if (cpu >= pmc_cpu_max()) {
 			error = EINVAL;
 			break;
 		}
 
 		if (!pmc_cpu_is_active(cpu)) {
 			error = ENXIO;
 			break;
 		}
 
 		/* switch to CPU 'cpu' */
 		pmc_save_cpu_binding(&pb);
 		pmc_select_cpu(cpu);
 
 		npmc = md->pmd_npmc;
 
 		pmcinfo_size = npmc * sizeof(struct pmc_info);
 		pmcinfo = malloc(pmcinfo_size, M_PMC, M_WAITOK);
 
 		p = pmcinfo;
 
 		for (n = 0; n < md->pmd_npmc; n++, p++) {
 
 			pcd = pmc_ri_to_classdep(md, n, &ari);
 
 			KASSERT(pcd != NULL,
 			    ("[pmc,%d] null pcd ri=%d", __LINE__, n));
 
 			if ((error = pcd->pcd_describe(cpu, ari, p, &pm)) != 0)
 				break;
 
 			if (PMC_ROW_DISP_IS_STANDALONE(n))
 				p->pm_rowdisp = PMC_DISP_STANDALONE;
 			else if (PMC_ROW_DISP_IS_THREAD(n))
 				p->pm_rowdisp = PMC_DISP_THREAD;
 			else
 				p->pm_rowdisp = PMC_DISP_FREE;
 
 			p->pm_ownerpid = -1;
 
 			if (pm == NULL)	/* no PMC associated */
 				continue;
 
 			po = pm->pm_owner;
 
 			KASSERT(po->po_owner != NULL,
 			    ("[pmc,%d] pmc_owner had a null proc pointer",
 				__LINE__));
 
 			p->pm_ownerpid = po->po_owner->p_pid;
 			p->pm_mode     = PMC_TO_MODE(pm);
 			p->pm_event    = pm->pm_event;
 			p->pm_flags    = pm->pm_flags;
 
 			if (PMC_IS_SAMPLING_MODE(PMC_TO_MODE(pm)))
 				p->pm_reloadcount =
 				    pm->pm_sc.pm_reloadcount;
 		}
 
 		pmc_restore_cpu_binding(&pb);
 
 		/* now copy out the PMC info collected */
 		if (error == 0)
 			error = copyout(pmcinfo, &gpi->pm_pmcs, pmcinfo_size);
 
 		free(pmcinfo, M_PMC);
 	}
 	break;
 
 
 	/*
 	 * Set the administrative state of a PMC.  I.e. whether
 	 * the PMC is to be used or not.
 	 */
 
 	case PMC_OP_PMCADMIN:
 	{
 		int cpu, ri;
 		enum pmc_state request;
 		struct pmc_cpu *pc;
 		struct pmc_hw *phw;
 		struct pmc_op_pmcadmin pma;
 		struct pmc_binding pb;
 
 		sx_assert(&pmc_sx, SX_XLOCKED);
 
 		KASSERT(td == curthread,
 		    ("[pmc,%d] td != curthread", __LINE__));
 
 		error = priv_check(td, PRIV_PMC_MANAGE);
 		if (error)
 			break;
 
 		if ((error = copyin(arg, &pma, sizeof(pma))) != 0)
 			break;
 
 		cpu = pma.pm_cpu;
 
 		if (cpu < 0 || cpu >= (int) pmc_cpu_max()) {
 			error = EINVAL;
 			break;
 		}
 
 		if (!pmc_cpu_is_active(cpu)) {
 			error = ENXIO;
 			break;
 		}
 
 		request = pma.pm_state;
 
 		if (request != PMC_STATE_DISABLED &&
 		    request != PMC_STATE_FREE) {
 			error = EINVAL;
 			break;
 		}
 
 		ri = pma.pm_pmc; /* pmc id == row index */
 		if (ri < 0 || ri >= (int) md->pmd_npmc) {
 			error = EINVAL;
 			break;
 		}
 
 		/*
 		 * We can't disable a PMC with a row-index allocated
 		 * for process virtual PMCs.
 		 */
 
 		if (PMC_ROW_DISP_IS_THREAD(ri) &&
 		    request == PMC_STATE_DISABLED) {
 			error = EBUSY;
 			break;
 		}
 
 		/*
 		 * otherwise, this PMC on this CPU is either free or
 		 * in system-wide mode.
 		 */
 
 		pmc_save_cpu_binding(&pb);
 		pmc_select_cpu(cpu);
 
 		pc  = pmc_pcpu[cpu];
 		phw = pc->pc_hwpmcs[ri];
 
 		/*
 		 * XXX do we need some kind of 'forced' disable?
 		 */
 
 		if (phw->phw_pmc == NULL) {
 			if (request == PMC_STATE_DISABLED &&
 			    (phw->phw_state & PMC_PHW_FLAG_IS_ENABLED)) {
 				phw->phw_state &= ~PMC_PHW_FLAG_IS_ENABLED;
 				PMC_MARK_ROW_STANDALONE(ri);
 			} else if (request == PMC_STATE_FREE &&
 			    (phw->phw_state & PMC_PHW_FLAG_IS_ENABLED) == 0) {
 				phw->phw_state |=  PMC_PHW_FLAG_IS_ENABLED;
 				PMC_UNMARK_ROW_STANDALONE(ri);
 			}
 			/* other cases are a no-op */
 		} else
 			error = EBUSY;
 
 		pmc_restore_cpu_binding(&pb);
 	}
 	break;
 
 
 	/*
 	 * Allocate a PMC.
 	 */
 
 	case PMC_OP_PMCALLOCATE:
 	{
 		int adjri, n;
 		u_int cpu;
 		uint32_t caps;
 		struct pmc *pmc;
 		enum pmc_mode mode;
 		struct pmc_hw *phw;
 		struct pmc_binding pb;
 		struct pmc_classdep *pcd;
 		struct pmc_op_pmcallocate pa;
 
 		if ((error = copyin(arg, &pa, sizeof(pa))) != 0)
 			break;
 
 		caps = pa.pm_caps;
 		mode = pa.pm_mode;
 		cpu  = pa.pm_cpu;
 
 		if ((mode != PMC_MODE_SS  &&  mode != PMC_MODE_SC  &&
 		     mode != PMC_MODE_TS  &&  mode != PMC_MODE_TC) ||
 		    (cpu != (u_int) PMC_CPU_ANY && cpu >= pmc_cpu_max())) {
 			error = EINVAL;
 			break;
 		}
 
 		/*
 		 * Virtual PMCs should only ask for a default CPU.
 		 * System mode PMCs need to specify a non-default CPU.
 		 */
 
 		if ((PMC_IS_VIRTUAL_MODE(mode) && cpu != (u_int) PMC_CPU_ANY) ||
 		    (PMC_IS_SYSTEM_MODE(mode) && cpu == (u_int) PMC_CPU_ANY)) {
 			error = EINVAL;
 			break;
 		}
 
 		/*
 		 * Check that an inactive CPU is not being asked for.
 		 */
 
 		if (PMC_IS_SYSTEM_MODE(mode) && !pmc_cpu_is_active(cpu)) {
 			error = ENXIO;
 			break;
 		}
 
 		/*
 		 * Refuse an allocation for a system-wide PMC if this
 		 * process has been jailed, or if this process lacks
 		 * super-user credentials and the sysctl tunable
 		 * 'security.bsd.unprivileged_syspmcs' is zero.
 		 */
 
 		if (PMC_IS_SYSTEM_MODE(mode)) {
 			if (jailed(curthread->td_ucred)) {
 				error = EPERM;
 				break;
 			}
 			if (!pmc_unprivileged_syspmcs) {
 				error = priv_check(curthread,
 				    PRIV_PMC_SYSTEM);
 				if (error)
 					break;
 			}
 		}
 
 		/*
 		 * Look for valid values for 'pm_flags'
 		 */
 
 		if ((pa.pm_flags & ~(PMC_F_DESCENDANTS | PMC_F_LOG_PROCCSW |
 		    PMC_F_LOG_PROCEXIT | PMC_F_CALLCHAIN)) != 0) {
 			error = EINVAL;
 			break;
 		}
 
 		/* process logging options are not allowed for system PMCs */
 		if (PMC_IS_SYSTEM_MODE(mode) && (pa.pm_flags &
 		    (PMC_F_LOG_PROCCSW | PMC_F_LOG_PROCEXIT))) {
 			error = EINVAL;
 			break;
 		}
 
 		/*
 		 * All sampling mode PMCs need to be able to interrupt the
 		 * CPU.
 		 */
 		if (PMC_IS_SAMPLING_MODE(mode))
 			caps |= PMC_CAP_INTERRUPT;
 
 		/* A valid class specifier should have been passed in. */
 		for (n = 0; n < md->pmd_nclass; n++)
 			if (md->pmd_classdep[n].pcd_class == pa.pm_class)
 				break;
 		if (n == md->pmd_nclass) {
 			error = EINVAL;
 			break;
 		}
 
 		/* The requested PMC capabilities should be feasible. */
 		if ((md->pmd_classdep[n].pcd_caps & caps) != caps) {
 			error = EOPNOTSUPP;
 			break;
 		}
 
 		PMCDBG(PMC,ALL,2, "event=%d caps=0x%x mode=%d cpu=%d",
 		    pa.pm_ev, caps, mode, cpu);
 
 		pmc = pmc_allocate_pmc_descriptor();
 		pmc->pm_id    = PMC_ID_MAKE_ID(cpu,pa.pm_mode,pa.pm_class,
 		    PMC_ID_INVALID);
 		pmc->pm_event = pa.pm_ev;
 		pmc->pm_state = PMC_STATE_FREE;
 		pmc->pm_caps  = caps;
 		pmc->pm_flags = pa.pm_flags;
 
 		/* switch thread to CPU 'cpu' */
 		pmc_save_cpu_binding(&pb);
 
 #define	PMC_IS_SHAREABLE_PMC(cpu, n)				\
 	(pmc_pcpu[(cpu)]->pc_hwpmcs[(n)]->phw_state &		\
 	 PMC_PHW_FLAG_IS_SHAREABLE)
 #define	PMC_IS_UNALLOCATED(cpu, n)				\
 	(pmc_pcpu[(cpu)]->pc_hwpmcs[(n)]->phw_pmc == NULL)
 
 		if (PMC_IS_SYSTEM_MODE(mode)) {
 			pmc_select_cpu(cpu);
 			for (n = 0; n < (int) md->pmd_npmc; n++) {
 				pcd = pmc_ri_to_classdep(md, n, &adjri);
 				if (pmc_can_allocate_row(n, mode) == 0 &&
 				    pmc_can_allocate_rowindex(
 					    curthread->td_proc, n, cpu) == 0 &&
 				    (PMC_IS_UNALLOCATED(cpu, n) ||
 				     PMC_IS_SHAREABLE_PMC(cpu, n)) &&
 				    pcd->pcd_allocate_pmc(cpu, adjri, pmc,
 					&pa) == 0)
 					break;
 			}
 		} else {
 			/* Process virtual mode */
 			for (n = 0; n < (int) md->pmd_npmc; n++) {
 				pcd = pmc_ri_to_classdep(md, n, &adjri);
 				if (pmc_can_allocate_row(n, mode) == 0 &&
 				    pmc_can_allocate_rowindex(
 					    curthread->td_proc, n,
 					    PMC_CPU_ANY) == 0 &&
 				    pcd->pcd_allocate_pmc(curthread->td_oncpu,
 					adjri, pmc, &pa) == 0)
 					break;
 			}
 		}
 
 #undef	PMC_IS_UNALLOCATED
 #undef	PMC_IS_SHAREABLE_PMC
 
 		pmc_restore_cpu_binding(&pb);
 
 		if (n == (int) md->pmd_npmc) {
 			pmc_destroy_pmc_descriptor(pmc);
 			free(pmc, M_PMC);
 			pmc = NULL;
 			error = EINVAL;
 			break;
 		}
 
 		/* Fill in the correct value in the ID field */
 		pmc->pm_id = PMC_ID_MAKE_ID(cpu,mode,pa.pm_class,n);
 
 		PMCDBG(PMC,ALL,2, "ev=%d class=%d mode=%d n=%d -> pmcid=%x",
 		    pmc->pm_event, pa.pm_class, mode, n, pmc->pm_id);
 
 		/* Process mode PMCs with logging enabled need log files */
 		if (pmc->pm_flags & (PMC_F_LOG_PROCEXIT | PMC_F_LOG_PROCCSW))
 			pmc->pm_flags |= PMC_F_NEEDS_LOGFILE;
 
 		/* All system mode sampling PMCs require a log file */
 		if (PMC_IS_SAMPLING_MODE(mode) && PMC_IS_SYSTEM_MODE(mode))
 			pmc->pm_flags |= PMC_F_NEEDS_LOGFILE;
 
 		/*
 		 * Configure global pmc's immediately
 		 */
 
 		if (PMC_IS_SYSTEM_MODE(PMC_TO_MODE(pmc))) {
 
 			pmc_save_cpu_binding(&pb);
 			pmc_select_cpu(cpu);
 
 			phw = pmc_pcpu[cpu]->pc_hwpmcs[n];
 			pcd = pmc_ri_to_classdep(md, n, &adjri);
 
 			if ((phw->phw_state & PMC_PHW_FLAG_IS_ENABLED) == 0 ||
 			    (error = pcd->pcd_config_pmc(cpu, adjri, pmc)) != 0) {
 				(void) pcd->pcd_release_pmc(cpu, adjri, pmc);
 				pmc_destroy_pmc_descriptor(pmc);
 				free(pmc, M_PMC);
 				pmc = NULL;
 				pmc_restore_cpu_binding(&pb);
 				error = EPERM;
 				break;
 			}
 
 			pmc_restore_cpu_binding(&pb);
 		}
 
 		pmc->pm_state    = PMC_STATE_ALLOCATED;
 
 		/*
 		 * mark row disposition
 		 */
 
 		if (PMC_IS_SYSTEM_MODE(mode))
 			PMC_MARK_ROW_STANDALONE(n);
 		else
 			PMC_MARK_ROW_THREAD(n);
 
 		/*
 		 * Register this PMC with the current thread as its owner.
 		 */
 
 		if ((error =
 		    pmc_register_owner(curthread->td_proc, pmc)) != 0) {
 			pmc_release_pmc_descriptor(pmc);
 			free(pmc, M_PMC);
 			pmc = NULL;
 			break;
 		}
 
 		/*
 		 * Return the allocated index.
 		 */
 
 		pa.pm_pmcid = pmc->pm_id;
 
 		error = copyout(&pa, arg, sizeof(pa));
 	}
 	break;
 
 
 	/*
 	 * Attach a PMC to a process.
 	 */
 
 	case PMC_OP_PMCATTACH:
 	{
 		struct pmc *pm;
 		struct proc *p;
 		struct pmc_op_pmcattach a;
 
 		sx_assert(&pmc_sx, SX_XLOCKED);
 
 		if ((error = copyin(arg, &a, sizeof(a))) != 0)
 			break;
 
 		if (a.pm_pid < 0) {
 			error = EINVAL;
 			break;
 		} else if (a.pm_pid == 0)
 			a.pm_pid = td->td_proc->p_pid;
 
 		if ((error = pmc_find_pmc(a.pm_pmc, &pm)) != 0)
 			break;
 
 		if (PMC_IS_SYSTEM_MODE(PMC_TO_MODE(pm))) {
 			error = EINVAL;
 			break;
 		}
 
 		/* PMCs may be (re)attached only when allocated or stopped */
 		if (pm->pm_state == PMC_STATE_RUNNING) {
 			error = EBUSY;
 			break;
 		} else if (pm->pm_state != PMC_STATE_ALLOCATED &&
 		    pm->pm_state != PMC_STATE_STOPPED) {
 			error = EINVAL;
 			break;
 		}
 
 		/* lookup pid */
 		if ((p = pfind(a.pm_pid)) == NULL) {
 			error = ESRCH;
 			break;
 		}
 
 		/*
 		 * Ignore processes that are working on exiting.
 		 */
 		if (p->p_flag & P_WEXIT) {
 			error = ESRCH;
 			PROC_UNLOCK(p);	/* pfind() returns a locked process */
 			break;
 		}
 
 		/*
 		 * we are allowed to attach a PMC to a process if
 		 * we can debug it.
 		 */
 		error = p_candebug(curthread, p);
 
 		PROC_UNLOCK(p);
 
 		if (error == 0)
 			error = pmc_attach_process(p, pm);
 	}
 	break;
 
 
 	/*
 	 * Detach an attached PMC from a process.
 	 */
 
 	case PMC_OP_PMCDETACH:
 	{
 		struct pmc *pm;
 		struct proc *p;
 		struct pmc_op_pmcattach a;
 
 		if ((error = copyin(arg, &a, sizeof(a))) != 0)
 			break;
 
 		if (a.pm_pid < 0) {
 			error = EINVAL;
 			break;
 		} else if (a.pm_pid == 0)
 			a.pm_pid = td->td_proc->p_pid;
 
 		if ((error = pmc_find_pmc(a.pm_pmc, &pm)) != 0)
 			break;
 
 		if ((p = pfind(a.pm_pid)) == NULL) {
 			error = ESRCH;
 			break;
 		}
 
 		/*
 		 * Treat processes that are in the process of exiting
 		 * as if they were not present.
 		 */
 
 		if (p->p_flag & P_WEXIT)
 			error = ESRCH;
 
 		PROC_UNLOCK(p);	/* pfind() returns a locked process */
 
 		if (error == 0)
 			error = pmc_detach_process(p, pm);
 	}
 	break;
 
 
 	/*
 	 * Retrieve the MSR number associated with the counter
 	 * 'pmc_id'.  This allows processes to directly use RDPMC
 	 * instructions to read their PMCs, without the overhead of a
 	 * system call.
 	 */
 
 	case PMC_OP_PMCGETMSR:
 	{
 		int adjri, ri;
 		struct pmc *pm;
 		struct pmc_target *pt;
 		struct pmc_op_getmsr gm;
 		struct pmc_classdep *pcd;
 
 		PMC_DOWNGRADE_SX();
 
 		if ((error = copyin(arg, &gm, sizeof(gm))) != 0)
 			break;
 
 		if ((error = pmc_find_pmc(gm.pm_pmcid, &pm)) != 0)
 			break;
 
 		/*
 		 * The allocated PMC has to be a process virtual PMC,
 		 * i.e., of type MODE_T[CS].  Global PMCs can only be
 		 * read using the PMCREAD operation since they may be
 		 * allocated on a different CPU than the one we could
 		 * be running on at the time of the RDPMC instruction.
 		 *
 		 * The GETMSR operation is not allowed for PMCs that
 		 * are inherited across processes.
 		 */
 
 		if (!PMC_IS_VIRTUAL_MODE(PMC_TO_MODE(pm)) ||
 		    (pm->pm_flags & PMC_F_DESCENDANTS)) {
 			error = EINVAL;
 			break;
 		}
 
 		/*
 		 * It only makes sense to use a RDPMC (or its
 		 * equivalent instruction on non-x86 architectures) on
 		 * a process that has allocated and attached a PMC to
 		 * itself.  Conversely the PMC is only allowed to have
 		 * one process attached to it -- its owner.
 		 */
 
 		if ((pt = LIST_FIRST(&pm->pm_targets)) == NULL ||
 		    LIST_NEXT(pt, pt_next) != NULL ||
 		    pt->pt_process->pp_proc != pm->pm_owner->po_owner) {
 			error = EINVAL;
 			break;
 		}
 
 		ri = PMC_TO_ROWINDEX(pm);
 		pcd = pmc_ri_to_classdep(md, ri, &adjri);
 
 		/* PMC class has no 'GETMSR' support */
 		if (pcd->pcd_get_msr == NULL) {
 			error = ENOSYS;
 			break;
 		}
 
 		if ((error = (*pcd->pcd_get_msr)(adjri, &gm.pm_msr)) < 0)
 			break;
 
 		if ((error = copyout(&gm, arg, sizeof(gm))) < 0)
 			break;
 
 		/*
 		 * Mark our process as using MSRs.  Update machine
 		 * state using a forced context switch.
 		 */
 
 		pt->pt_process->pp_flags |= PMC_PP_ENABLE_MSR_ACCESS;
 		pmc_force_context_switch();
 
 	}
 	break;
 
 	/*
 	 * Release an allocated PMC
 	 */
 
 	case PMC_OP_PMCRELEASE:
 	{
 		pmc_id_t pmcid;
 		struct pmc *pm;
 		struct pmc_owner *po;
 		struct pmc_op_simple sp;
 
 		/*
 		 * Find PMC pointer for the named PMC.
 		 *
 		 * Use pmc_release_pmc_descriptor() to switch off the
 		 * PMC, remove all its target threads, and remove the
 		 * PMC from its owner's list.
 		 *
 		 * Remove the owner record if this is the last PMC
 		 * owned.
 		 *
 		 * Free up space.
 		 */
 
 		if ((error = copyin(arg, &sp, sizeof(sp))) != 0)
 			break;
 
 		pmcid = sp.pm_pmcid;
 
 		if ((error = pmc_find_pmc(pmcid, &pm)) != 0)
 			break;
 
 		po = pm->pm_owner;
 		pmc_release_pmc_descriptor(pm);
 		pmc_maybe_remove_owner(po);
 
 		free(pm, M_PMC);
 	}
 	break;
 
 
 	/*
 	 * Read and/or write a PMC.
 	 */
 
 	case PMC_OP_PMCRW:
 	{
 		int adjri;
 		struct pmc *pm;
 		uint32_t cpu, ri;
 		pmc_value_t oldvalue;
 		struct pmc_binding pb;
 		struct pmc_op_pmcrw prw;
 		struct pmc_classdep *pcd;
 		struct pmc_op_pmcrw *pprw;
 
 		PMC_DOWNGRADE_SX();
 
 		if ((error = copyin(arg, &prw, sizeof(prw))) != 0)
 			break;
 
 		ri = 0;
 		PMCDBG(PMC,OPS,1, "rw id=%d flags=0x%x", prw.pm_pmcid,
 		    prw.pm_flags);
 
 		/* must have at least one flag set */
 		if ((prw.pm_flags & (PMC_F_OLDVALUE|PMC_F_NEWVALUE)) == 0) {
 			error = EINVAL;
 			break;
 		}
 
 		/* locate pmc descriptor */
 		if ((error = pmc_find_pmc(prw.pm_pmcid, &pm)) != 0)
 			break;
 
 		/* The PMC must be in the allocated, stopped or running state. */
 		if (pm->pm_state != PMC_STATE_ALLOCATED &&
 		    pm->pm_state != PMC_STATE_STOPPED &&
 		    pm->pm_state != PMC_STATE_RUNNING) {
 			error = EINVAL;
 			break;
 		}
 
 		/* writing a new value is allowed only for 'STOPPED' pmcs */
 		if (pm->pm_state == PMC_STATE_RUNNING &&
 		    (prw.pm_flags & PMC_F_NEWVALUE)) {
 			error = EBUSY;
 			break;
 		}
 
 		if (PMC_IS_VIRTUAL_MODE(PMC_TO_MODE(pm))) {
 
 			/*
 			 * If this PMC is attached to its owner (i.e.,
 			 * the process requesting this operation) and
 			 * is running, then attempt to get an
 			 * up-to-date reading from hardware for a READ.
 			 * Writes are only allowed when the PMC is
 			 * stopped, so only update the saved value
 			 * field.
 			 *
 			 * If the PMC is not running, or is not
 			 * attached to its owner, read/write to the
 			 * savedvalue field.
 			 */
 
 			ri = PMC_TO_ROWINDEX(pm);
 			pcd = pmc_ri_to_classdep(md, ri, &adjri);
 
 			mtx_pool_lock_spin(pmc_mtxpool, pm);
 			cpu = curthread->td_oncpu;
 
 			if (prw.pm_flags & PMC_F_OLDVALUE) {
 				if ((pm->pm_flags & PMC_F_ATTACHED_TO_OWNER) &&
 				    (pm->pm_state == PMC_STATE_RUNNING))
 					error = (*pcd->pcd_read_pmc)(cpu, adjri,
 					    &oldvalue);
 				else
 					oldvalue = pm->pm_gv.pm_savedvalue;
 			}
 			if (prw.pm_flags & PMC_F_NEWVALUE)
 				pm->pm_gv.pm_savedvalue = prw.pm_value;
 
 			mtx_pool_unlock_spin(pmc_mtxpool, pm);
 
 		} else { /* System mode PMCs */
 			cpu = PMC_TO_CPU(pm);
 			ri  = PMC_TO_ROWINDEX(pm);
 			pcd = pmc_ri_to_classdep(md, ri, &adjri);
 
 			if (!pmc_cpu_is_active(cpu)) {
 				error = ENXIO;
 				break;
 			}
 
 			/* move this thread to CPU 'cpu' */
 			pmc_save_cpu_binding(&pb);
 			pmc_select_cpu(cpu);
 
 			critical_enter();
 			/* save old value */
 			if (prw.pm_flags & PMC_F_OLDVALUE)
 				if ((error = (*pcd->pcd_read_pmc)(cpu, adjri,
 					 &oldvalue)))
 					goto error;
 			/* write out new value */
 			if (prw.pm_flags & PMC_F_NEWVALUE)
 				error = (*pcd->pcd_write_pmc)(cpu, adjri,
 				    prw.pm_value);
 		error:
 			critical_exit();
 			pmc_restore_cpu_binding(&pb);
 			if (error)
 				break;
 		}
 
 		pprw = (struct pmc_op_pmcrw *) arg;
 
 #ifdef	DEBUG
 		if (prw.pm_flags & PMC_F_NEWVALUE)
 			PMCDBG(PMC,OPS,2, "rw id=%d new %jx -> old %jx",
 			    ri, prw.pm_value, oldvalue);
 		else if (prw.pm_flags & PMC_F_OLDVALUE)
 			PMCDBG(PMC,OPS,2, "rw id=%d -> old %jx", ri, oldvalue);
 #endif
 
 		/* return old value if requested */
 		if (prw.pm_flags & PMC_F_OLDVALUE)
 			if ((error = copyout(&oldvalue, &pprw->pm_value,
 				 sizeof(prw.pm_value))))
 				break;
 
 	}
 	break;
 
 
 	/*
 	 * Set the sampling rate for a sampling mode PMC and the
 	 * initial count for a counting mode PMC.
 	 */
 
 	case PMC_OP_PMCSETCOUNT:
 	{
 		struct pmc *pm;
 		struct pmc_op_pmcsetcount sc;
 
 		PMC_DOWNGRADE_SX();
 
 		if ((error = copyin(arg, &sc, sizeof(sc))) != 0)
 			break;
 
 		if ((error = pmc_find_pmc(sc.pm_pmcid, &pm)) != 0)
 			break;
 
 		if (pm->pm_state == PMC_STATE_RUNNING) {
 			error = EBUSY;
 			break;
 		}
 
 		if (PMC_IS_SAMPLING_MODE(PMC_TO_MODE(pm)))
 			pm->pm_sc.pm_reloadcount = sc.pm_count;
 		else
 			pm->pm_sc.pm_initial = sc.pm_count;
 	}
 	break;
 
 
 	/*
 	 * Start a PMC.
 	 */
 
 	case PMC_OP_PMCSTART:
 	{
 		pmc_id_t pmcid;
 		struct pmc *pm;
 		struct pmc_op_simple sp;
 
 		sx_assert(&pmc_sx, SX_XLOCKED);
 
 		if ((error = copyin(arg, &sp, sizeof(sp))) != 0)
 			break;
 
 		pmcid = sp.pm_pmcid;
 
 		if ((error = pmc_find_pmc(pmcid, &pm)) != 0)
 			break;
 
 		KASSERT(pmcid == pm->pm_id,
 		    ("[pmc,%d] pmcid %x != id %x", __LINE__,
 			pm->pm_id, pmcid));
 
 		if (pm->pm_state == PMC_STATE_RUNNING) /* already running */
 			break;
 		else if (pm->pm_state != PMC_STATE_STOPPED &&
 		    pm->pm_state != PMC_STATE_ALLOCATED) {
 			error = EINVAL;
 			break;
 		}
 
 		error = pmc_start(pm);
 	}
 	break;
 
 
 	/*
 	 * Stop a PMC.
 	 */
 
 	case PMC_OP_PMCSTOP:
 	{
 		pmc_id_t pmcid;
 		struct pmc *pm;
 		struct pmc_op_simple sp;
 
 		PMC_DOWNGRADE_SX();
 
 		if ((error = copyin(arg, &sp, sizeof(sp))) != 0)
 			break;
 
 		pmcid = sp.pm_pmcid;
 
 		/*
 		 * Mark the PMC as inactive and invoke the MD stop
 		 * routines if needed.
 		 */
 
 		if ((error = pmc_find_pmc(pmcid, &pm)) != 0)
 			break;
 
 		KASSERT(pmcid == pm->pm_id,
 		    ("[pmc,%d] pmc id %x != pmcid %x", __LINE__,
 			pm->pm_id, pmcid));
 
 		if (pm->pm_state == PMC_STATE_STOPPED) /* already stopped */
 			break;
 		else if (pm->pm_state != PMC_STATE_RUNNING) {
 			error = EINVAL;
 			break;
 		}
 
 		error = pmc_stop(pm);
 	}
 	break;
 
 
 	/*
 	 * Write a user supplied value to the log file.
 	 */
 
 	case PMC_OP_WRITELOG:
 	{
 		struct pmc_op_writelog wl;
 		struct pmc_owner *po;
 
 		PMC_DOWNGRADE_SX();
 
 		if ((error = copyin(arg, &wl, sizeof(wl))) != 0)
 			break;
 
 		if ((po = pmc_find_owner_descriptor(td->td_proc)) == NULL) {
 			error = EINVAL;
 			break;
 		}
 
 		if ((po->po_flags & PMC_PO_OWNS_LOGFILE) == 0) {
 			error = EINVAL;
 			break;
 		}
 
 		error = pmclog_process_userlog(po, &wl);
 	}
 	break;
 
 
 	default:
 		error = EINVAL;
 		break;
 	}
 
 	if (is_sx_locked != 0) {
 		if (is_sx_downgraded)
 			sx_sunlock(&pmc_sx);
 		else
 			sx_xunlock(&pmc_sx);
 	}
 
 	if (error)
 		atomic_add_int(&pmc_stats.pm_syscall_errors, 1);
 
 	PICKUP_GIANT();
 
 	return error;
 }
 
 /*
  * Helper functions
  */
 
 
 /*
  * Mark the thread as needing callchain capture and post an AST.  The
  * actual callchain capture will be done in a context where it is safe
  * to take page faults.
  */
 
 static void
 pmc_post_callchain_callback(void)
 {
 	struct thread *td;
 
 	td = curthread;
 
 	/*
 	 * If there are multiple PMCs for the same interrupt, ignore new posts.
 	 */
 	if (td->td_pflags & TDP_CALLCHAIN)
 		return;
 
 	/*
 	 * Mark this thread as needing callchain capture.
 	 * `td->td_pflags' will be safe to touch because this thread
 	 * was in user space when it was interrupted.
 	 */
 	td->td_pflags |= TDP_CALLCHAIN;
 
 	/*
 	 * Don't let this thread migrate between CPUs until callchain
 	 * capture completes.
 	 */
 	sched_pin();
 
 	return;
 }
 
 /*
  * Interrupt processing.
  *
  * Find a free slot in the per-cpu array of samples and capture the
  * current callchain there.  If a sample was successfully added, a bit
  * is set in mask 'pmc_cpumask' denoting that the DO_SAMPLES hook
  * needs to be invoked from the clock handler.
  *
  * This function is meant to be called from an NMI handler.  It cannot
  * use any of the locking primitives supplied by the OS.
  */
 
 int
 pmc_process_interrupt(int cpu, struct pmc *pm, struct trapframe *tf,
     int inuserspace)
 {
 	int error, callchaindepth;
 	struct thread *td;
 	struct pmc_sample *ps;
 	struct pmc_samplebuffer *psb;
 
 	error = 0;
 
 	/*
 	 * Allocate space for a sample buffer.
 	 */
 	psb = pmc_pcpu[cpu]->pc_sb;
 
 	ps = psb->ps_write;
 	if (ps->ps_nsamples) {	/* in use, reader hasn't caught up */
 		pm->pm_stalled = 1;
 		atomic_add_int(&pmc_stats.pm_intr_bufferfull, 1);
 		PMCDBG(SAM,INT,1,"(spc) cpu=%d pm=%p tf=%p um=%d wr=%d rd=%d",
 		    cpu, pm, (void *) tf, inuserspace,
 		    (int) (psb->ps_write - psb->ps_samples),
 		    (int) (psb->ps_read - psb->ps_samples));
 		error = ENOMEM;
 		goto done;
 	}
 
 
 	/* Fill in entry. */
 	PMCDBG(SAM,INT,1,"cpu=%d pm=%p tf=%p um=%d wr=%d rd=%d", cpu, pm,
 	    (void *) tf, inuserspace,
 	    (int) (psb->ps_write - psb->ps_samples),
 	    (int) (psb->ps_read - psb->ps_samples));
 
 	KASSERT(pm->pm_runcount >= 0,
 	    ("[pmc,%d] pm=%p runcount %d", __LINE__, (void *) pm,
 		pm->pm_runcount));
 
 	atomic_add_rel_int(&pm->pm_runcount, 1);	/* hold onto PMC */
 	ps->ps_pmc = pm;
 	if ((td = curthread) && td->td_proc)
 		ps->ps_pid = td->td_proc->p_pid;
 	else
 		ps->ps_pid = -1;
 	ps->ps_cpu = cpu;
 	ps->ps_td = td;
 	ps->ps_flags = inuserspace ? PMC_CC_F_USERSPACE : 0;
 
 	callchaindepth = (pm->pm_flags & PMC_F_CALLCHAIN) ?
 	    pmc_callchaindepth : 1;
 
 	if (callchaindepth == 1)
 		ps->ps_pc[0] = PMC_TRAPFRAME_TO_PC(tf);
 	else {
 		/*
 		 * Kernel stack traversals can be done immediately,
 		 * while we defer to an AST for user space traversals.
 		 */
 		if (!inuserspace)
 			callchaindepth =
 			    pmc_save_kernel_callchain(ps->ps_pc,
 				callchaindepth, tf);
 		else {
 			pmc_post_callchain_callback();
 			callchaindepth = PMC_SAMPLE_INUSE;
 		}
 	}
 
 	ps->ps_nsamples = callchaindepth;	/* mark entry as in use */
 
 	/* increment write pointer, modulo ring buffer size */
 	ps++;
 	if (ps == psb->ps_fence)
 		psb->ps_write = psb->ps_samples;
 	else
 		psb->ps_write = ps;
 
  done:
 	/* mark CPU as needing processing */
-	atomic_set_int(&pmc_cpumask, (1 << cpu));
+	CPU_SET_ATOMIC(cpu, &pmc_cpumask);
 
 	return (error);
 }
 
 /*
  * Capture a user call chain.  This function will be called from ast()
  * before control returns to userland and before the process gets
  * rescheduled.
  */
 
 static void
 pmc_capture_user_callchain(int cpu, struct trapframe *tf)
 {
 	int i;
 	struct pmc *pm;
 	struct thread *td;
 	struct pmc_sample *ps;
 	struct pmc_samplebuffer *psb;
 #ifdef	INVARIANTS
 	int ncallchains;
 #endif
 
 	sched_unpin();	/* Can migrate safely now. */
 
 	psb = pmc_pcpu[cpu]->pc_sb;
 	td = curthread;
 
 	KASSERT(td->td_pflags & TDP_CALLCHAIN,
 	    ("[pmc,%d] Retrieving callchain for thread that doesn't want it",
 		__LINE__));
 
 #ifdef	INVARIANTS
 	ncallchains = 0;
 #endif
 
 	/*
 	 * Iterate through all deferred callchain requests.
 	 */
 
 	ps = psb->ps_samples;
 	for (i = 0; i < pmc_nsamples; i++, ps++) {
 
 		if (ps->ps_nsamples != PMC_SAMPLE_INUSE)
 			continue;
 		if (ps->ps_td != td)
 			continue;
 
 		KASSERT(ps->ps_cpu == cpu,
 		    ("[pmc,%d] cpu mismatch ps_cpu=%d pcpu=%d", __LINE__,
 			ps->ps_cpu, PCPU_GET(cpuid)));
 
 		pm = ps->ps_pmc;
 
 		KASSERT(pm->pm_flags & PMC_F_CALLCHAIN,
 		    ("[pmc,%d] Retrieving callchain for PMC that doesn't "
 			"want it", __LINE__));
 
 		KASSERT(pm->pm_runcount > 0,
 		    ("[pmc,%d] runcount %d", __LINE__, pm->pm_runcount));
 
 		/*
 		 * Retrieve the callchain and mark the sample buffer
 		 * as 'processable' by the timer tick sweep code.
 		 */
 		ps->ps_nsamples = pmc_save_user_callchain(ps->ps_pc,
 		    pmc_callchaindepth, tf);
 
 #ifdef	INVARIANTS
 		ncallchains++;
 #endif
 
 	}
 
 	KASSERT(ncallchains > 0,
 	    ("[pmc,%d] cpu %d didn't find a sample to collect", __LINE__,
 		cpu));
 
 	return;
 }
 
 
 /*
  * Process saved PC samples.
  */
 
 static void
 pmc_process_samples(int cpu)
 {
 	struct pmc *pm;
 	int adjri, n;
 	struct thread *td;
 	struct pmc_owner *po;
 	struct pmc_sample *ps;
 	struct pmc_classdep *pcd;
 	struct pmc_samplebuffer *psb;
 
 	KASSERT(PCPU_GET(cpuid) == cpu,
 	    ("[pmc,%d] not on the correct CPU pcpu=%d cpu=%d", __LINE__,
 		PCPU_GET(cpuid), cpu));
 
 	psb = pmc_pcpu[cpu]->pc_sb;
 
 	for (n = 0; n < pmc_nsamples; n++) { /* bound on #iterations */
 
 		ps = psb->ps_read;
 		if (ps->ps_nsamples == PMC_SAMPLE_FREE)
 			break;
 		if (ps->ps_nsamples == PMC_SAMPLE_INUSE) {
 			/* Need a rescan at a later time. */
-			atomic_set_int(&pmc_cpumask, (1 << cpu));
+			CPU_SET_ATOMIC(cpu, &pmc_cpumask);
 			break;
 		}
 
 		pm = ps->ps_pmc;
 
 		KASSERT(pm->pm_runcount > 0,
 		    ("[pmc,%d] pm=%p runcount %d", __LINE__, (void *) pm,
 			pm->pm_runcount));
 
 		po = pm->pm_owner;
 
 		KASSERT(PMC_IS_SAMPLING_MODE(PMC_TO_MODE(pm)),
 		    ("[pmc,%d] pmc=%p non-sampling mode=%d", __LINE__,
 			pm, PMC_TO_MODE(pm)));
 
 		/* Ignore PMCs that have been switched off */
 		if (pm->pm_state != PMC_STATE_RUNNING)
 			goto entrydone;
 
 		PMCDBG(SAM,OPS,1,"cpu=%d pm=%p n=%d fl=%x wr=%d rd=%d", cpu,
 		    pm, ps->ps_nsamples, ps->ps_flags,
 		    (int) (psb->ps_write - psb->ps_samples),
 		    (int) (psb->ps_read - psb->ps_samples));
 
 		/*
 		 * If this is a process-mode PMC that is attached to
 		 * its owner, and if the PC is in user mode, update
 		 * profiling statistics like timer-based profiling
 		 * would have done.
 		 */
 		if (pm->pm_flags & PMC_F_ATTACHED_TO_OWNER) {
 			if (ps->ps_flags & PMC_CC_F_USERSPACE) {
 				td = FIRST_THREAD_IN_PROC(po->po_owner);
 				addupc_intr(td, ps->ps_pc[0], 1);
 			}
 			goto entrydone;
 		}
 
 		/*
 		 * Otherwise, this is either a sampling mode PMC that
 		 * is attached to a different process than its owner,
 		 * or a system-wide sampling PMC.  Dispatch a log
 		 * entry to the PMC's owner process.
 		 */
 
 		pmclog_process_callchain(pm, ps);
 
 	entrydone:
 		ps->ps_nsamples = 0;	/* mark entry as free */
 		atomic_subtract_rel_int(&pm->pm_runcount, 1);
 
 		/* increment read pointer, modulo sample size */
 		if (++ps == psb->ps_fence)
 			psb->ps_read = psb->ps_samples;
 		else
 			psb->ps_read = ps;
 	}
 
 	atomic_add_int(&pmc_stats.pm_log_sweeps, 1);
 
 	/* Do not re-enable stalled PMCs if we failed to process any samples */
 	if (n == 0)
 		return;
 
 	/*
 	 * Restart any stalled sampling PMCs on this CPU.
 	 *
 	 * If the NMI handler sets the pm_stalled field of a PMC after
 	 * the check below, we'll end up processing the stalled PMC at
 	 * the next hardclock tick.
 	 */
 	for (n = 0; n < md->pmd_npmc; n++) {
 		pcd = pmc_ri_to_classdep(md, n, &adjri);
 		KASSERT(pcd != NULL,
 		    ("[pmc,%d] null pcd ri=%d", __LINE__, n));
 		(void) (*pcd->pcd_get_config)(cpu,adjri,&pm);
 
 		if (pm == NULL ||			 /* !cfg'ed */
 		    pm->pm_state != PMC_STATE_RUNNING || /* !active */
 		    !PMC_IS_SAMPLING_MODE(PMC_TO_MODE(pm)) || /* !sampling */
 		    pm->pm_stalled == 0) /* !stalled */
 			continue;
 
 		pm->pm_stalled = 0;
 		(*pcd->pcd_start_pmc)(cpu, adjri);
 	}
 }
 
 /*
  * Event handlers.
  */
 
 /*
  * Handle a process exit.
  *
  * Remove this process from all hash tables.  If this process
  * owned any PMCs, turn off those PMCs and deallocate them,
  * removing any associations with target processes.
  *
  * This function will be called by the last 'thread' of a
  * process.
  *
  * XXX This eventhandler gets called early in the exit process.
  * Consider using a 'hook' invocation from thread_exit() or equivalent
  * spot.  Another negative is that kse_exit doesn't seem to call
  * exit1() [??].
  *
  */
 
 static void
 pmc_process_exit(void *arg __unused, struct proc *p)
 {
 	struct pmc *pm;
 	int adjri, cpu;
 	unsigned int ri;
 	int is_using_hwpmcs;
 	struct pmc_owner *po;
 	struct pmc_process *pp;
 	struct pmc_classdep *pcd;
 	pmc_value_t newvalue, tmp;
 
 	PROC_LOCK(p);
 	is_using_hwpmcs = p->p_flag & P_HWPMC;
 	PROC_UNLOCK(p);
 
 	/*
 	 * Log a sysexit event to all SS PMC owners.
 	 */
 	LIST_FOREACH(po, &pmc_ss_owners, po_ssnext)
 	    if (po->po_flags & PMC_PO_OWNS_LOGFILE)
 		    pmclog_process_sysexit(po, p->p_pid);
 
 	if (!is_using_hwpmcs)
 		return;
 
 	PMC_GET_SX_XLOCK();
 	PMCDBG(PRC,EXT,1,"process-exit proc=%p (%d, %s)", p, p->p_pid,
 	    p->p_comm);
 
 	/*
 	 * Since this code is invoked by the last thread in an exiting
 	 * process, we would have context switched IN at some prior
 	 * point.  However, with PREEMPTION, kernel mode context
 	 * switches may happen any time, so we want to disable a
 	 * context switch OUT till we get any PMCs targeting this
 	 * process off the hardware.
 	 *
 	 * We also need to atomically remove this process'
 	 * entry from our target process hash table, using
 	 * PMC_FLAG_REMOVE.
 	 */
 	PMCDBG(PRC,EXT,1, "process-exit proc=%p (%d, %s)", p, p->p_pid,
 	    p->p_comm);
 
 	critical_enter(); /* no preemption */
 
 	cpu = curthread->td_oncpu;
 
 	if ((pp = pmc_find_process_descriptor(p,
 		 PMC_FLAG_REMOVE)) != NULL) {
 
 		PMCDBG(PRC,EXT,2,
 		    "process-exit proc=%p pmc-process=%p", p, pp);
 
 		/*
 		 * The exiting process could be the target of
 		 * some PMCs which will be running on the
 		 * currently executing CPU.
 		 *
 		 * We need to turn these PMCs off like we
 		 * would do at context switch OUT time.
 		 */
 		for (ri = 0; ri < md->pmd_npmc; ri++) {
 
 			/*
 			 * Pick up the pmc pointer from hardware
 			 * state similar to the CSW_OUT code.
 			 */
 			pm = NULL;
 
 			pcd = pmc_ri_to_classdep(md, ri, &adjri);
 
 			(void) (*pcd->pcd_get_config)(cpu, adjri, &pm);
 
 			PMCDBG(PRC,EXT,2, "ri=%d pm=%p", ri, pm);
 
 			if (pm == NULL ||
 			    !PMC_IS_VIRTUAL_MODE(PMC_TO_MODE(pm)))
 				continue;
 
 			PMCDBG(PRC,EXT,2, "ppmcs[%d]=%p pm=%p "
 			    "state=%d", ri, pp->pp_pmcs[ri].pp_pmc,
 			    pm, pm->pm_state);
 
 			KASSERT(PMC_TO_ROWINDEX(pm) == ri,
 			    ("[pmc,%d] ri mismatch pmc(%d) ri(%d)",
 				__LINE__, PMC_TO_ROWINDEX(pm), ri));
 
 			KASSERT(pm == pp->pp_pmcs[ri].pp_pmc,
 			    ("[pmc,%d] pm %p != pp_pmcs[%d] %p",
 				__LINE__, pm, ri, pp->pp_pmcs[ri].pp_pmc));
 
 			(void) pcd->pcd_stop_pmc(cpu, adjri);
 
 			KASSERT(pm->pm_runcount > 0,
 			    ("[pmc,%d] bad runcount ri %d rc %d",
 				__LINE__, ri, pm->pm_runcount));
 
 			/* Stop hardware only if it is actually running */
 			if (pm->pm_state == PMC_STATE_RUNNING &&
 			    pm->pm_stalled == 0) {
 				pcd->pcd_read_pmc(cpu, adjri, &newvalue);
 				tmp = newvalue -
 				    PMC_PCPU_SAVED(cpu,ri);
 
 				mtx_pool_lock_spin(pmc_mtxpool, pm);
 				pm->pm_gv.pm_savedvalue += tmp;
 				pp->pp_pmcs[ri].pp_pmcval += tmp;
 				mtx_pool_unlock_spin(pmc_mtxpool, pm);
 			}
 
 			atomic_subtract_rel_int(&pm->pm_runcount,1);
 
 			KASSERT((int) pm->pm_runcount >= 0,
 			    ("[pmc,%d] runcount is %d", __LINE__, (int) pm->pm_runcount));
 
 			(void) pcd->pcd_config_pmc(cpu, adjri, NULL);
 		}
 
 		/*
 		 * Inform the MD layer of this pseudo "context switch
 		 * out"
 		 */
 		(void) md->pmd_switch_out(pmc_pcpu[cpu], pp);
 
 		critical_exit(); /* ok to be pre-empted now */
 
 		/*
 		 * Unlink this process from the PMCs that are
 		 * targeting it.  This will send a signal to
 		 * all PMC owners whose PMCs are orphaned.
 		 *
 		 * Log PMC value at exit time if requested.
 		 */
 		for (ri = 0; ri < md->pmd_npmc; ri++)
 			if ((pm = pp->pp_pmcs[ri].pp_pmc) != NULL) {
 				if (pm->pm_flags & PMC_F_NEEDS_LOGFILE &&
 				    PMC_IS_COUNTING_MODE(PMC_TO_MODE(pm)))
 					pmclog_process_procexit(pm, pp);
 				pmc_unlink_target_process(pm, pp);
 			}
 		free(pp, M_PMC);
 
 	} else
 		critical_exit(); /* pp == NULL */
 
 
 	/*
 	 * If the process owned PMCs, free them up and free up
 	 * memory.
 	 */
 	if ((po = pmc_find_owner_descriptor(p)) != NULL) {
 		pmc_remove_owner(po);
 		pmc_destroy_owner_descriptor(po);
 	}
 
 	sx_xunlock(&pmc_sx);
 }
 
 /*
  * Handle a process fork.
  *
  * If the parent process 'p1' is under HWPMC monitoring, then copy
  * over any attached PMCs that have 'do_descendants' semantics.
  */
 
 static void
 pmc_process_fork(void *arg __unused, struct proc *p1, struct proc *newproc,
     int flags)
 {
 	int is_using_hwpmcs;
 	unsigned int ri;
 	uint32_t do_descendants;
 	struct pmc *pm;
 	struct pmc_owner *po;
 	struct pmc_process *ppnew, *ppold;
 
 	(void) flags;		/* unused parameter */
 
 	PROC_LOCK(p1);
 	is_using_hwpmcs = p1->p_flag & P_HWPMC;
 	PROC_UNLOCK(p1);
 
 	/*
 	 * If there are system-wide sampling PMCs active, we need to
 	 * log all fork events to their owner's logs.
 	 */
 
 	LIST_FOREACH(po, &pmc_ss_owners, po_ssnext)
 	    if (po->po_flags & PMC_PO_OWNS_LOGFILE)
 		    pmclog_process_procfork(po, p1->p_pid, newproc->p_pid);
 
 	if (!is_using_hwpmcs)
 		return;
 
 	PMC_GET_SX_XLOCK();
 	PMCDBG(PMC,FRK,1, "process-fork proc=%p (%d, %s) -> %p", p1,
 	    p1->p_pid, p1->p_comm, newproc);
 
 	/*
 	 * If the parent process (curthread->td_proc) is a
 	 * target of any PMCs, look for PMCs that are to be
 	 * inherited, and link these into the new process
 	 * descriptor.
 	 */
 	if ((ppold = pmc_find_process_descriptor(curthread->td_proc,
 		 PMC_FLAG_NONE)) == NULL)
 		goto done;		/* nothing to do */
 
 	do_descendants = 0;
 	for (ri = 0; ri < md->pmd_npmc; ri++)
 		if ((pm = ppold->pp_pmcs[ri].pp_pmc) != NULL)
 			do_descendants |= pm->pm_flags & PMC_F_DESCENDANTS;
 	if (do_descendants == 0) /* nothing to do */
 		goto done;
 
 	/* allocate a descriptor for the new process  */
 	if ((ppnew = pmc_find_process_descriptor(newproc,
 		 PMC_FLAG_ALLOCATE)) == NULL)
 		goto done;
 
 	/*
 	 * Run through all PMCs that were targeting the old process
 	 * and which specified F_DESCENDANTS and attach them to the
 	 * new process.
 	 *
 	 * Log the fork event to all owners of PMCs attached to this
 	 * process, if not already logged.
 	 */
 	for (ri = 0; ri < md->pmd_npmc; ri++)
 		if ((pm = ppold->pp_pmcs[ri].pp_pmc) != NULL &&
 		    (pm->pm_flags & PMC_F_DESCENDANTS)) {
 			pmc_link_target_process(pm, ppnew);
 			po = pm->pm_owner;
 			if (po->po_sscount == 0 &&
 			    po->po_flags & PMC_PO_OWNS_LOGFILE)
 				pmclog_process_procfork(po, p1->p_pid,
 				    newproc->p_pid);
 		}
 
 	/*
 	 * Now mark the new process as being tracked by this driver.
 	 */
 	PROC_LOCK(newproc);
 	newproc->p_flag |= P_HWPMC;
 	PROC_UNLOCK(newproc);
 
  done:
 	sx_xunlock(&pmc_sx);
 }
 
 
 /*
  * initialization
  */
 
 static const char *pmc_name_of_pmcclass[] = {
 #undef	__PMC_CLASS
 #define	__PMC_CLASS(N) #N ,
 	__PMC_CLASSES()
 };
 
 static int
 pmc_initialize(void)
 {
 	int c, cpu, error, n, ri;
 	unsigned int maxcpu;
 	struct pmc_binding pb;
 	struct pmc_sample *ps;
 	struct pmc_classdep *pcd;
 	struct pmc_samplebuffer *sb;
 
 	md = NULL;
 	error = 0;
 
 #ifdef	DEBUG
 	/* parse debug flags first */
 	if (TUNABLE_STR_FETCH(PMC_SYSCTL_NAME_PREFIX "debugflags",
 		pmc_debugstr, sizeof(pmc_debugstr)))
 		pmc_debugflags_parse(pmc_debugstr,
 		    pmc_debugstr+strlen(pmc_debugstr));
 #endif
 
 	PMCDBG(MOD,INI,0, "PMC Initialize (version %x)", PMC_VERSION);
 
 	/* check kernel version */
 	if (pmc_kernel_version != PMC_VERSION) {
 		if (pmc_kernel_version == 0)
 			printf("hwpmc: this kernel has not been compiled with "
 			    "'options HWPMC_HOOKS'.\n");
 		else
 			printf("hwpmc: kernel version (0x%x) does not match "
 			    "module version (0x%x).\n", pmc_kernel_version,
 			    PMC_VERSION);
 		return EPROGMISMATCH;
 	}
 
 	/*
 	 * check sysctl parameters
 	 */
 
 	if (pmc_hashsize <= 0) {
 		(void) printf("hwpmc: tunable \"hashsize\"=%d must be "
 		    "greater than zero.\n", pmc_hashsize);
 		pmc_hashsize = PMC_HASH_SIZE;
 	}
 
 	if (pmc_nsamples <= 0 || pmc_nsamples > 65535) {
 		(void) printf("hwpmc: tunable \"nsamples\"=%d out of "
 		    "range.\n", pmc_nsamples);
 		pmc_nsamples = PMC_NSAMPLES;
 	}
 
 	if (pmc_callchaindepth <= 0 ||
 	    pmc_callchaindepth > PMC_CALLCHAIN_DEPTH_MAX) {
 		(void) printf("hwpmc: tunable \"callchaindepth\"=%d out of "
 		    "range.\n", pmc_callchaindepth);
 		pmc_callchaindepth = PMC_CALLCHAIN_DEPTH;
 	}
 
 	md = pmc_md_initialize();
 
 	if (md == NULL)
 		return (ENOSYS);
 
 	KASSERT(md->pmd_nclass >= 1 && md->pmd_npmc >= 1,
 	    ("[pmc,%d] no classes or pmcs", __LINE__));
 
 	/* Compute the map from row-indices to classdep pointers. */
 	pmc_rowindex_to_classdep = malloc(sizeof(struct pmc_classdep *) *
 	    md->pmd_npmc, M_PMC, M_WAITOK|M_ZERO);
 
 	for (n = 0; n < md->pmd_npmc; n++)
 		pmc_rowindex_to_classdep[n] = NULL;
 	for (ri = c = 0; c < md->pmd_nclass; c++) {
 		pcd = &md->pmd_classdep[c];
 		for (n = 0; n < pcd->pcd_num; n++, ri++)
 			pmc_rowindex_to_classdep[ri] = pcd;
 	}
 
 	KASSERT(ri == md->pmd_npmc,
 	    ("[pmc,%d] npmc miscomputed: ri=%d, md->npmc=%d", __LINE__,
 	    ri, md->pmd_npmc));
 
 	maxcpu = pmc_cpu_max();
 
 	/* allocate space for the per-cpu array */
 	pmc_pcpu = malloc(maxcpu * sizeof(struct pmc_cpu *), M_PMC,
 	    M_WAITOK|M_ZERO);
 
 	/* per-cpu 'saved values' for managing process-mode PMCs */
 	pmc_pcpu_saved = malloc(sizeof(pmc_value_t) * maxcpu * md->pmd_npmc,
 	    M_PMC, M_WAITOK);
 
 	/* Perform CPU-dependent initialization. */
 	pmc_save_cpu_binding(&pb);
 	error = 0;
 	for (cpu = 0; error == 0 && cpu < maxcpu; cpu++) {
 		if (!pmc_cpu_is_active(cpu))
 			continue;
 		pmc_select_cpu(cpu);
 		pmc_pcpu[cpu] = malloc(sizeof(struct pmc_cpu) +
 		    md->pmd_npmc * sizeof(struct pmc_hw *), M_PMC,
 		    M_WAITOK|M_ZERO);
 		if (md->pmd_pcpu_init)
 			error = md->pmd_pcpu_init(md, cpu);
 		for (n = 0; error == 0 && n < md->pmd_nclass; n++)
 			error = md->pmd_classdep[n].pcd_pcpu_init(md, cpu);
 	}
 	pmc_restore_cpu_binding(&pb);
 
 	if (error)
 		return (error);
 
 	/* allocate space for the sample array */
 	for (cpu = 0; cpu < maxcpu; cpu++) {
 		if (!pmc_cpu_is_active(cpu))
 			continue;
 
 		sb = malloc(sizeof(struct pmc_samplebuffer) +
 		    pmc_nsamples * sizeof(struct pmc_sample), M_PMC,
 		    M_WAITOK|M_ZERO);
 		sb->ps_read = sb->ps_write = sb->ps_samples;
 		sb->ps_fence = sb->ps_samples + pmc_nsamples;
 
 		KASSERT(pmc_pcpu[cpu] != NULL,
 		    ("[pmc,%d] cpu=%d Null per-cpu data", __LINE__, cpu));
 
 		sb->ps_callchains = malloc(pmc_callchaindepth * pmc_nsamples *
 		    sizeof(uintptr_t), M_PMC, M_WAITOK|M_ZERO);
 
 		for (n = 0, ps = sb->ps_samples; n < pmc_nsamples; n++, ps++)
 			ps->ps_pc = sb->ps_callchains +
 			    (n * pmc_callchaindepth);
 
 		pmc_pcpu[cpu]->pc_sb = sb;
 	}
 
 	/* allocate space for the row disposition array */
 	pmc_pmcdisp = malloc(sizeof(enum pmc_mode) * md->pmd_npmc,
 	    M_PMC, M_WAITOK|M_ZERO);
 
 	KASSERT(pmc_pmcdisp != NULL,
 	    ("[pmc,%d] pmcdisp allocation returned NULL", __LINE__));
 
 	/* mark all PMCs as available */
 	for (n = 0; n < (int) md->pmd_npmc; n++)
 		PMC_MARK_ROW_FREE(n);
 
 	/* allocate thread hash tables */
 	pmc_ownerhash = hashinit(pmc_hashsize, M_PMC,
 	    &pmc_ownerhashmask);
 
 	pmc_processhash = hashinit(pmc_hashsize, M_PMC,
 	    &pmc_processhashmask);
 	mtx_init(&pmc_processhash_mtx, "pmc-process-hash", "pmc-leaf",
 	    MTX_SPIN);
 
 	LIST_INIT(&pmc_ss_owners);
 	pmc_ss_count = 0;
 
 	/* allocate a pool of spin mutexes */
 	pmc_mtxpool = mtx_pool_create("pmc-leaf", pmc_mtxpool_size,
 	    MTX_SPIN);
 
 	PMCDBG(MOD,INI,1, "pmc_ownerhash=%p, mask=0x%lx "
 	    "targethash=%p mask=0x%lx", pmc_ownerhash, pmc_ownerhashmask,
 	    pmc_processhash, pmc_processhashmask);
 
 	/* register process {exit,fork,exec} handlers */
 	pmc_exit_tag = EVENTHANDLER_REGISTER(process_exit,
 	    pmc_process_exit, NULL, EVENTHANDLER_PRI_ANY);
 	pmc_fork_tag = EVENTHANDLER_REGISTER(process_fork,
 	    pmc_process_fork, NULL, EVENTHANDLER_PRI_ANY);
 
 	/* initialize logging */
 	pmclog_initialize();
 
 	/* set hook functions */
 	pmc_intr = md->pmd_intr;
 	pmc_hook = pmc_hook_handler;
 
 	if (error == 0) {
 		printf(PMC_MODULE_NAME ":");
 		for (n = 0; n < (int) md->pmd_nclass; n++) {
 			pcd = &md->pmd_classdep[n];
 			printf(" %s/%d/%d/0x%b",
 			    pmc_name_of_pmcclass[pcd->pcd_class],
 			    pcd->pcd_num,
 			    pcd->pcd_width,
 			    pcd->pcd_caps,
 			    "\20"
 			    "\1INT\2USR\3SYS\4EDG\5THR"
 			    "\6REA\7WRI\10INV\11QUA\12PRC"
 			    "\13TAG\14CSC");
 		}
 		printf("\n");
 	}
 
 	return (error);
 }
 
 /* prepare to be unloaded */
 static void
 pmc_cleanup(void)
 {
 	int c, cpu;
 	unsigned int maxcpu;
 	struct pmc_ownerhash *ph;
 	struct pmc_owner *po, *tmp;
 	struct pmc_binding pb;
 #ifdef	DEBUG
 	struct pmc_processhash *prh;
 #endif
 
 	PMCDBG(MOD,INI,0, "%s", "cleanup");
 
 	/* switch off sampling */
-	pmc_cpumask = 0;
+	CPU_ZERO(&pmc_cpumask);
 	pmc_intr = NULL;
 
 	sx_xlock(&pmc_sx);
 	if (pmc_hook == NULL) {	/* being unloaded already */
 		sx_xunlock(&pmc_sx);
 		return;
 	}
 
 	pmc_hook = NULL; /* prevent new threads from entering module */
 
 	/* deregister event handlers */
 	EVENTHANDLER_DEREGISTER(process_fork, pmc_fork_tag);
 	EVENTHANDLER_DEREGISTER(process_exit, pmc_exit_tag);
 
 	/* send SIGBUS to all owner threads, free up allocations */
 	if (pmc_ownerhash)
 		for (ph = pmc_ownerhash;
 		     ph <= &pmc_ownerhash[pmc_ownerhashmask];
 		     ph++) {
 			LIST_FOREACH_SAFE(po, ph, po_next, tmp) {
 				pmc_remove_owner(po);
 
 				/* send SIGBUS to owner processes */
 				PMCDBG(MOD,INI,2, "cleanup signal proc=%p "
 				    "(%d, %s)", po->po_owner,
 				    po->po_owner->p_pid,
 				    po->po_owner->p_comm);
 
 				PROC_LOCK(po->po_owner);
 				psignal(po->po_owner, SIGBUS);
 				PROC_UNLOCK(po->po_owner);
 
 				pmc_destroy_owner_descriptor(po);
 			}
 		}
 
 	/* reclaim allocated data structures */
 	if (pmc_mtxpool)
 		mtx_pool_destroy(&pmc_mtxpool);
 
 	mtx_destroy(&pmc_processhash_mtx);
 	if (pmc_processhash) {
 #ifdef	DEBUG
 		struct pmc_process *pp;
 
 		PMCDBG(MOD,INI,3, "%s", "destroy process hash");
 		for (prh = pmc_processhash;
 		     prh <= &pmc_processhash[pmc_processhashmask];
 		     prh++)
 			LIST_FOREACH(pp, prh, pp_next)
 			    PMCDBG(MOD,INI,3, "pid=%d", pp->pp_proc->p_pid);
 #endif
 
 		hashdestroy(pmc_processhash, M_PMC, pmc_processhashmask);
 		pmc_processhash = NULL;
 	}
 
 	if (pmc_ownerhash) {
 		PMCDBG(MOD,INI,3, "%s", "destroy owner hash");
 		hashdestroy(pmc_ownerhash, M_PMC, pmc_ownerhashmask);
 		pmc_ownerhash = NULL;
 	}
 
 	KASSERT(LIST_EMPTY(&pmc_ss_owners),
 	    ("[pmc,%d] Global SS owner list not empty", __LINE__));
 	KASSERT(pmc_ss_count == 0,
 	    ("[pmc,%d] Global SS count not empty", __LINE__));
 
  	/* do processor and pmc-class dependent cleanup */
 	maxcpu = pmc_cpu_max();
 
 	PMCDBG(MOD,INI,3, "%s", "md cleanup");
 	if (md) {
 		pmc_save_cpu_binding(&pb);
 		for (cpu = 0; cpu < maxcpu; cpu++) {
 			PMCDBG(MOD,INI,1,"pmc-cleanup cpu=%d pcs=%p",
 			    cpu, pmc_pcpu[cpu]);
 			if (!pmc_cpu_is_active(cpu) || pmc_pcpu[cpu] == NULL)
 				continue;
 			pmc_select_cpu(cpu);
 			for (c = 0; c < md->pmd_nclass; c++)
 				md->pmd_classdep[c].pcd_pcpu_fini(md, cpu);
 			if (md->pmd_pcpu_fini)
 				md->pmd_pcpu_fini(md, cpu);
 		}
 
 		pmc_md_finalize(md);
 
 		free(md, M_PMC);
 		md = NULL;
 		pmc_restore_cpu_binding(&pb);
 	}
 
 	/* Free per-cpu descriptors. */
 	for (cpu = 0; cpu < maxcpu; cpu++) {
 		if (!pmc_cpu_is_active(cpu))
 			continue;
 		KASSERT(pmc_pcpu[cpu]->pc_sb != NULL,
 		    ("[pmc,%d] Null cpu sample buffer cpu=%d", __LINE__,
 			cpu));
 		free(pmc_pcpu[cpu]->pc_sb->ps_callchains, M_PMC);
 		free(pmc_pcpu[cpu]->pc_sb, M_PMC);
 		free(pmc_pcpu[cpu], M_PMC);
 	}
 
 	free(pmc_pcpu, M_PMC);
 	pmc_pcpu = NULL;
 
 	free(pmc_pcpu_saved, M_PMC);
 	pmc_pcpu_saved = NULL;
 
 	if (pmc_pmcdisp) {
 		free(pmc_pmcdisp, M_PMC);
 		pmc_pmcdisp = NULL;
 	}
 
 	if (pmc_rowindex_to_classdep) {
 		free(pmc_rowindex_to_classdep, M_PMC);
 		pmc_rowindex_to_classdep = NULL;
 	}
 
 	pmclog_shutdown();
 
 	sx_xunlock(&pmc_sx); 	/* we are done */
 }
 
 /*
  * The function called at load/unload.
  */
 
 static int
 load (struct module *module __unused, int cmd, void *arg __unused)
 {
 	int error;
 
 	error = 0;
 
 	switch (cmd) {
 	case MOD_LOAD :
 		/* initialize the subsystem */
 		error = pmc_initialize();
 		if (error != 0)
 			break;
 		PMCDBG(MOD,INI,1, "syscall=%d maxcpu=%d",
 		    pmc_syscall_num, pmc_cpu_max());
 		break;
 
 
 	case MOD_UNLOAD :
 	case MOD_SHUTDOWN:
 		pmc_cleanup();
 		PMCDBG(MOD,INI,1, "%s", "unloaded");
 		break;
 
 	default :
 		error = EINVAL;	/* XXX should panic(9) */
 		break;
 	}
 
 	return error;
 }
 
 /* memory pool */
 MALLOC_DEFINE(M_PMC, "pmc", "Memory space for the PMC module");
Index: head/sys/dev/iicbus/ad7417.c
===================================================================
--- head/sys/dev/iicbus/ad7417.c	(revision 222812)
+++ head/sys/dev/iicbus/ad7417.c	(revision 222813)

Property changes on: head/sys/dev/iicbus/ad7417.c
___________________________________________________________________
Added: svn:eol-style
## -0,0 +1 ##
+native
\ No newline at end of property
Added: svn:mime-type
## -0,0 +1 ##
+text/plain
\ No newline at end of property
Index: head/sys/dev/xen/control/control.c
===================================================================
--- head/sys/dev/xen/control/control.c	(revision 222812)
+++ head/sys/dev/xen/control/control.c	(revision 222813)
@@ -1,493 +1,498 @@
 /*-
  * Copyright (c) 2010 Justin T. Gibbs, Spectra Logic Corporation
  * All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
  * 1. Redistributions of source code must retain the above copyright
  *    notice, this list of conditions, and the following disclaimer,
  *    without modification.
  * 2. Redistributions in binary form must reproduce at minimum a disclaimer
  *    substantially similar to the "NO WARRANTY" disclaimer below
  *    ("Disclaimer") and any redistribution must be conditioned upon
  *    including a substantially similar Disclaimer requirement for further
  *    binary redistribution.
  *
  * NO WARRANTY
  * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
  * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
  * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTIBILITY AND FITNESS FOR
  * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
  * HOLDERS OR CONTRIBUTORS BE LIABLE FOR SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
  * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
  * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
  * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
  * STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING
  * IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
  * POSSIBILITY OF SUCH DAMAGES.
  */
 
 /*-
  * PV suspend/resume support:
  *
  * Copyright (c) 2004 Christian Limpach.
  * Copyright (c) 2004-2006,2008 Kip Macy
  * All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
  * 1. Redistributions of source code must retain the above copyright
  *    notice, this list of conditions and the following disclaimer.
  * 2. Redistributions in binary form must reproduce the above copyright
  *    notice, this list of conditions and the following disclaimer in the
  *    documentation and/or other materials provided with the distribution.
  * 3. All advertising materials mentioning features or use of this software
  *    must display the following acknowledgement:
  *      This product includes software developed by Christian Limpach.
  * 4. The name of the author may not be used to endorse or promote products
  *    derived from this software without specific prior written permission.
  *
  * THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR
  * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
  * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
  * IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT,
  * INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT
  * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
  * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
  * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
  * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF
  * THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
  */
 
 /*-
  * HVM suspend/resume support:
  *
  * Copyright (c) 2008 Citrix Systems, Inc.
  * All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
  * 1. Redistributions of source code must retain the above copyright
  *    notice, this list of conditions and the following disclaimer.
  * 2. Redistributions in binary form must reproduce the above copyright
  *    notice, this list of conditions and the following disclaimer in the
  *    documentation and/or other materials provided with the distribution.
  *
  * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
  * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
  * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
  * ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
  * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
  * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
  * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
  * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
  * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
  * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
  * SUCH DAMAGE.
  */
 #include <sys/cdefs.h>
 __FBSDID("$FreeBSD$");
 
 /**
  * \file control.c
  *
  * \brief Device driver to respond to control domain events that impact
  *        this VM.
  */
 
 #include <sys/param.h>
 #include <sys/systm.h>
 #include <sys/kernel.h>
 #include <sys/malloc.h>
 
 #include <sys/bio.h>
 #include <sys/bus.h>
 #include <sys/conf.h>
 #include <sys/disk.h>
 #include <sys/fcntl.h>
 #include <sys/filedesc.h>
 #include <sys/kdb.h>
 #include <sys/module.h>
 #include <sys/namei.h>
 #include <sys/proc.h>
 #include <sys/reboot.h>
 #include <sys/rman.h>
 #include <sys/taskqueue.h>
 #include <sys/types.h>
 #include <sys/vnode.h>
 
 #ifndef XENHVM
 #include <sys/sched.h>
 #include <sys/smp.h>
 #endif
 
 
 #include <geom/geom.h>
 
 #include <machine/_inttypes.h>
 #include <machine/xen/xen-os.h>
 
 #include <vm/vm.h>
 #include <vm/vm_extern.h>
 #include <vm/vm_kern.h>
 
 #include <xen/blkif.h>
 #include <xen/evtchn.h>
 #include <xen/gnttab.h>
 #include <xen/xen_intr.h>
 
 #include <xen/interface/event_channel.h>
 #include <xen/interface/grant_table.h>
 
 #include <xen/xenbus/xenbusvar.h>
 
 #define NUM_ELEMENTS(x) (sizeof(x) / sizeof(*(x)))
 
 /*--------------------------- Forward Declarations --------------------------*/
 /** Function signature for shutdown event handlers. */
 typedef	void (xctrl_shutdown_handler_t)(void);
 
 static xctrl_shutdown_handler_t xctrl_poweroff;
 static xctrl_shutdown_handler_t xctrl_reboot;
 static xctrl_shutdown_handler_t xctrl_suspend;
 static xctrl_shutdown_handler_t xctrl_crash;
 static xctrl_shutdown_handler_t xctrl_halt;
 
 /*-------------------------- Private Data Structures -------------------------*/
 /** Element type for lookup table of event name to handler. */
 struct xctrl_shutdown_reason {
 	const char		 *name;
 	xctrl_shutdown_handler_t *handler;
 };
 
 /** Lookup table for shutdown event name to handler. */
 static struct xctrl_shutdown_reason xctrl_shutdown_reasons[] = {
 	{ "poweroff", xctrl_poweroff },
 	{ "reboot",   xctrl_reboot   },
 	{ "suspend",  xctrl_suspend  },
 	{ "crash",    xctrl_crash    },
 	{ "halt",     xctrl_halt     },
 };
 
 struct xctrl_softc {
 
 	/** Must be first */
 	struct xs_watch    xctrl_watch;	
 };
 
 /*------------------------------ Event Handlers ------------------------------*/
 static void
 xctrl_poweroff()
 {
 	shutdown_nice(RB_POWEROFF|RB_HALT);
 }
 
 static void
 xctrl_reboot()
 {
 	shutdown_nice(0);
 }
 
 #ifndef XENHVM
 extern void xencons_suspend(void);
 extern void xencons_resume(void);
 
 /* Full PV mode suspension. */
 static void
 xctrl_suspend()
 {
 	int i, j, k, fpp;
 	unsigned long max_pfn, start_info_mfn;
 
 #ifdef SMP
-	cpumask_t map;
+	struct thread *td;
+	cpuset_t map;
 	/*
 	 * Bind us to CPU 0 and stop any other VCPUs.
 	 */
-	thread_lock(curthread);
-	sched_bind(curthread, 0);
-	thread_unlock(curthread);
+	td = curthread;
+	thread_lock(td);
+	sched_bind(td, 0);
+	thread_unlock(td);
 	KASSERT(PCPU_GET(cpuid) == 0, ("xen_suspend: not running on cpu 0"));
 
-	map = PCPU_GET(other_cpus) & ~stopped_cpus;
-	if (map)
+	sched_pin();
+	map = PCPU_GET(other_cpus);
+	sched_unpin();
+	CPU_NAND(&map, &stopped_cpus);
+	if (!CPU_EMPTY(&map))
 		stop_cpus(map);
 #endif
 
 	if (DEVICE_SUSPEND(root_bus) != 0) {
 		printf("xen_suspend: device_suspend failed\n");
 #ifdef SMP
-		if (map)
+		if (!CPU_EMPTY(&map))
 			restart_cpus(map);
 #endif
 		return;
 	}
 
 	local_irq_disable();
 
 	xencons_suspend();
 	gnttab_suspend();
 
 	max_pfn = HYPERVISOR_shared_info->arch.max_pfn;
 
 	void *shared_info = HYPERVISOR_shared_info;
 	HYPERVISOR_shared_info = NULL;
 	pmap_kremove((vm_offset_t) shared_info);
 	PT_UPDATES_FLUSH();
 
 	xen_start_info->store_mfn = MFNTOPFN(xen_start_info->store_mfn);
 	xen_start_info->console.domU.mfn = MFNTOPFN(xen_start_info->console.domU.mfn);
 
 	/*
 	 * We'll stop somewhere inside this hypercall. When it returns,
 	 * we'll start resuming after the restore.
 	 */
 	start_info_mfn = VTOMFN(xen_start_info);
 	pmap_suspend();
 	HYPERVISOR_suspend(start_info_mfn);
 	pmap_resume();
 
 	pmap_kenter_ma((vm_offset_t) shared_info, xen_start_info->shared_info);
 	HYPERVISOR_shared_info = shared_info;
 
 	HYPERVISOR_shared_info->arch.pfn_to_mfn_frame_list_list =
 		VTOMFN(xen_pfn_to_mfn_frame_list_list);
   
 	fpp = PAGE_SIZE/sizeof(unsigned long);
 	for (i = 0, j = 0, k = -1; i < max_pfn; i += fpp, j++) {
 		if ((j % fpp) == 0) {
 			k++;
 			xen_pfn_to_mfn_frame_list_list[k] = 
 				VTOMFN(xen_pfn_to_mfn_frame_list[k]);
 			j = 0;
 		}
 		xen_pfn_to_mfn_frame_list[k][j] = 
 			VTOMFN(&xen_phys_machine[i]);
 	}
 	HYPERVISOR_shared_info->arch.max_pfn = max_pfn;
 
 	gnttab_resume();
 	irq_resume();
 	local_irq_enable();
 	xencons_resume();
 
 #ifdef CONFIG_SMP
 	for_each_cpu(i)
 		vcpu_prepare(i);
 
 #endif
 	/* 
 	 * Only resume xenbus /after/ we've prepared our VCPUs; otherwise
 	 * the VCPU hotplug callback can race with our vcpu_prepare
 	 */
 	DEVICE_RESUME(root_bus);
 
 #ifdef SMP
 	thread_lock(curthread);
 	sched_unbind(curthread);
 	thread_unlock(curthread);
-	if (map)
+	if (!CPU_EMPTY(&map))
 		restart_cpus(map);
 #endif
 }
 
 static void
 xen_pv_shutdown_final(void *arg, int howto)
 {
 	/*
 	 * Inform the hypervisor that shutdown is complete.
 	 * This is not necessary in HVM domains since Xen
 	 * emulates ACPI in that mode and FreeBSD's ACPI
 	 * support will request this transition.
 	 */
 	if (howto & (RB_HALT | RB_POWEROFF))
 		HYPERVISOR_shutdown(SHUTDOWN_poweroff);
 	else
 		HYPERVISOR_shutdown(SHUTDOWN_reboot);
 }
 
 #else
 extern void xenpci_resume(void);
 
 /* HVM mode suspension. */
 static void
 xctrl_suspend()
 {
 	int suspend_cancelled;
 
 	if (DEVICE_SUSPEND(root_bus)) {
 		printf("xen_suspend: device_suspend failed\n");
 		return;
 	}
 
 	/*
 	 * Make sure we don't change cpus or switch to some other
 	 * thread for the duration.
 	 */
 	critical_enter();
 
 	/*
 	 * Prevent any races with evtchn_interrupt() handler.
 	 */
 	irq_suspend();
 	disable_intr();
 
 	suspend_cancelled = HYPERVISOR_suspend(0);
 	if (!suspend_cancelled)
 		xenpci_resume();
 
 	/*
 	 * Re-enable interrupts and put the scheduler back to normal.
 	 */
 	enable_intr();
 	critical_exit();
 
 	/*
 	 * FreeBSD really needs to add DEVICE_SUSPEND_CANCEL or
 	 * similar.
 	 */
 	if (!suspend_cancelled)
 		DEVICE_RESUME(root_bus);
 }
 #endif
 
 static void
 xctrl_crash()
 {
 	panic("Xen directed crash");
 }
 
 static void
 xctrl_halt()
 {
 	shutdown_nice(RB_HALT);
 }
 
 /*------------------------------ Event Reception -----------------------------*/
 static void
 xctrl_on_watch_event(struct xs_watch *watch, const char **vec, unsigned int len)
 {
 	struct xctrl_shutdown_reason *reason;
 	struct xctrl_shutdown_reason *last_reason;
 	char *result;
 	int   error;
 	int   result_len;
 	
 	error = xs_read(XST_NIL, "control", "shutdown",
 			&result_len, (void **)&result);
 	if (error != 0)
 		return;
 
 	reason = xctrl_shutdown_reasons;
 	last_reason = reason + NUM_ELEMENTS(xctrl_shutdown_reasons);
 	while (reason < last_reason) {
 
 		if (!strcmp(result, reason->name)) {
 			reason->handler();
 			break;
 		}
 		reason++;
 	}
 
 	free(result, M_XENSTORE);
 }
 
 /*------------------ Private Device Attachment Functions  --------------------*/
 /**
  * \brief Identify instances of this device type in the system.
  *
  * \param driver  The driver performing this identify action.
  * \param parent  The NewBus parent device for any devices this method adds.
  */
 static void
 xctrl_identify(driver_t *driver __unused, device_t parent)
 {
 	/*
 	 * A single device instance for our driver is always present
 	 * in a system operating under Xen.
 	 */
 	BUS_ADD_CHILD(parent, 0, driver->name, 0);
 }
 
 /**
  * \brief Probe for the existence of the Xen Control device
  *
  * \param dev  NewBus device_t for this Xen control instance.
  *
  * \return  Always returns 0 indicating success.
  */
 static int 
 xctrl_probe(device_t dev)
 {
 	device_set_desc(dev, "Xen Control Device");
 
 	return (0);
 }
 
 /**
  * \brief Attach the Xen control device.
  *
  * \param dev  NewBus device_t for this Xen control instance.
  *
  * \return  On success, 0. Otherwise an errno value indicating the
  *          type of failure.
  */
 static int
 xctrl_attach(device_t dev)
 {
 	struct xctrl_softc *xctrl;
 
 	xctrl = device_get_softc(dev);
 
 	/* Activate watch */
 	xctrl->xctrl_watch.node = "control/shutdown";
 	xctrl->xctrl_watch.callback = xctrl_on_watch_event;
 	xs_register_watch(&xctrl->xctrl_watch);
 
 #ifndef XENHVM
 	EVENTHANDLER_REGISTER(shutdown_final, xen_pv_shutdown_final, NULL,
 			      SHUTDOWN_PRI_LAST);
 #endif
 
 	return (0);
 }
 
 /**
  * \brief Detach the Xen control device.
  *
  * \param dev  NewBus device_t for this Xen control device instance.
  *
  * \return  On success, 0. Otherwise an errno value indicating the
  *          type of failure.
  */
 static int
 xctrl_detach(device_t dev)
 {
 	struct xctrl_softc *xctrl;
 
 	xctrl = device_get_softc(dev);
 
 	/* Release watch */
 	xs_unregister_watch(&xctrl->xctrl_watch);
 
 	return (0);
 }
 
 /*-------------------- Private Device Attachment Data  -----------------------*/
 static device_method_t xctrl_methods[] = { 
 	/* Device interface */ 
 	DEVMETHOD(device_identify,	xctrl_identify),
 	DEVMETHOD(device_probe,         xctrl_probe), 
 	DEVMETHOD(device_attach,        xctrl_attach), 
 	DEVMETHOD(device_detach,        xctrl_detach), 
  
 	{ 0, 0 } 
 }; 
 
 DEFINE_CLASS_0(xctrl, xctrl_driver, xctrl_methods, sizeof(struct xctrl_softc));
 devclass_t xctrl_devclass; 
  
 DRIVER_MODULE(xctrl, xenstore, xctrl_driver, xctrl_devclass, 0, 0);
Index: head/sys/geom/eli/g_eli.c
===================================================================
--- head/sys/geom/eli/g_eli.c	(revision 222812)
+++ head/sys/geom/eli/g_eli.c	(revision 222813)
@@ -1,1282 +1,1282 @@
 /*-
  * Copyright (c) 2005-2011 Pawel Jakub Dawidek <pawel@dawidek.net>
  * All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
  * 1. Redistributions of source code must retain the above copyright
  *    notice, this list of conditions and the following disclaimer.
  * 2. Redistributions in binary form must reproduce the above copyright
  *    notice, this list of conditions and the following disclaimer in the
  *    documentation and/or other materials provided with the distribution.
  *
  * THIS SOFTWARE IS PROVIDED BY THE AUTHORS AND CONTRIBUTORS ``AS IS'' AND
  * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
  * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
  * ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHORS OR CONTRIBUTORS BE LIABLE
  * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
  * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
  * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
  * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
  * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
  * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
  * SUCH DAMAGE.
  */
 
 #include <sys/cdefs.h>
 __FBSDID("$FreeBSD$");
 
 #include <sys/param.h>
 #include <sys/systm.h>
 #include <sys/kernel.h>
 #include <sys/linker.h>
 #include <sys/module.h>
 #include <sys/lock.h>
 #include <sys/mutex.h>
 #include <sys/bio.h>
 #include <sys/sysctl.h>
 #include <sys/malloc.h>
 #include <sys/eventhandler.h>
 #include <sys/kthread.h>
 #include <sys/proc.h>
 #include <sys/sched.h>
 #include <sys/smp.h>
 #include <sys/uio.h>
 #include <sys/vnode.h>
 
 #include <vm/uma.h>
 
 #include <geom/geom.h>
 #include <geom/eli/g_eli.h>
 #include <geom/eli/pkcs5v2.h>
 
 FEATURE(geom_eli, "GEOM crypto module");
 
 MALLOC_DEFINE(M_ELI, "eli data", "GEOM_ELI Data");
 
 SYSCTL_DECL(_kern_geom);
 SYSCTL_NODE(_kern_geom, OID_AUTO, eli, CTLFLAG_RW, 0, "GEOM_ELI stuff");
 static int g_eli_version = G_ELI_VERSION;
 SYSCTL_INT(_kern_geom_eli, OID_AUTO, version, CTLFLAG_RD, &g_eli_version, 0,
     "GELI version");
 int g_eli_debug = 0;
 TUNABLE_INT("kern.geom.eli.debug", &g_eli_debug);
 SYSCTL_INT(_kern_geom_eli, OID_AUTO, debug, CTLFLAG_RW, &g_eli_debug, 0,
     "Debug level");
 static u_int g_eli_tries = 3;
 TUNABLE_INT("kern.geom.eli.tries", &g_eli_tries);
 SYSCTL_UINT(_kern_geom_eli, OID_AUTO, tries, CTLFLAG_RW, &g_eli_tries, 0,
     "Number of tries for entering the passphrase");
 static u_int g_eli_visible_passphrase = GETS_NOECHO;
 TUNABLE_INT("kern.geom.eli.visible_passphrase", &g_eli_visible_passphrase);
 SYSCTL_UINT(_kern_geom_eli, OID_AUTO, visible_passphrase, CTLFLAG_RW,
     &g_eli_visible_passphrase, 0,
     "Visibility of passphrase prompt (0 = invisible, 1 = visible, 2 = asterisk)");
 u_int g_eli_overwrites = G_ELI_OVERWRITES;
 TUNABLE_INT("kern.geom.eli.overwrites", &g_eli_overwrites);
 SYSCTL_UINT(_kern_geom_eli, OID_AUTO, overwrites, CTLFLAG_RW, &g_eli_overwrites,
     0, "Number of times on-disk keys should be overwritten when destroying them");
 static u_int g_eli_threads = 0;
 TUNABLE_INT("kern.geom.eli.threads", &g_eli_threads);
 SYSCTL_UINT(_kern_geom_eli, OID_AUTO, threads, CTLFLAG_RW, &g_eli_threads, 0,
     "Number of threads doing crypto work");
 u_int g_eli_batch = 0;
 TUNABLE_INT("kern.geom.eli.batch", &g_eli_batch);
 SYSCTL_UINT(_kern_geom_eli, OID_AUTO, batch, CTLFLAG_RW, &g_eli_batch, 0,
     "Use crypto operations batching");
 
 static eventhandler_tag g_eli_pre_sync = NULL;
 
 static int g_eli_destroy_geom(struct gctl_req *req, struct g_class *mp,
     struct g_geom *gp);
 static void g_eli_init(struct g_class *mp);
 static void g_eli_fini(struct g_class *mp);
 
 static g_taste_t g_eli_taste;
 static g_dumpconf_t g_eli_dumpconf;
 
 struct g_class g_eli_class = {
 	.name = G_ELI_CLASS_NAME,
 	.version = G_VERSION,
 	.ctlreq = g_eli_config,
 	.taste = g_eli_taste,
 	.destroy_geom = g_eli_destroy_geom,
 	.init = g_eli_init,
 	.fini = g_eli_fini
 };
 
 
 /*
  * Code paths:
  * BIO_READ:
  *	g_eli_start -> g_eli_crypto_read -> g_io_request -> g_eli_read_done -> g_eli_crypto_run -> g_eli_crypto_read_done -> g_io_deliver
  * BIO_WRITE:
  *	g_eli_start -> g_eli_crypto_run -> g_eli_crypto_write_done -> g_io_request -> g_eli_write_done -> g_io_deliver
  */
 
 
 /*
  * EAGAIN from crypto(9) means that we were probably balanced to another crypto
  * accelerator or something like that.
  * The function updates the SID and reruns the operation.
  */
 int
 g_eli_crypto_rerun(struct cryptop *crp)
 {
 	struct g_eli_softc *sc;
 	struct g_eli_worker *wr;
 	struct bio *bp;
 	int error;
 
 	bp = (struct bio *)crp->crp_opaque;
 	sc = bp->bio_to->geom->softc;
 	LIST_FOREACH(wr, &sc->sc_workers, w_next) {
 		if (wr->w_number == bp->bio_pflags)
 			break;
 	}
 	KASSERT(wr != NULL, ("Invalid worker (%u).", bp->bio_pflags));
 	G_ELI_DEBUG(1, "Rerunning crypto %s request (sid: %ju -> %ju).",
 	    bp->bio_cmd == BIO_READ ? "READ" : "WRITE", (uintmax_t)wr->w_sid,
 	    (uintmax_t)crp->crp_sid);
 	wr->w_sid = crp->crp_sid;
 	crp->crp_etype = 0;
 	error = crypto_dispatch(crp);
 	if (error == 0)
 		return (0);
 	G_ELI_DEBUG(1, "%s: crypto_dispatch() returned %d.", __func__, error);
 	crp->crp_etype = error;
 	return (error);
 }
 
 /*
  * The function is called after reading encrypted data from the provider.
  *
  * g_eli_start -> g_eli_crypto_read -> g_io_request -> G_ELI_READ_DONE -> g_eli_crypto_run -> g_eli_crypto_read_done -> g_io_deliver
  */
 void
 g_eli_read_done(struct bio *bp)
 {
 	struct g_eli_softc *sc;
 	struct bio *pbp;
 
 	G_ELI_LOGREQ(2, bp, "Request done.");
 	pbp = bp->bio_parent;
 	if (pbp->bio_error == 0)
 		pbp->bio_error = bp->bio_error;
 	g_destroy_bio(bp);
 	/*
 	 * Do we have all sectors already?
 	 */
 	pbp->bio_inbed++;
 	if (pbp->bio_inbed < pbp->bio_children)
 		return;
 	sc = pbp->bio_to->geom->softc;
 	if (pbp->bio_error != 0) {
 		G_ELI_LOGREQ(0, pbp, "%s() failed", __func__);
 		pbp->bio_completed = 0;
 		if (pbp->bio_driver2 != NULL) {
 			free(pbp->bio_driver2, M_ELI);
 			pbp->bio_driver2 = NULL;
 		}
 		g_io_deliver(pbp, pbp->bio_error);
 		atomic_subtract_int(&sc->sc_inflight, 1);
 		return;
 	}
 	mtx_lock(&sc->sc_queue_mtx);
 	bioq_insert_tail(&sc->sc_queue, pbp);
 	mtx_unlock(&sc->sc_queue_mtx);
 	wakeup(sc);
 }
 
 /*
  * The function is called after we encrypt and write data.
  *
  * g_eli_start -> g_eli_crypto_run -> g_eli_crypto_write_done -> g_io_request -> G_ELI_WRITE_DONE -> g_io_deliver
  */
 void
 g_eli_write_done(struct bio *bp)
 {
 	struct g_eli_softc *sc;
 	struct bio *pbp;
 
 	G_ELI_LOGREQ(2, bp, "Request done.");
 	pbp = bp->bio_parent;
 	if (pbp->bio_error == 0) {
 		if (bp->bio_error != 0)
 			pbp->bio_error = bp->bio_error;
 	}
 	g_destroy_bio(bp);
 	/*
 	 * Do we have all sectors already?
 	 */
 	pbp->bio_inbed++;
 	if (pbp->bio_inbed < pbp->bio_children)
 		return;
 	free(pbp->bio_driver2, M_ELI);
 	pbp->bio_driver2 = NULL;
 	if (pbp->bio_error != 0) {
 		G_ELI_LOGREQ(0, pbp, "Crypto WRITE request failed (error=%d).",
 		    pbp->bio_error);
 		pbp->bio_completed = 0;
 	}
 	/*
 	 * Write is finished, send it up.
 	 */
 	pbp->bio_completed = pbp->bio_length;
 	sc = pbp->bio_to->geom->softc;
 	g_io_deliver(pbp, pbp->bio_error);
 	atomic_subtract_int(&sc->sc_inflight, 1);
 }
 
 /*
  * This function should never be called, but GEOM requires us to set the
  * ->orphan() method for every geom.
  */
 static void
 g_eli_orphan_spoil_assert(struct g_consumer *cp)
 {
 
 	panic("Function %s() called for %s.", __func__, cp->geom->name);
 }
 
 static void
 g_eli_orphan(struct g_consumer *cp)
 {
 	struct g_eli_softc *sc;
 
 	g_topology_assert();
 	sc = cp->geom->softc;
 	if (sc == NULL)
 		return;
 	g_eli_destroy(sc, TRUE);
 }
 
 /*
  * BIO_READ:
  *	G_ELI_START -> g_eli_crypto_read -> g_io_request -> g_eli_read_done -> g_eli_crypto_run -> g_eli_crypto_read_done -> g_io_deliver
  * BIO_WRITE:
  *	G_ELI_START -> g_eli_crypto_run -> g_eli_crypto_write_done -> g_io_request -> g_eli_write_done -> g_io_deliver
  */
 static void
 g_eli_start(struct bio *bp)
 {
 	struct g_eli_softc *sc;
 	struct g_consumer *cp;
 	struct bio *cbp;
 
 	sc = bp->bio_to->geom->softc;
 	KASSERT(sc != NULL,
 	    ("Provider's error should be set (error=%d)(device=%s).",
 	    bp->bio_to->error, bp->bio_to->name));
 	G_ELI_LOGREQ(2, bp, "Request received.");
 
 	switch (bp->bio_cmd) {
 	case BIO_READ:
 	case BIO_WRITE:
 	case BIO_GETATTR:
 	case BIO_FLUSH:
 		break;
 	case BIO_DELETE:
 		/*
 		 * We could eventually support BIO_DELETE requests.
 		 * This could be done by overwriting the requested sectors with
 		 * random data g_eli_overwrites number of times.
 		 */
 	default:
 		g_io_deliver(bp, EOPNOTSUPP);
 		return;
 	}
 	cbp = g_clone_bio(bp);
 	if (cbp == NULL) {
 		g_io_deliver(bp, ENOMEM);
 		return;
 	}
 	bp->bio_driver1 = cbp;
 	bp->bio_pflags = G_ELI_NEW_BIO;
 	switch (bp->bio_cmd) {
 	case BIO_READ:
 		if (!(sc->sc_flags & G_ELI_FLAG_AUTH)) {
 			g_eli_crypto_read(sc, bp, 0);
 			break;
 		}
 		/* FALLTHROUGH */
 	case BIO_WRITE:
 		mtx_lock(&sc->sc_queue_mtx);
 		bioq_insert_tail(&sc->sc_queue, bp);
 		mtx_unlock(&sc->sc_queue_mtx);
 		wakeup(sc);
 		break;
 	case BIO_GETATTR:
 	case BIO_FLUSH:
 		cbp->bio_done = g_std_done;
 		cp = LIST_FIRST(&sc->sc_geom->consumer);
 		cbp->bio_to = cp->provider;
 		G_ELI_LOGREQ(2, cbp, "Sending request.");
 		g_io_request(cbp, cp);
 		break;
 	}
 }
 
 static int
 g_eli_newsession(struct g_eli_worker *wr)
 {
 	struct g_eli_softc *sc;
 	struct cryptoini crie, cria;
 	int error;
 
 	sc = wr->w_softc;
 
 	bzero(&crie, sizeof(crie));
 	crie.cri_alg = sc->sc_ealgo;
 	crie.cri_klen = sc->sc_ekeylen;
 	if (sc->sc_ealgo == CRYPTO_AES_XTS)
 		crie.cri_klen <<= 1;
 	if ((sc->sc_flags & G_ELI_FLAG_FIRST_KEY) != 0) {
 		crie.cri_key = g_eli_key_hold(sc, 0,
 		    LIST_FIRST(&sc->sc_geom->consumer)->provider->sectorsize);
 	} else {
 		crie.cri_key = sc->sc_ekey;
 	}
 	if (sc->sc_flags & G_ELI_FLAG_AUTH) {
 		bzero(&cria, sizeof(cria));
 		cria.cri_alg = sc->sc_aalgo;
 		cria.cri_klen = sc->sc_akeylen;
 		cria.cri_key = sc->sc_akey;
 		crie.cri_next = &cria;
 	}
 
 	switch (sc->sc_crypto) {
 	case G_ELI_CRYPTO_SW:
 		error = crypto_newsession(&wr->w_sid, &crie,
 		    CRYPTOCAP_F_SOFTWARE);
 		break;
 	case G_ELI_CRYPTO_HW:
 		error = crypto_newsession(&wr->w_sid, &crie,
 		    CRYPTOCAP_F_HARDWARE);
 		break;
 	case G_ELI_CRYPTO_UNKNOWN:
 		error = crypto_newsession(&wr->w_sid, &crie,
 		    CRYPTOCAP_F_HARDWARE);
 		if (error == 0) {
 			mtx_lock(&sc->sc_queue_mtx);
 			if (sc->sc_crypto == G_ELI_CRYPTO_UNKNOWN)
 				sc->sc_crypto = G_ELI_CRYPTO_HW;
 			mtx_unlock(&sc->sc_queue_mtx);
 		} else {
 			error = crypto_newsession(&wr->w_sid, &crie,
 			    CRYPTOCAP_F_SOFTWARE);
 			mtx_lock(&sc->sc_queue_mtx);
 			if (sc->sc_crypto == G_ELI_CRYPTO_UNKNOWN)
 				sc->sc_crypto = G_ELI_CRYPTO_SW;
 			mtx_unlock(&sc->sc_queue_mtx);
 		}
 		break;
 	default:
 		panic("%s: invalid condition", __func__);
 	}
 
 	if ((sc->sc_flags & G_ELI_FLAG_FIRST_KEY) != 0)
 		g_eli_key_drop(sc, crie.cri_key);
 
 	return (error);
 }
 
 static void
 g_eli_freesession(struct g_eli_worker *wr)
 {
 
 	crypto_freesession(wr->w_sid);
 }
 
 static void
 g_eli_cancel(struct g_eli_softc *sc)
 {
 	struct bio *bp;
 
 	mtx_assert(&sc->sc_queue_mtx, MA_OWNED);
 
 	while ((bp = bioq_takefirst(&sc->sc_queue)) != NULL) {
 		KASSERT(bp->bio_pflags == G_ELI_NEW_BIO,
 		    ("Not new bio when canceling (bp=%p).", bp));
 		g_io_deliver(bp, ENXIO);
 	}
 }
 
 static struct bio *
 g_eli_takefirst(struct g_eli_softc *sc)
 {
 	struct bio *bp;
 
 	mtx_assert(&sc->sc_queue_mtx, MA_OWNED);
 
 	if (!(sc->sc_flags & G_ELI_FLAG_SUSPEND))
 		return (bioq_takefirst(&sc->sc_queue));
 	/*
 	 * Device suspended, so we skip new I/O requests.
 	 */
 	TAILQ_FOREACH(bp, &sc->sc_queue.queue, bio_queue) {
 		if (bp->bio_pflags != G_ELI_NEW_BIO)
 			break;
 	}
 	if (bp != NULL)
 		bioq_remove(&sc->sc_queue, bp);
 	return (bp);
 }
 
 /*
  * This is the main function for kernel worker thread when we don't have
  * hardware acceleration and we have to do cryptography in software.
  * Dedicated thread is needed, so we don't slow down g_up/g_down GEOM
  * threads with crypto work.
  */
 static void
 g_eli_worker(void *arg)
 {
 	struct g_eli_softc *sc;
 	struct g_eli_worker *wr;
 	struct bio *bp;
 	int error;
 
 	wr = arg;
 	sc = wr->w_softc;
 #ifdef SMP
 	/* Before sched_bind() to a CPU, wait for all CPUs to go on-line. */
 	if (mp_ncpus > 1 && sc->sc_crypto == G_ELI_CRYPTO_SW &&
 	    g_eli_threads == 0) {
 		while (!smp_started)
 			tsleep(wr, 0, "geli:smp", hz / 4);
 	}
 #endif
 	thread_lock(curthread);
 	sched_prio(curthread, PUSER);
 	if (sc->sc_crypto == G_ELI_CRYPTO_SW && g_eli_threads == 0)
 		sched_bind(curthread, wr->w_number);
 	thread_unlock(curthread);
 
 	G_ELI_DEBUG(1, "Thread %s started.", curthread->td_proc->p_comm);
 
 	for (;;) {
 		mtx_lock(&sc->sc_queue_mtx);
 again:
 		bp = g_eli_takefirst(sc);
 		if (bp == NULL) {
 			if (sc->sc_flags & G_ELI_FLAG_DESTROY) {
 				g_eli_cancel(sc);
 				LIST_REMOVE(wr, w_next);
 				g_eli_freesession(wr);
 				free(wr, M_ELI);
 				G_ELI_DEBUG(1, "Thread %s exiting.",
 				    curthread->td_proc->p_comm);
 				wakeup(&sc->sc_workers);
 				mtx_unlock(&sc->sc_queue_mtx);
 				kproc_exit(0);
 			}
 			while (sc->sc_flags & G_ELI_FLAG_SUSPEND) {
 				if (sc->sc_inflight > 0) {
 					G_ELI_DEBUG(0, "inflight=%d", sc->sc_inflight);
 					/*
 					 * We still have inflight BIOs, so
 					 * sleep and retry.
 					 */
 					msleep(sc, &sc->sc_queue_mtx, PRIBIO,
 					    "geli:inf", hz / 5);
 					goto again;
 				}
 				/*
 				 * Suspend requested, mark the worker as
 				 * suspended and go to sleep.
 				 */
 				if (wr->w_active) {
 					g_eli_freesession(wr);
 					wr->w_active = FALSE;
 				}
 				wakeup(&sc->sc_workers);
 				msleep(sc, &sc->sc_queue_mtx, PRIBIO,
 				    "geli:suspend", 0);
 				if (!wr->w_active &&
 				    !(sc->sc_flags & G_ELI_FLAG_SUSPEND)) {
 					error = g_eli_newsession(wr);
 					KASSERT(error == 0,
 					    ("g_eli_newsession() failed on resume (error=%d)",
 					    error));
 					wr->w_active = TRUE;
 				}
 				goto again;
 			}
 			msleep(sc, &sc->sc_queue_mtx, PDROP, "geli:w", 0);
 			continue;
 		}
 		if (bp->bio_pflags == G_ELI_NEW_BIO)
 			atomic_add_int(&sc->sc_inflight, 1);
 		mtx_unlock(&sc->sc_queue_mtx);
 		if (bp->bio_pflags == G_ELI_NEW_BIO) {
 			bp->bio_pflags = 0;
 			if (sc->sc_flags & G_ELI_FLAG_AUTH) {
 				if (bp->bio_cmd == BIO_READ)
 					g_eli_auth_read(sc, bp);
 				else
 					g_eli_auth_run(wr, bp);
 			} else {
 				if (bp->bio_cmd == BIO_READ)
 					g_eli_crypto_read(sc, bp, 1);
 				else
 					g_eli_crypto_run(wr, bp);
 			}
 		} else {
 			if (sc->sc_flags & G_ELI_FLAG_AUTH)
 				g_eli_auth_run(wr, bp);
 			else
 				g_eli_crypto_run(wr, bp);
 		}
 	}
 }
 
 /*
  * Here we generate IV. It is unique for every sector.
  */
 void
 g_eli_crypto_ivgen(struct g_eli_softc *sc, off_t offset, u_char *iv,
     size_t size)
 {
 	uint8_t off[8];
 
 	if ((sc->sc_flags & G_ELI_FLAG_NATIVE_BYTE_ORDER) != 0)
 		bcopy(&offset, off, sizeof(off));
 	else
 		le64enc(off, (uint64_t)offset);
 
 	switch (sc->sc_ealgo) {
 	case CRYPTO_AES_XTS:
 		bcopy(off, iv, sizeof(off));
 		bzero(iv + sizeof(off), size - sizeof(off));
 		break;
 	default:
 	    {
 		u_char hash[SHA256_DIGEST_LENGTH];
 		SHA256_CTX ctx;
 
 		/* Copy precalculated SHA256 context for IV-Key. */
 		bcopy(&sc->sc_ivctx, &ctx, sizeof(ctx));
 		SHA256_Update(&ctx, off, sizeof(off));
 		SHA256_Final(hash, &ctx);
 		bcopy(hash, iv, MIN(sizeof(hash), size));
 		break;
 	    }
 	}
 }
 
 int
 g_eli_read_metadata(struct g_class *mp, struct g_provider *pp,
     struct g_eli_metadata *md)
 {
 	struct g_geom *gp;
 	struct g_consumer *cp;
 	u_char *buf = NULL;
 	int error;
 
 	g_topology_assert();
 
 	gp = g_new_geomf(mp, "eli:taste");
 	gp->start = g_eli_start;
 	gp->access = g_std_access;
 	/*
 	 * g_eli_read_metadata() is always called from the event thread.
 	 * Our geom is created and destroyed in the same event, so there
 	 * could be no orphan nor spoil event in the meantime.
 	 */
 	gp->orphan = g_eli_orphan_spoil_assert;
 	gp->spoiled = g_eli_orphan_spoil_assert;
 	cp = g_new_consumer(gp);
 	error = g_attach(cp, pp);
 	if (error != 0)
 		goto end;
 	error = g_access(cp, 1, 0, 0);
 	if (error != 0)
 		goto end;
 	g_topology_unlock();
 	buf = g_read_data(cp, pp->mediasize - pp->sectorsize, pp->sectorsize,
 	    &error);
 	g_topology_lock();
 	if (buf == NULL)
 		goto end;
 	eli_metadata_decode(buf, md);
 end:
 	if (buf != NULL)
 		g_free(buf);
 	if (cp->provider != NULL) {
 		if (cp->acr == 1)
 			g_access(cp, -1, 0, 0);
 		g_detach(cp);
 	}
 	g_destroy_consumer(cp);
 	g_destroy_geom(gp);
 	return (error);
 }
 
 /*
  * The function is called when we had the last close on the provider and the
  * user requested to close it when this situation occurs.
  */
 static void
 g_eli_last_close(struct g_eli_softc *sc)
 {
 	struct g_geom *gp;
 	struct g_provider *pp;
 	char ppname[64];
 	int error;
 
 	g_topology_assert();
 	gp = sc->sc_geom;
 	pp = LIST_FIRST(&gp->provider);
 	strlcpy(ppname, pp->name, sizeof(ppname));
 	error = g_eli_destroy(sc, TRUE);
 	KASSERT(error == 0, ("Cannot detach %s on last close (error=%d).",
 	    ppname, error));
 	G_ELI_DEBUG(0, "Detached %s on last close.", ppname);
 }
 
 int
 g_eli_access(struct g_provider *pp, int dr, int dw, int de)
 {
 	struct g_eli_softc *sc;
 	struct g_geom *gp;
 
 	gp = pp->geom;
 	sc = gp->softc;
 
 	if (dw > 0) {
 		if (sc->sc_flags & G_ELI_FLAG_RO) {
 			/* Deny write attempts. */
 			return (EROFS);
 		}
 		/* Someone is opening us for write, we need to remember that. */
 		sc->sc_flags |= G_ELI_FLAG_WOPEN;
 		return (0);
 	}
 	/* Is this the last close? */
 	if (pp->acr + dr > 0 || pp->acw + dw > 0 || pp->ace + de > 0)
 		return (0);
 
 	/*
 	 * Automatically detach on last close if requested.
 	 */
 	if ((sc->sc_flags & G_ELI_FLAG_RW_DETACH) ||
 	    (sc->sc_flags & G_ELI_FLAG_WOPEN)) {
 		g_eli_last_close(sc);
 	}
 	return (0);
 }
 
 static int
 g_eli_cpu_is_disabled(int cpu)
 {
 #ifdef SMP
-	return ((hlt_cpus_mask & (1 << cpu)) != 0);
+	return (CPU_ISSET(cpu, &hlt_cpus_mask));
 #else
 	return (0);
 #endif
 }
 
 struct g_geom *
 g_eli_create(struct gctl_req *req, struct g_class *mp, struct g_provider *bpp,
     const struct g_eli_metadata *md, const u_char *mkey, int nkey)
 {
 	struct g_eli_softc *sc;
 	struct g_eli_worker *wr;
 	struct g_geom *gp;
 	struct g_provider *pp;
 	struct g_consumer *cp;
 	u_int i, threads;
 	int error;
 
 	G_ELI_DEBUG(1, "Creating device %s%s.", bpp->name, G_ELI_SUFFIX);
 
 	gp = g_new_geomf(mp, "%s%s", bpp->name, G_ELI_SUFFIX);
 	sc = malloc(sizeof(*sc), M_ELI, M_WAITOK | M_ZERO);
 	gp->start = g_eli_start;
 	/*
 	 * Spoiling cannot happen actually, because we keep provider open for
 	 * writing all the time or provider is read-only.
 	 */
 	gp->spoiled = g_eli_orphan_spoil_assert;
 	gp->orphan = g_eli_orphan;
 	gp->dumpconf = g_eli_dumpconf;
 	/*
 	 * If detach-on-last-close feature is not enabled and we don't operate
 	 * on read-only provider, we can simply use g_std_access().
 	 */
 	if (md->md_flags & (G_ELI_FLAG_WO_DETACH | G_ELI_FLAG_RO))
 		gp->access = g_eli_access;
 	else
 		gp->access = g_std_access;
 
 	sc->sc_inflight = 0;
 	sc->sc_crypto = G_ELI_CRYPTO_UNKNOWN;
 	sc->sc_flags = md->md_flags;
 	/* Backward compatibility. */
 	if (md->md_version < 4)
 		sc->sc_flags |= G_ELI_FLAG_NATIVE_BYTE_ORDER;
 	if (md->md_version < 5)
 		sc->sc_flags |= G_ELI_FLAG_SINGLE_KEY;
 	if (md->md_version < 6 && (sc->sc_flags & G_ELI_FLAG_AUTH) != 0)
 		sc->sc_flags |= G_ELI_FLAG_FIRST_KEY;
 	sc->sc_ealgo = md->md_ealgo;
 	sc->sc_nkey = nkey;
 
 	if (sc->sc_flags & G_ELI_FLAG_AUTH) {
 		sc->sc_akeylen = sizeof(sc->sc_akey) * 8;
 		sc->sc_aalgo = md->md_aalgo;
 		sc->sc_alen = g_eli_hashlen(sc->sc_aalgo);
 
 		sc->sc_data_per_sector = bpp->sectorsize - sc->sc_alen;
 		/*
 		 * Some hash functions (like SHA1 and RIPEMD160) generate hashes
 		 * whose length is not a multiple of 128 bits, but we want the data
 		 * length to be a multiple of 128 bits, so we can encrypt without
 		 * padding. The line below rounds down the data length to a multiple
 		 * of 128 bits.
 		 */
 		sc->sc_data_per_sector -= sc->sc_data_per_sector % 16;
 
 		sc->sc_bytes_per_sector =
 		    (md->md_sectorsize - 1) / sc->sc_data_per_sector + 1;
 		sc->sc_bytes_per_sector *= bpp->sectorsize;
 	}
 
 	gp->softc = sc;
 	sc->sc_geom = gp;
 
 	bioq_init(&sc->sc_queue);
 	mtx_init(&sc->sc_queue_mtx, "geli:queue", NULL, MTX_DEF);
 	mtx_init(&sc->sc_ekeys_lock, "geli:ekeys", NULL, MTX_DEF);
 
 	pp = NULL;
 	cp = g_new_consumer(gp);
 	error = g_attach(cp, bpp);
 	if (error != 0) {
 		if (req != NULL) {
 			gctl_error(req, "Cannot attach to %s (error=%d).",
 			    bpp->name, error);
 		} else {
 			G_ELI_DEBUG(1, "Cannot attach to %s (error=%d).",
 			    bpp->name, error);
 		}
 		goto failed;
 	}
 	/*
 	 * Keep provider open all the time, so we can run critical tasks,
 	 * like Master Keys deletion, without wondering if we can open
 	 * provider or not.
 	 * We don't open provider for writing only when user requested read-only
 	 * access.
 	 */
 	if (sc->sc_flags & G_ELI_FLAG_RO)
 		error = g_access(cp, 1, 0, 1);
 	else
 		error = g_access(cp, 1, 1, 1);
 	if (error != 0) {
 		if (req != NULL) {
 			gctl_error(req, "Cannot access %s (error=%d).",
 			    bpp->name, error);
 		} else {
 			G_ELI_DEBUG(1, "Cannot access %s (error=%d).",
 			    bpp->name, error);
 		}
 		goto failed;
 	}
 
 	sc->sc_sectorsize = md->md_sectorsize;
 	sc->sc_mediasize = bpp->mediasize;
 	if (!(sc->sc_flags & G_ELI_FLAG_ONETIME))
 		sc->sc_mediasize -= bpp->sectorsize;
 	if (!(sc->sc_flags & G_ELI_FLAG_AUTH))
 		sc->sc_mediasize -= (sc->sc_mediasize % sc->sc_sectorsize);
 	else {
 		sc->sc_mediasize /= sc->sc_bytes_per_sector;
 		sc->sc_mediasize *= sc->sc_sectorsize;
 	}
 
 	/*
 	 * Remember the keys in our softc structure.
 	 */
 	g_eli_mkey_propagate(sc, mkey);
 	sc->sc_ekeylen = md->md_keylen;
 
 	LIST_INIT(&sc->sc_workers);
 
 	threads = g_eli_threads;
 	if (threads == 0)
 		threads = mp_ncpus;
 	else if (threads > mp_ncpus) {
 		/* There is really no need for too many worker threads. */
 		threads = mp_ncpus;
 		G_ELI_DEBUG(0, "Reducing number of threads to %u.", threads);
 	}
 	for (i = 0; i < threads; i++) {
 		if (g_eli_cpu_is_disabled(i)) {
 			G_ELI_DEBUG(1, "%s: CPU %u disabled, skipping.",
 			    bpp->name, i);
 			continue;
 		}
 		wr = malloc(sizeof(*wr), M_ELI, M_WAITOK | M_ZERO);
 		wr->w_softc = sc;
 		wr->w_number = i;
 		wr->w_active = TRUE;
 
 		error = g_eli_newsession(wr);
 		if (error != 0) {
 			free(wr, M_ELI);
 			if (req != NULL) {
 				gctl_error(req, "Cannot set up crypto session "
 				    "for %s (error=%d).", bpp->name, error);
 			} else {
 				G_ELI_DEBUG(1, "Cannot set up crypto session "
 				    "for %s (error=%d).", bpp->name, error);
 			}
 			goto failed;
 		}
 
 		error = kproc_create(g_eli_worker, wr, &wr->w_proc, 0, 0,
 		    "g_eli[%u] %s", i, bpp->name);
 		if (error != 0) {
 			g_eli_freesession(wr);
 			free(wr, M_ELI);
 			if (req != NULL) {
 				gctl_error(req, "Cannot create kernel thread "
 				    "for %s (error=%d).", bpp->name, error);
 			} else {
 				G_ELI_DEBUG(1, "Cannot create kernel thread "
 				    "for %s (error=%d).", bpp->name, error);
 			}
 			goto failed;
 		}
 		LIST_INSERT_HEAD(&sc->sc_workers, wr, w_next);
 		/* If we have hardware support, one thread is enough. */
 		if (sc->sc_crypto == G_ELI_CRYPTO_HW)
 			break;
 	}
 
 	/*
 	 * Create decrypted provider.
 	 */
 	pp = g_new_providerf(gp, "%s%s", bpp->name, G_ELI_SUFFIX);
 	pp->mediasize = sc->sc_mediasize;
 	pp->sectorsize = sc->sc_sectorsize;
 
 	g_error_provider(pp, 0);
 
 	G_ELI_DEBUG(0, "Device %s created.", pp->name);
 	G_ELI_DEBUG(0, "Encryption: %s %u", g_eli_algo2str(sc->sc_ealgo),
 	    sc->sc_ekeylen);
 	if (sc->sc_flags & G_ELI_FLAG_AUTH)
 		G_ELI_DEBUG(0, " Integrity: %s", g_eli_algo2str(sc->sc_aalgo));
 	G_ELI_DEBUG(0, "    Crypto: %s",
 	    sc->sc_crypto == G_ELI_CRYPTO_SW ? "software" : "hardware");
 	return (gp);
 failed:
 	mtx_lock(&sc->sc_queue_mtx);
 	sc->sc_flags |= G_ELI_FLAG_DESTROY;
 	wakeup(sc);
 	/*
 	 * Wait for the kernel threads to destroy themselves.
 	 */
 	while (!LIST_EMPTY(&sc->sc_workers)) {
 		msleep(&sc->sc_workers, &sc->sc_queue_mtx, PRIBIO,
 		    "geli:destroy", 0);
 	}
 	mtx_destroy(&sc->sc_queue_mtx);
 	if (cp->provider != NULL) {
 		if (cp->acr == 1)
 			g_access(cp, -1, -1, -1);
 		g_detach(cp);
 	}
 	g_destroy_consumer(cp);
 	g_destroy_geom(gp);
 	g_eli_key_destroy(sc);
 	bzero(sc, sizeof(*sc));
 	free(sc, M_ELI);
 	return (NULL);
 }
 
 int
 g_eli_destroy(struct g_eli_softc *sc, boolean_t force)
 {
 	struct g_geom *gp;
 	struct g_provider *pp;
 
 	g_topology_assert();
 
 	if (sc == NULL)
 		return (ENXIO);
 
 	gp = sc->sc_geom;
 	pp = LIST_FIRST(&gp->provider);
 	if (pp != NULL && (pp->acr != 0 || pp->acw != 0 || pp->ace != 0)) {
 		if (force) {
 			G_ELI_DEBUG(1, "Device %s is still open, so it "
 			    "cannot be definitely removed.", pp->name);
 		} else {
 			G_ELI_DEBUG(1,
 			    "Device %s is still open (r%dw%de%d).", pp->name,
 			    pp->acr, pp->acw, pp->ace);
 			return (EBUSY);
 		}
 	}
 
 	mtx_lock(&sc->sc_queue_mtx);
 	sc->sc_flags |= G_ELI_FLAG_DESTROY;
 	wakeup(sc);
 	while (!LIST_EMPTY(&sc->sc_workers)) {
 		msleep(&sc->sc_workers, &sc->sc_queue_mtx, PRIBIO,
 		    "geli:destroy", 0);
 	}
 	mtx_destroy(&sc->sc_queue_mtx);
 	gp->softc = NULL;
 	g_eli_key_destroy(sc);
 	bzero(sc, sizeof(*sc));
 	free(sc, M_ELI);
 
 	if (pp == NULL || (pp->acr == 0 && pp->acw == 0 && pp->ace == 0))
 		G_ELI_DEBUG(0, "Device %s destroyed.", gp->name);
 	g_wither_geom_close(gp, ENXIO);
 
 	return (0);
 }
 
 static int
 g_eli_destroy_geom(struct gctl_req *req __unused,
     struct g_class *mp __unused, struct g_geom *gp)
 {
 	struct g_eli_softc *sc;
 
 	sc = gp->softc;
 	return (g_eli_destroy(sc, FALSE));
 }
 
 static int
 g_eli_keyfiles_load(struct hmac_ctx *ctx, const char *provider)
 {
 	u_char *keyfile, *data;
 	char *file, name[64];
 	size_t size;
 	int i;
 
 	for (i = 0; ; i++) {
 		snprintf(name, sizeof(name), "%s:geli_keyfile%d", provider, i);
 		keyfile = preload_search_by_type(name);
 		if (keyfile == NULL)
 			return (i);	/* Return number of loaded keyfiles. */
 		data = preload_fetch_addr(keyfile);
 		if (data == NULL) {
 			G_ELI_DEBUG(0, "Cannot find key file data for %s.",
 			    name);
 			return (0);
 		}
 		size = preload_fetch_size(keyfile);
 		if (size == 0) {
 			G_ELI_DEBUG(0, "Cannot find key file size for %s.",
 			    name);
 			return (0);
 		}
 		file = preload_search_info(keyfile, MODINFO_NAME);
 		if (file == NULL) {
 			G_ELI_DEBUG(0, "Cannot find key file name for %s.",
 			    name);
 			return (0);
 		}
 		G_ELI_DEBUG(1, "Loaded keyfile %s for %s (type: %s).", file,
 		    provider, name);
 		g_eli_crypto_hmac_update(ctx, data, size);
 	}
 }
 
 static void
 g_eli_keyfiles_clear(const char *provider)
 {
 	u_char *keyfile, *data;
 	char name[64];
 	size_t size;
 	int i;
 
 	for (i = 0; ; i++) {
 		snprintf(name, sizeof(name), "%s:geli_keyfile%d", provider, i);
 		keyfile = preload_search_by_type(name);
 		if (keyfile == NULL)
 			return;
 		data = preload_fetch_addr(keyfile);
 		size = preload_fetch_size(keyfile);
 		if (data != NULL && size != 0)
 			bzero(data, size);
 	}
 }
 
 /*
  * Tasting is only made on boot.
  * We detect providers which should be attached before root is mounted.
  */
 static struct g_geom *
 g_eli_taste(struct g_class *mp, struct g_provider *pp, int flags __unused)
 {
 	struct g_eli_metadata md;
 	struct g_geom *gp;
 	struct hmac_ctx ctx;
 	char passphrase[256];
 	u_char key[G_ELI_USERKEYLEN], mkey[G_ELI_DATAIVKEYLEN];
 	u_int i, nkey, nkeyfiles, tries;
 	int error;
 
 	g_trace(G_T_TOPOLOGY, "%s(%s, %s)", __func__, mp->name, pp->name);
 	g_topology_assert();
 
 	if (root_mounted() || g_eli_tries == 0)
 		return (NULL);
 
 	G_ELI_DEBUG(3, "Tasting %s.", pp->name);
 
 	error = g_eli_read_metadata(mp, pp, &md);
 	if (error != 0)
 		return (NULL);
 	gp = NULL;
 
 	if (strcmp(md.md_magic, G_ELI_MAGIC) != 0)
 		return (NULL);
 	if (md.md_version > G_ELI_VERSION) {
 		printf("geom_eli.ko module is too old to handle %s.\n",
 		    pp->name);
 		return (NULL);
 	}
 	if (md.md_provsize != pp->mediasize)
 		return (NULL);
 	/* Should we attach it on boot? */
 	if (!(md.md_flags & G_ELI_FLAG_BOOT))
 		return (NULL);
 	if (md.md_keys == 0x00) {
 		G_ELI_DEBUG(0, "No valid keys on %s.", pp->name);
 		return (NULL);
 	}
 	if (md.md_iterations == -1) {
 		/* If there is no passphrase, we try only once. */
 		tries = 1;
 	} else {
 		/* Ask for the passphrase no more than g_eli_tries times. */
 		tries = g_eli_tries;
 	}
 
 	for (i = 0; i < tries; i++) {
 		g_eli_crypto_hmac_init(&ctx, NULL, 0);
 
 		/*
 		 * Load all key files.
 		 */
 		nkeyfiles = g_eli_keyfiles_load(&ctx, pp->name);
 
 		if (nkeyfiles == 0 && md.md_iterations == -1) {
 			/*
 			 * No key files and no passphrase, something is
 			 * definitely wrong here.
 			 * geli(8) doesn't allow for such a situation, so assume
 			 * that there was really no passphrase and in that case
 			 * key files are not properly defined in loader.conf.
 			 */
 			G_ELI_DEBUG(0,
 			    "Found no key files in loader.conf for %s.",
 			    pp->name);
 			return (NULL);
 		}
 
 		/* Ask for the passphrase if defined. */
 		if (md.md_iterations >= 0) {
 			printf("Enter passphrase for %s: ", pp->name);
 			gets(passphrase, sizeof(passphrase),
 			    g_eli_visible_passphrase);
 		}
 
 		/*
 		 * Prepare Derived-Key from the user passphrase.
 		 */
 		if (md.md_iterations == 0) {
 			g_eli_crypto_hmac_update(&ctx, md.md_salt,
 			    sizeof(md.md_salt));
 			g_eli_crypto_hmac_update(&ctx, passphrase,
 			    strlen(passphrase));
 			bzero(passphrase, sizeof(passphrase));
 		} else if (md.md_iterations > 0) {
 			u_char dkey[G_ELI_USERKEYLEN];
 
 			pkcs5v2_genkey(dkey, sizeof(dkey), md.md_salt,
 			    sizeof(md.md_salt), passphrase, md.md_iterations);
 			bzero(passphrase, sizeof(passphrase));
 			g_eli_crypto_hmac_update(&ctx, dkey, sizeof(dkey));
 			bzero(dkey, sizeof(dkey));
 		}
 
 		g_eli_crypto_hmac_final(&ctx, key, 0);
 
 		/*
 		 * Decrypt Master-Key.
 		 */
 		error = g_eli_mkey_decrypt(&md, key, mkey, &nkey);
 		bzero(key, sizeof(key));
 		if (error == -1) {
 			if (i == tries - 1) {
 				G_ELI_DEBUG(0,
 				    "Wrong key for %s. No tries left.",
 				    pp->name);
 				g_eli_keyfiles_clear(pp->name);
 				return (NULL);
 			}
 			G_ELI_DEBUG(0, "Wrong key for %s. Tries left: %u.",
 			    pp->name, tries - i - 1);
 			/* Try again. */
 			continue;
 		} else if (error > 0) {
 			G_ELI_DEBUG(0, "Cannot decrypt Master Key for %s (error=%d).",
 			    pp->name, error);
 			g_eli_keyfiles_clear(pp->name);
 			return (NULL);
 		}
 		G_ELI_DEBUG(1, "Using Master Key %u for %s.", nkey, pp->name);
 		break;
 	}
 
 	/*
 	 * We have correct key, let's attach provider.
 	 */
 	gp = g_eli_create(NULL, mp, pp, &md, mkey, nkey);
 	bzero(mkey, sizeof(mkey));
 	bzero(&md, sizeof(md));
 	if (gp == NULL) {
 		G_ELI_DEBUG(0, "Cannot create device %s%s.", pp->name,
 		    G_ELI_SUFFIX);
 		return (NULL);
 	}
 	return (gp);
 }
 
 static void
 g_eli_dumpconf(struct sbuf *sb, const char *indent, struct g_geom *gp,
     struct g_consumer *cp, struct g_provider *pp)
 {
 	struct g_eli_softc *sc;
 
 	g_topology_assert();
 	sc = gp->softc;
 	if (sc == NULL)
 		return;
 	if (pp != NULL || cp != NULL)
 		return;	/* Nothing here. */
 
 	sbuf_printf(sb, "%s<KeysTotal>%ju</KeysTotal>\n", indent,
 	    (uintmax_t)sc->sc_ekeys_total);
 	sbuf_printf(sb, "%s<KeysAllocated>%ju</KeysAllocated>\n", indent,
 	    (uintmax_t)sc->sc_ekeys_allocated);
 	sbuf_printf(sb, "%s<Flags>", indent);
 	if (sc->sc_flags == 0)
 		sbuf_printf(sb, "NONE");
 	else {
 		int first = 1;
 
 #define ADD_FLAG(flag, name)	do {					\
 	if (sc->sc_flags & (flag)) {					\
 		if (!first)						\
 			sbuf_printf(sb, ", ");				\
 		else							\
 			first = 0;					\
 		sbuf_printf(sb, name);					\
 	}								\
 } while (0)
 		ADD_FLAG(G_ELI_FLAG_SUSPEND, "SUSPEND");
 		ADD_FLAG(G_ELI_FLAG_SINGLE_KEY, "SINGLE-KEY");
 		ADD_FLAG(G_ELI_FLAG_NATIVE_BYTE_ORDER, "NATIVE-BYTE-ORDER");
 		ADD_FLAG(G_ELI_FLAG_ONETIME, "ONETIME");
 		ADD_FLAG(G_ELI_FLAG_BOOT, "BOOT");
 		ADD_FLAG(G_ELI_FLAG_WO_DETACH, "W-DETACH");
 		ADD_FLAG(G_ELI_FLAG_RW_DETACH, "RW-DETACH");
 		ADD_FLAG(G_ELI_FLAG_AUTH, "AUTH");
 		ADD_FLAG(G_ELI_FLAG_WOPEN, "W-OPEN");
 		ADD_FLAG(G_ELI_FLAG_DESTROY, "DESTROY");
 		ADD_FLAG(G_ELI_FLAG_RO, "READ-ONLY");
 #undef  ADD_FLAG
 	}
 	sbuf_printf(sb, "</Flags>\n");
 
 	if (!(sc->sc_flags & G_ELI_FLAG_ONETIME)) {
 		sbuf_printf(sb, "%s<UsedKey>%u</UsedKey>\n", indent,
 		    sc->sc_nkey);
 	}
 	sbuf_printf(sb, "%s<Crypto>", indent);
 	switch (sc->sc_crypto) {
 	case G_ELI_CRYPTO_HW:
 		sbuf_printf(sb, "hardware");
 		break;
 	case G_ELI_CRYPTO_SW:
 		sbuf_printf(sb, "software");
 		break;
 	default:
 		sbuf_printf(sb, "UNKNOWN");
 		break;
 	}
 	sbuf_printf(sb, "</Crypto>\n");
 	if (sc->sc_flags & G_ELI_FLAG_AUTH) {
 		sbuf_printf(sb,
 		    "%s<AuthenticationAlgorithm>%s</AuthenticationAlgorithm>\n",
 		    indent, g_eli_algo2str(sc->sc_aalgo));
 	}
 	sbuf_printf(sb, "%s<KeyLength>%u</KeyLength>\n", indent,
 	    sc->sc_ekeylen);
 	sbuf_printf(sb, "%s<EncryptionAlgorithm>%s</EncryptionAlgorithm>\n", indent,
 	    g_eli_algo2str(sc->sc_ealgo));
 	sbuf_printf(sb, "%s<State>%s</State>\n", indent,
 	    (sc->sc_flags & G_ELI_FLAG_SUSPEND) ? "SUSPENDED" : "ACTIVE");
 }
 
 static void
 g_eli_shutdown_pre_sync(void *arg, int howto)
 {
 	struct g_class *mp;
 	struct g_geom *gp, *gp2;
 	struct g_provider *pp;
 	struct g_eli_softc *sc;
 	int error;
 
 	mp = arg;
 	DROP_GIANT();
 	g_topology_lock();
 	LIST_FOREACH_SAFE(gp, &mp->geom, geom, gp2) {
 		sc = gp->softc;
 		if (sc == NULL)
 			continue;
 		pp = LIST_FIRST(&gp->provider);
 		KASSERT(pp != NULL, ("No provider? gp=%p (%s)", gp, gp->name));
 		if (pp->acr + pp->acw + pp->ace == 0)
 			error = g_eli_destroy(sc, TRUE);
 		else {
 			sc->sc_flags |= G_ELI_FLAG_RW_DETACH;
 			gp->access = g_eli_access;
 		}
 	}
 	g_topology_unlock();
 	PICKUP_GIANT();
 }
 
 static void
 g_eli_init(struct g_class *mp)
 {
 
 	g_eli_pre_sync = EVENTHANDLER_REGISTER(shutdown_pre_sync,
 	    g_eli_shutdown_pre_sync, mp, SHUTDOWN_PRI_FIRST);
 	if (g_eli_pre_sync == NULL)
 		G_ELI_DEBUG(0, "Warning! Cannot register shutdown event.");
 }
 
 static void
 g_eli_fini(struct g_class *mp)
 {
 
 	if (g_eli_pre_sync != NULL)
 		EVENTHANDLER_DEREGISTER(shutdown_pre_sync, g_eli_pre_sync);
 }
 
 DECLARE_GEOM_CLASS(g_eli_class, g_eli);
 MODULE_DEPEND(g_eli, crypto, 1, 1, 1);
Index: head/sys/i386/i386/intr_machdep.c
===================================================================
--- head/sys/i386/i386/intr_machdep.c	(revision 222812)
+++ head/sys/i386/i386/intr_machdep.c	(revision 222813)
@@ -1,526 +1,528 @@
 /*-
  * Copyright (c) 2003 John Baldwin <jhb@FreeBSD.org>
  * All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
  * 1. Redistributions of source code must retain the above copyright
  *    notice, this list of conditions and the following disclaimer.
  * 2. Redistributions in binary form must reproduce the above copyright
  *    notice, this list of conditions and the following disclaimer in the
  *    documentation and/or other materials provided with the distribution.
  * 3. Neither the name of the author nor the names of any co-contributors
  *    may be used to endorse or promote products derived from this software
  *    without specific prior written permission.
  *
  * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
  * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
  * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
  * ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
  * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
  * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
  * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
  * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
  * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
  * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
  * SUCH DAMAGE.
  *
  * $FreeBSD$
  */
 
 /*
  * Machine dependent interrupt code for i386.  For the i386, we have to
  * deal with different PICs.  Thus, we use the passed in vector to lookup
  * an interrupt source associated with that vector.  The interrupt source
  * describes which PIC the source belongs to and includes methods to handle
  * that source.
  */
 
 #include "opt_ddb.h"
 
 #include <sys/param.h>
 #include <sys/bus.h>
 #include <sys/interrupt.h>
 #include <sys/ktr.h>
 #include <sys/kernel.h>
 #include <sys/lock.h>
 #include <sys/mutex.h>
 #include <sys/proc.h>
 #include <sys/smp.h>
 #include <sys/syslog.h>
 #include <sys/systm.h>
 #include <machine/clock.h>
 #include <machine/intr_machdep.h>
 #include <machine/smp.h>
 #ifdef DDB
 #include <ddb/ddb.h>
 #endif
 
 #define	MAX_STRAY_LOG	5
 
 typedef void (*mask_fn)(void *);
 
 static int intrcnt_index;
 static struct intsrc *interrupt_sources[NUM_IO_INTS];
 static struct mtx intr_table_lock;
 static struct mtx intrcnt_lock;
 static STAILQ_HEAD(, pic) pics;
 
 #ifdef SMP
 static int assign_cpu;
 #endif
 
 static int	intr_assign_cpu(void *arg, u_char cpu);
 static void	intr_disable_src(void *arg);
 static void	intr_init(void *__dummy);
 static int	intr_pic_registered(struct pic *pic);
 static void	intrcnt_setname(const char *name, int index);
 static void	intrcnt_updatename(struct intsrc *is);
 static void	intrcnt_register(struct intsrc *is);
 
 static int
 intr_pic_registered(struct pic *pic)
 {
 	struct pic *p;
 
 	STAILQ_FOREACH(p, &pics, pics) {
 		if (p == pic)
 			return (1);
 	}
 	return (0);
 }
 
 /*
  * Register a new interrupt controller (PIC).  This is to support suspend
  * and resume where we suspend/resume controllers rather than individual
  * sources.  This also allows controllers with no active sources (such as
  * 8259As in a system using the APICs) to participate in suspend and resume.
  */
 int
 intr_register_pic(struct pic *pic)
 {
 	int error;
 
 	mtx_lock(&intr_table_lock);
 	if (intr_pic_registered(pic))
 		error = EBUSY;
 	else {
 		STAILQ_INSERT_TAIL(&pics, pic, pics);
 		error = 0;
 	}
 	mtx_unlock(&intr_table_lock);
 	return (error);
 }
 
 /*
  * Register a new interrupt source with the global interrupt system.
  * The global interrupts need to be disabled when this function is
  * called.
  */
 int
 intr_register_source(struct intsrc *isrc)
 {
 	int error, vector;
 
 	KASSERT(intr_pic_registered(isrc->is_pic), ("unregistered PIC"));
 	vector = isrc->is_pic->pic_vector(isrc);
 	if (interrupt_sources[vector] != NULL)
 		return (EEXIST);
 	error = intr_event_create(&isrc->is_event, isrc, 0, vector,
 	    intr_disable_src, (mask_fn)isrc->is_pic->pic_enable_source,
 	    (mask_fn)isrc->is_pic->pic_eoi_source, intr_assign_cpu, "irq%d:",
 	    vector);
 	if (error)
 		return (error);
 	mtx_lock(&intr_table_lock);
 	if (interrupt_sources[vector] != NULL) {
 		mtx_unlock(&intr_table_lock);
 		intr_event_destroy(isrc->is_event);
 		return (EEXIST);
 	}
 	intrcnt_register(isrc);
 	interrupt_sources[vector] = isrc;
 	isrc->is_handlers = 0;
 	mtx_unlock(&intr_table_lock);
 	return (0);
 }
 
 struct intsrc *
 intr_lookup_source(int vector)
 {
 
 	return (interrupt_sources[vector]);
 }
 
 int
 intr_add_handler(const char *name, int vector, driver_filter_t filter,
     driver_intr_t handler, void *arg, enum intr_type flags, void **cookiep)
 {
 	struct intsrc *isrc;
 	int error;
 
 	isrc = intr_lookup_source(vector);
 	if (isrc == NULL)
 		return (EINVAL);
 	error = intr_event_add_handler(isrc->is_event, name, filter, handler,
 	    arg, intr_priority(flags), flags, cookiep);
 	if (error == 0) {
 		mtx_lock(&intr_table_lock);
 		intrcnt_updatename(isrc);
 		isrc->is_handlers++;
 		if (isrc->is_handlers == 1) {
 			isrc->is_pic->pic_enable_intr(isrc);
 			isrc->is_pic->pic_enable_source(isrc);
 		}
 		mtx_unlock(&intr_table_lock);
 	}
 	return (error);
 }
 
 int
 intr_remove_handler(void *cookie)
 {
 	struct intsrc *isrc;
 	int error;
 
 	isrc = intr_handler_source(cookie);
 	error = intr_event_remove_handler(cookie);
 	if (error == 0) {
 		mtx_lock(&intr_table_lock);
 		isrc->is_handlers--;
 		if (isrc->is_handlers == 0) {
 			isrc->is_pic->pic_disable_source(isrc, PIC_NO_EOI);
 			isrc->is_pic->pic_disable_intr(isrc);
 		}
 		intrcnt_updatename(isrc);
 		mtx_unlock(&intr_table_lock);
 	}
 	return (error);
 }
 
 int
 intr_config_intr(int vector, enum intr_trigger trig, enum intr_polarity pol)
 {
 	struct intsrc *isrc;
 
 	isrc = intr_lookup_source(vector);
 	if (isrc == NULL)
 		return (EINVAL);
 	return (isrc->is_pic->pic_config_intr(isrc, trig, pol));
 }
 
 static void
 intr_disable_src(void *arg)
 {
 	struct intsrc *isrc;
 
 	isrc = arg;
 	isrc->is_pic->pic_disable_source(isrc, PIC_EOI);
 }
 
 void
 intr_execute_handlers(struct intsrc *isrc, struct trapframe *frame)
 {
 	struct intr_event *ie;
 	int vector;
 
 	/*
 	 * We count software interrupts when we process them.  The
 	 * code here follows previous practice, but there's an
 	 * argument for counting hardware interrupts when they're
 	 * processed too.
 	 */
 	(*isrc->is_count)++;
 	PCPU_INC(cnt.v_intr);
 
 	ie = isrc->is_event;
 
 	/*
 	 * XXX: We assume that IRQ 0 is only used for the ISA timer
 	 * device (clk).
 	 */
 	vector = isrc->is_pic->pic_vector(isrc);
 	if (vector == 0)
 		clkintr_pending = 1;
 
 	/*
 	 * For stray interrupts, mask and EOI the source, bump the
 	 * stray count, and log the condition.
 	 */
 	if (intr_event_handle(ie, frame) != 0) {
 		isrc->is_pic->pic_disable_source(isrc, PIC_EOI);
 		(*isrc->is_straycount)++;
 		if (*isrc->is_straycount < MAX_STRAY_LOG)
 			log(LOG_ERR, "stray irq%d\n", vector);
 		else if (*isrc->is_straycount == MAX_STRAY_LOG)
 			log(LOG_CRIT,
 			    "too many stray irq %d's: not logging anymore\n",
 			    vector);
 	}
 }
 
 void
 intr_resume(void)
 {
 	struct pic *pic;
 
 	mtx_lock(&intr_table_lock);
 	STAILQ_FOREACH(pic, &pics, pics) {
 		if (pic->pic_resume != NULL)
 			pic->pic_resume(pic);
 	}
 	mtx_unlock(&intr_table_lock);
 }
 
 void
 intr_suspend(void)
 {
 	struct pic *pic;
 
 	mtx_lock(&intr_table_lock);
 	STAILQ_FOREACH(pic, &pics, pics) {
 		if (pic->pic_suspend != NULL)
 			pic->pic_suspend(pic);
 	}
 	mtx_unlock(&intr_table_lock);
 }
 
 static int
 intr_assign_cpu(void *arg, u_char cpu)
 {
 #ifdef SMP
 	struct intsrc *isrc;
 	int error;
 
 	/*
 	 * Don't do anything during early boot.  We will pick up the
 	 * assignment once the APs are started.
 	 */
 	if (assign_cpu && cpu != NOCPU) {
 		isrc = arg;
 		mtx_lock(&intr_table_lock);
 		error = isrc->is_pic->pic_assign_cpu(isrc, cpu_apic_ids[cpu]);
 		mtx_unlock(&intr_table_lock);
 	} else
 		error = 0;
 	return (error);
 #else
 	return (EOPNOTSUPP);
 #endif
 }
 
 static void
 intrcnt_setname(const char *name, int index)
 {
 
 	snprintf(intrnames + (MAXCOMLEN + 1) * index, MAXCOMLEN + 1, "%-*s",
 	    MAXCOMLEN, name);
 }
 
 static void
 intrcnt_updatename(struct intsrc *is)
 {
 
 	intrcnt_setname(is->is_event->ie_fullname, is->is_index);
 }
 
 static void
 intrcnt_register(struct intsrc *is)
 {
 	char straystr[MAXCOMLEN + 1];
 
 	KASSERT(is->is_event != NULL, ("%s: isrc with no event", __func__));
 	mtx_lock_spin(&intrcnt_lock);
 	is->is_index = intrcnt_index;
 	intrcnt_index += 2;
 	snprintf(straystr, MAXCOMLEN + 1, "stray irq%d",
 	    is->is_pic->pic_vector(is));
 	intrcnt_updatename(is);
 	is->is_count = &intrcnt[is->is_index];
 	intrcnt_setname(straystr, is->is_index + 1);
 	is->is_straycount = &intrcnt[is->is_index + 1];
 	mtx_unlock_spin(&intrcnt_lock);
 }
 
 void
 intrcnt_add(const char *name, u_long **countp)
 {
 
 	mtx_lock_spin(&intrcnt_lock);
 	*countp = &intrcnt[intrcnt_index];
 	intrcnt_setname(name, intrcnt_index);
 	intrcnt_index++;
 	mtx_unlock_spin(&intrcnt_lock);
 }
 
 static void
 intr_init(void *dummy __unused)
 {
 
 	intrcnt_setname("???", 0);
 	intrcnt_index = 1;
 	STAILQ_INIT(&pics);
 	mtx_init(&intr_table_lock, "intr sources", NULL, MTX_DEF);
 	mtx_init(&intrcnt_lock, "intrcnt", NULL, MTX_SPIN);
 }
 SYSINIT(intr_init, SI_SUB_INTR, SI_ORDER_FIRST, intr_init, NULL);
 
 /* Add a description to an active interrupt handler. */
 int
 intr_describe(u_int vector, void *ih, const char *descr)
 {
 	struct intsrc *isrc;
 	int error;
 
 	isrc = intr_lookup_source(vector);
 	if (isrc == NULL)
 		return (EINVAL);
 	error = intr_event_describe_handler(isrc->is_event, ih, descr);
 	if (error)
 		return (error);
 	intrcnt_updatename(isrc);
 	return (0);
 }
 
 #ifdef DDB
 /*
  * Dump data about interrupt handlers
  */
 DB_SHOW_COMMAND(irqs, db_show_irqs)
 {
 	struct intsrc **isrc;
 	int i, verbose;
 
 	if (strcmp(modif, "v") == 0)
 		verbose = 1;
 	else
 		verbose = 0;
 	isrc = interrupt_sources;
 	for (i = 0; i < NUM_IO_INTS && !db_pager_quit; i++, isrc++)
 		if (*isrc != NULL)
 			db_dump_intr_event((*isrc)->is_event, verbose);
 }
 #endif
 
 #ifdef SMP
 /*
  * Support for balancing interrupt sources across CPUs.  For now we just
  * allocate CPUs round-robin.
  */
 
-/* The BSP is always a valid target. */
-static cpumask_t intr_cpus = (1 << 0);
+static cpuset_t intr_cpus;
 static int current_cpu;
 
 /*
  * Return the CPU that the next interrupt source should use.  For now
  * this just returns the next local APIC according to round-robin.
  */
 u_int
 intr_next_cpu(void)
 {
 	u_int apic_id;
 
 	/* Leave all interrupts on the BSP during boot. */
 	if (!assign_cpu)
 		return (PCPU_GET(apic_id));
 
 	mtx_lock_spin(&icu_lock);
 	apic_id = cpu_apic_ids[current_cpu];
 	do {
 		current_cpu++;
 		if (current_cpu > mp_maxid)
 			current_cpu = 0;
-	} while (!(intr_cpus & (1 << current_cpu)));
+	} while (!CPU_ISSET(current_cpu, &intr_cpus));
 	mtx_unlock_spin(&icu_lock);
 	return (apic_id);
 }
 
 /* Attempt to bind the specified IRQ to the specified CPU. */
 int
 intr_bind(u_int vector, u_char cpu)
 {
 	struct intsrc *isrc;
 
 	isrc = intr_lookup_source(vector);
 	if (isrc == NULL)
 		return (EINVAL);
 	return (intr_event_bind(isrc->is_event, cpu));
 }
 
 /*
  * Add a CPU to our mask of valid CPUs that can be destinations of
  * interrupts.
  */
 void
 intr_add_cpu(u_int cpu)
 {
 
 	if (cpu >= MAXCPU)
 		panic("%s: Invalid CPU ID", __func__);
 	if (bootverbose)
 		printf("INTR: Adding local APIC %d as a target\n",
 		    cpu_apic_ids[cpu]);
 
-	intr_cpus |= (1 << cpu);
+	CPU_SET(cpu, &intr_cpus);
 }
 
 /*
  * Distribute all the interrupt sources among the available CPUs once the
  * AP's have been launched.
  */
 static void
 intr_shuffle_irqs(void *arg __unused)
 {
 	struct intsrc *isrc;
 	int i;
 
 #ifdef XEN
 	/*
 	 * Doesn't work yet
 	 */
 	return;
 #endif
+
+	/* The BSP is always a valid target. */
+	CPU_SETOF(0, &intr_cpus);
 
 	/* Don't bother on UP. */
 	if (mp_ncpus == 1)
 		return;
 
 	/* Round-robin assign a CPU to each enabled source. */
 	mtx_lock(&intr_table_lock);
 	assign_cpu = 1;
 	for (i = 0; i < NUM_IO_INTS; i++) {
 		isrc = interrupt_sources[i];
 		if (isrc != NULL && isrc->is_handlers > 0) {
 			/*
 			 * If this event is already bound to a CPU,
 			 * then assign the source to that CPU instead
 			 * of picking one via round-robin.  Note that
 			 * this is careful to only advance the
 			 * round-robin if the CPU assignment succeeds.
 			 */
 			if (isrc->is_event->ie_cpu != NOCPU)
 				(void)isrc->is_pic->pic_assign_cpu(isrc,
 				    cpu_apic_ids[isrc->is_event->ie_cpu]);
 			else if (isrc->is_pic->pic_assign_cpu(isrc,
 				cpu_apic_ids[current_cpu]) == 0)
 				(void)intr_next_cpu();
 
 		}
 	}
 	mtx_unlock(&intr_table_lock);
 }
 SYSINIT(intr_shuffle_irqs, SI_SUB_SMP, SI_ORDER_SECOND, intr_shuffle_irqs,
     NULL);
 #else
 /*
  * Always route interrupts to the current processor in the UP case.
  */
 u_int
 intr_next_cpu(void)
 {
 
 	return (PCPU_GET(apic_id));
 }
 #endif
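Note on the intr_machdep.c change above: the hunks replace shift-and-mask
operations on a single integer (cpumask_t) with set macros on cpuset_t
(CPU_SET, CPU_ISSET, CPU_SETOF).  The following is only a userland sketch of
that idea, not the kernel implementation.  Every demo_* name and DEMO_MAXCPU
are hypothetical stand-ins invented for illustration, and demo_cpu_setof()
assumes that CPU_SETOF clears the set before setting one bit; consult
sys/cpuset.h for the authoritative definitions.

```c
#include <limits.h>
#include <string.h>

#define DEMO_MAXCPU	64	/* hypothetical CPU limit for this sketch */
#define DEMO_BITS	(sizeof(unsigned long) * CHAR_BIT)
#define DEMO_WORDS	((DEMO_MAXCPU + DEMO_BITS - 1) / DEMO_BITS)

/* Stand-in for cpuset_t: an array of words instead of one integer. */
typedef struct {
	unsigned long bits[DEMO_WORDS];
} demo_cpuset_t;

/* Remove every CPU from the set. */
static void
demo_cpu_zero(demo_cpuset_t *s)
{

	memset(s, 0, sizeof(*s));
}

/* Add one CPU; replaces the old "mask |= (1 << cpu)" idiom. */
static void
demo_cpu_set(unsigned cpu, demo_cpuset_t *s)
{

	s->bits[cpu / DEMO_BITS] |= 1UL << (cpu % DEMO_BITS);
}

/* Test one CPU; replaces the old "mask & (1 << cpu)" idiom. */
static int
demo_cpu_isset(unsigned cpu, const demo_cpuset_t *s)
{

	return (((s->bits[cpu / DEMO_BITS] >> (cpu % DEMO_BITS)) & 1UL) != 0);
}

/* Assumed CPU_SETOF behaviour: empty the set, then set exactly one CPU. */
static void
demo_cpu_setof(unsigned cpu, demo_cpuset_t *s)
{

	demo_cpu_zero(s);
	demo_cpu_set(cpu, s);
}

/*
 * Round-robin step in the style of intr_next_cpu(): advance past 'cur',
 * wrapping at 'maxid', until a CPU in the set is found.  The set must
 * contain at least one CPU in [0, maxid] or the loop never terminates.
 */
static unsigned
demo_next_cpu(unsigned cur, unsigned maxid, const demo_cpuset_t *s)
{

	do {
		cur++;
		if (cur > maxid)
			cur = 0;
	} while (!demo_cpu_isset(cur, s));
	return (cur);
}
```

Because the set is an array of words, the same calls keep working when the
CPU limit exceeds the width of one integer, which a plain (1 << cpu) mask
cannot represent.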
Index: head/sys/i386/i386/mp_machdep.c
===================================================================
--- head/sys/i386/i386/mp_machdep.c	(revision 222812)
+++ head/sys/i386/i386/mp_machdep.c	(revision 222813)
@@ -1,1718 +1,1736 @@
 /*-
  * Copyright (c) 1996, by Steve Passe
  * All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
  * 1. Redistributions of source code must retain the above copyright
  *    notice, this list of conditions and the following disclaimer.
  * 2. The name of the developer may NOT be used to endorse or promote products
  *    derived from this software without specific prior written permission.
  *
  * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
  * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
  * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
  * ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
  * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
  * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
  * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
  * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
  * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
  * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
  * SUCH DAMAGE.
  */
 
 #include <sys/cdefs.h>
 __FBSDID("$FreeBSD$");
 
 #include "opt_apic.h"
 #include "opt_cpu.h"
 #include "opt_kstack_pages.h"
 #include "opt_mp_watchdog.h"
 #include "opt_pmap.h"
 #include "opt_sched.h"
 #include "opt_smp.h"
 
 #if !defined(lint)
 #if !defined(SMP)
 #error How did you get here?
 #endif
 
 #ifndef DEV_APIC
 #error The apic device is required for SMP, add "device apic" to your config file.
 #endif
 #if defined(CPU_DISABLE_CMPXCHG) && !defined(COMPILING_LINT)
 #error SMP not supported with CPU_DISABLE_CMPXCHG
 #endif
 #endif /* not lint */
 
 #include <sys/param.h>
 #include <sys/systm.h>
 #include <sys/bus.h>
 #include <sys/cons.h>	/* cngetc() */
+#include <sys/cpuset.h>
 #ifdef GPROF 
 #include <sys/gmon.h>
 #endif
 #include <sys/kernel.h>
 #include <sys/ktr.h>
 #include <sys/lock.h>
 #include <sys/malloc.h>
 #include <sys/memrange.h>
 #include <sys/mutex.h>
 #include <sys/pcpu.h>
 #include <sys/proc.h>
 #include <sys/sched.h>
 #include <sys/smp.h>
 #include <sys/sysctl.h>
 
 #include <vm/vm.h>
 #include <vm/vm_param.h>
 #include <vm/pmap.h>
 #include <vm/vm_kern.h>
 #include <vm/vm_extern.h>
 
 #include <x86/apicreg.h>
 #include <machine/clock.h>
 #include <machine/cputypes.h>
 #include <x86/mca.h>
 #include <machine/md_var.h>
 #include <machine/mp_watchdog.h>
 #include <machine/pcb.h>
 #include <machine/psl.h>
 #include <machine/smp.h>
 #include <machine/specialreg.h>
 
 #define WARMBOOT_TARGET		0
 #define WARMBOOT_OFF		(KERNBASE + 0x0467)
 #define WARMBOOT_SEG		(KERNBASE + 0x0469)
 
 #define CMOS_REG		(0x70)
 #define CMOS_DATA		(0x71)
 #define BIOS_RESET		(0x0f)
 #define BIOS_WARM		(0x0a)
 
 /*
  * this code MUST be enabled here and in mpboot.s.
  * it follows the very early stages of AP boot by placing values in CMOS ram.
  * it NORMALLY will never be needed and thus the primitive method for enabling.
  *
 #define CHECK_POINTS
  */
 
 #if defined(CHECK_POINTS) && !defined(PC98)
 #define CHECK_READ(A)	 (outb(CMOS_REG, (A)), inb(CMOS_DATA))
 #define CHECK_WRITE(A,D) (outb(CMOS_REG, (A)), outb(CMOS_DATA, (D)))
 
 #define CHECK_INIT(D);				\
 	CHECK_WRITE(0x34, (D));			\
 	CHECK_WRITE(0x35, (D));			\
 	CHECK_WRITE(0x36, (D));			\
 	CHECK_WRITE(0x37, (D));			\
 	CHECK_WRITE(0x38, (D));			\
 	CHECK_WRITE(0x39, (D));
 
 #define CHECK_PRINT(S);				\
 	printf("%s: %d, %d, %d, %d, %d, %d\n",	\
 	   (S),					\
 	   CHECK_READ(0x34),			\
 	   CHECK_READ(0x35),			\
 	   CHECK_READ(0x36),			\
 	   CHECK_READ(0x37),			\
 	   CHECK_READ(0x38),			\
 	   CHECK_READ(0x39));
 
 #else				/* CHECK_POINTS */
 
 #define CHECK_INIT(D)
 #define CHECK_PRINT(S)
 #define CHECK_WRITE(A, D)
 
 #endif				/* CHECK_POINTS */
 
 /* lock region used by kernel profiling */
 int	mcount_lock;
 
 int	mp_naps;		/* # of Application processors */
 int	boot_cpu_id = -1;	/* designated BSP */
 
 extern	struct pcpu __pcpu[];
 
 /* AP uses this during bootstrap.  Do not staticize.  */
 char *bootSTK;
 static int bootAP;
 
 /* Free these after use */
 void *bootstacks[MAXCPU];
 static void *dpcpu;
 
 /* Hotwire a 0->4MB V==P mapping */
 extern pt_entry_t *KPTphys;
 
 struct pcb stoppcbs[MAXCPU];
 
 /* Variables needed for SMP tlb shootdown. */
 vm_offset_t smp_tlb_addr1;
 vm_offset_t smp_tlb_addr2;
 volatile int smp_tlb_wait;
 
 #ifdef COUNT_IPIS
 /* Interrupt counts. */
 static u_long *ipi_preempt_counts[MAXCPU];
 static u_long *ipi_ast_counts[MAXCPU];
 u_long *ipi_invltlb_counts[MAXCPU];
 u_long *ipi_invlrng_counts[MAXCPU];
 u_long *ipi_invlpg_counts[MAXCPU];
 u_long *ipi_invlcache_counts[MAXCPU];
 u_long *ipi_rendezvous_counts[MAXCPU];
 u_long *ipi_lazypmap_counts[MAXCPU];
 static u_long *ipi_hardclock_counts[MAXCPU];
 #endif
 
 /*
  * Local data and functions.
  */
 
-static volatile cpumask_t ipi_nmi_pending;
+static volatile cpuset_t ipi_nmi_pending;
 
 /* used to hold the AP's until we are ready to release them */
 static struct mtx ap_boot_mtx;
 
 /* Set to 1 once we're ready to let the APs out of the pen. */
 static volatile int aps_ready = 0;
 
 /*
  * Store data from cpu_add() until later in the boot when we actually setup
  * the APs.
  */
 struct cpu_info {
 	int	cpu_present:1;
 	int	cpu_bsp:1;
 	int	cpu_disabled:1;
 	int	cpu_hyperthread:1;
 } static cpu_info[MAX_APIC_ID + 1];
 int cpu_apic_ids[MAXCPU];
 int apic_cpuids[MAX_APIC_ID + 1];
 
 /* Holds pending bitmap based IPIs per CPU */
 static volatile u_int cpu_ipi_pending[MAXCPU];
 
 static u_int boot_address;
 static int cpu_logical;			/* logical cpus per core */
 static int cpu_cores;			/* cores per package */
 
 static void	assign_cpu_ids(void);
 static void	install_ap_tramp(void);
 static void	set_interrupt_apic_ids(void);
 static int	start_all_aps(void);
 static int	start_ap(int apic_id);
 static void	release_aps(void *dummy);
 
 static int	hlt_logical_cpus;
 static u_int	hyperthreading_cpus;	/* logical cpus sharing L1 cache */
-static cpumask_t	hyperthreading_cpus_mask;
+static cpuset_t	hyperthreading_cpus_mask;
 static int	hyperthreading_allowed = 1;
 static struct	sysctl_ctx_list logical_cpu_clist;
 
 static void
 mem_range_AP_init(void)
 {
 	if (mem_range_softc.mr_op && mem_range_softc.mr_op->initAP)
 		mem_range_softc.mr_op->initAP(&mem_range_softc);
 }
 
 static void
 topo_probe_amd(void)
 {
 	int core_id_bits;
 	int id;
 
 	/* AMD processors do not support HTT. */
 	cpu_logical = 1;
 
 	if ((amd_feature2 & AMDID2_CMP) == 0) {
 		cpu_cores = 1;
 		return;
 	}
 
 	core_id_bits = (cpu_procinfo2 & AMDID_COREID_SIZE) >>
 	    AMDID_COREID_SIZE_SHIFT;
 	if (core_id_bits == 0) {
 		cpu_cores = (cpu_procinfo2 & AMDID_CMP_CORES) + 1;
 		return;
 	}
 
 	/* Fam 10h and newer should get here. */
 	for (id = 0; id <= MAX_APIC_ID; id++) {
 		/* Check logical CPU availability. */
 		if (!cpu_info[id].cpu_present || cpu_info[id].cpu_disabled)
 			continue;
 		/* Check if logical CPU has the same package ID. */
 		if ((id >> core_id_bits) != (boot_cpu_id >> core_id_bits))
 			continue;
 		cpu_cores++;
 	}
 }
 
 /*
  * Round up to the next power of two, if necessary, and then
  * take log2.
  * Returns -1 if argument is zero.
  */
 static __inline int
 mask_width(u_int x)
 {
 
 	return (fls(x << (1 - powerof2(x))) - 1);
 }
 
 static void
 topo_probe_0x4(void)
 {
 	u_int p[4];
 	int pkg_id_bits;
 	int core_id_bits;
 	int max_cores;
 	int max_logical;
 	int id;
 
 	/* Both zero and one here mean one logical processor per package. */
 	max_logical = (cpu_feature & CPUID_HTT) != 0 ?
 	    (cpu_procinfo & CPUID_HTT_CORES) >> 16 : 1;
 	if (max_logical <= 1)
 		return;
 
 	/*
 	 * Because of uniformity assumption we examine only
 	 * those logical processors that belong to the same
 	 * package as BSP.  Further, we count number of
 	 * logical processors that belong to the same core
 	 * as BSP thus deducing number of threads per core.
 	 */
 	if (cpu_high >= 0x4) {
 		cpuid_count(0x04, 0, p);
 		max_cores = ((p[0] >> 26) & 0x3f) + 1;
 	} else
 		max_cores = 1;
 	core_id_bits = mask_width(max_logical/max_cores);
 	if (core_id_bits < 0)
 		return;
 	pkg_id_bits = core_id_bits + mask_width(max_cores);
 
 	for (id = 0; id <= MAX_APIC_ID; id++) {
 		/* Check logical CPU availability. */
 		if (!cpu_info[id].cpu_present || cpu_info[id].cpu_disabled)
 			continue;
 		/* Check if logical CPU has the same package ID. */
 		if ((id >> pkg_id_bits) != (boot_cpu_id >> pkg_id_bits))
 			continue;
 		cpu_cores++;
 		/* Check if logical CPU has the same package and core IDs. */
 		if ((id >> core_id_bits) == (boot_cpu_id >> core_id_bits))
 			cpu_logical++;
 	}
 
 	KASSERT(cpu_cores >= 1 && cpu_logical >= 1,
 	    ("topo_probe_0x4 couldn't find BSP"));
 
 	cpu_cores /= cpu_logical;
 	hyperthreading_cpus = cpu_logical;
 }
 
 static void
 topo_probe_0xb(void)
 {
 	u_int p[4];
 	int bits;
 	int cnt;
 	int i;
 	int logical;
 	int type;
 	int x;
 
 	/* We only support three levels for now. */
 	for (i = 0; i < 3; i++) {
 		cpuid_count(0x0b, i, p);
 
 		/* Fall back if CPU leaf 11 doesn't really exist. */
 		if (i == 0 && p[1] == 0) {
 			topo_probe_0x4();
 			return;
 		}
 
 		bits = p[0] & 0x1f;
 		logical = p[1] &= 0xffff;
 		type = (p[2] >> 8) & 0xff;
 		if (type == 0 || logical == 0)
 			break;
 		/*
 		 * Because of uniformity assumption we examine only
 		 * those logical processors that belong to the same
 		 * package as BSP.
 		 */
 		for (cnt = 0, x = 0; x <= MAX_APIC_ID; x++) {
 			if (!cpu_info[x].cpu_present ||
 			    cpu_info[x].cpu_disabled)
 				continue;
 			if (x >> bits == boot_cpu_id >> bits)
 				cnt++;
 		}
 		if (type == CPUID_TYPE_SMT)
 			cpu_logical = cnt;
 		else if (type == CPUID_TYPE_CORE)
 			cpu_cores = cnt;
 	}
 	if (cpu_logical == 0)
 		cpu_logical = 1;
 	cpu_cores /= cpu_logical;
 }
 
 /*
  * Both topology discovery code and code that consumes topology
  * information assume top-down uniformity of the topology.
  * That is, all physical packages must be identical and each
  * core in a package must have the same number of threads.
  * Topology information is queried only on BSP, on which this
  * code runs and for which it can query CPUID information.
  * Then topology is extrapolated on all packages using the
  * uniformity assumption.
  */
 static void
 topo_probe(void)
 {
 	static int cpu_topo_probed = 0;
 
 	if (cpu_topo_probed)
 		return;
 
-	logical_cpus_mask = 0;
+	CPU_ZERO(&logical_cpus_mask);
 	if (mp_ncpus <= 1)
 		cpu_cores = cpu_logical = 1;
 	else if (cpu_vendor_id == CPU_VENDOR_AMD)
 		topo_probe_amd();
 	else if (cpu_vendor_id == CPU_VENDOR_INTEL) {
 		/*
 		 * See Intel(R) 64 Architecture Processor
 		 * Topology Enumeration article for details.
 		 *
 		 * Note that 0x1 <= cpu_high < 4 case should be
 		 * compatible with topo_probe_0x4() logic when
 		 * CPUID.1:EBX[23:16] > 0 (cpu_cores will be 1)
 		 * or it should trigger the fallback otherwise.
 		 */
 		if (cpu_high >= 0xb)
 			topo_probe_0xb();
 		else if (cpu_high >= 0x1)
 			topo_probe_0x4();
 	}
 
 	/*
 	 * Fallback: assume each logical CPU is in separate
 	 * physical package.  That is, no multi-core, no SMT.
 	 */
 	if (cpu_cores == 0 || cpu_logical == 0)
 		cpu_cores = cpu_logical = 1;
 	cpu_topo_probed = 1;
 }
 
 struct cpu_group *
 cpu_topo(void)
 {
 	int cg_flags;
 
 	/*
 	 * Determine whether any threading flags are
 	 * necessry.
 	 */
 	topo_probe();
 	if (cpu_logical > 1 && hyperthreading_cpus)
 		cg_flags = CG_FLAG_HTT;
 	else if (cpu_logical > 1)
 		cg_flags = CG_FLAG_SMT;
 	else
 		cg_flags = 0;
 	if (mp_ncpus % (cpu_cores * cpu_logical) != 0) {
 		printf("WARNING: Non-uniform processors.\n");
 		printf("WARNING: Using suboptimal topology.\n");
 		return (smp_topo_none());
 	}
 	/*
 	 * No multi-core or hyper-threaded.
 	 */
 	if (cpu_logical * cpu_cores == 1)
 		return (smp_topo_none());
 	/*
 	 * Only HTT no multi-core.
 	 */
 	if (cpu_logical > 1 && cpu_cores == 1)
 		return (smp_topo_1level(CG_SHARE_L1, cpu_logical, cg_flags));
 	/*
 	 * Only multi-core no HTT.
 	 */
 	if (cpu_cores > 1 && cpu_logical == 1)
 		return (smp_topo_1level(CG_SHARE_L2, cpu_cores, cg_flags));
 	/*
 	 * Both HTT and multi-core.
 	 */
 	return (smp_topo_2level(CG_SHARE_L2, cpu_cores,
 	    CG_SHARE_L1, cpu_logical, cg_flags));
 }
 
 
 /*
  * Calculate usable address in base memory for AP trampoline code.
  */
 u_int
 mp_bootaddress(u_int basemem)
 {
 
 	boot_address = trunc_page(basemem);	/* round down to 4k boundary */
 	if ((basemem - boot_address) < bootMP_size)
 		boot_address -= PAGE_SIZE;	/* not enough, lower by 4k */
 
 	return boot_address;
 }
 
 void
 cpu_add(u_int apic_id, char boot_cpu)
 {
 
 	if (apic_id > MAX_APIC_ID) {
 		panic("SMP: APIC ID %d too high", apic_id);
 		return;
 	}
 	KASSERT(cpu_info[apic_id].cpu_present == 0, ("CPU %d added twice",
 	    apic_id));
 	cpu_info[apic_id].cpu_present = 1;
 	if (boot_cpu) {
 		KASSERT(boot_cpu_id == -1,
 		    ("CPU %d claims to be BSP, but CPU %d already is", apic_id,
 		    boot_cpu_id));
 		boot_cpu_id = apic_id;
 		cpu_info[apic_id].cpu_bsp = 1;
 	}
 	if (mp_ncpus < MAXCPU) {
 		mp_ncpus++;
 		mp_maxid = mp_ncpus - 1;
 	}
 	if (bootverbose)
 		printf("SMP: Added CPU %d (%s)\n", apic_id, boot_cpu ? "BSP" :
 		    "AP");
 }
 
 void
 cpu_mp_setmaxid(void)
 {
 
 	/*
 	 * mp_maxid should be already set by calls to cpu_add().
 	 * Just sanity check its value here.
 	 */
 	if (mp_ncpus == 0)
 		KASSERT(mp_maxid == 0,
 		    ("%s: mp_ncpus is zero, but mp_maxid is not", __func__));
 	else if (mp_ncpus == 1)
 		mp_maxid = 0;
 	else
 		KASSERT(mp_maxid >= mp_ncpus - 1,
 		    ("%s: counters out of sync: max %d, count %d", __func__,
 			mp_maxid, mp_ncpus));
 }
 
 int
 cpu_mp_probe(void)
 {
 
 	/*
 	 * Always record BSP in CPU map so that the mbuf init code works
 	 * correctly.
 	 */
-	all_cpus = 1;
+	CPU_SETOF(0, &all_cpus);
 	if (mp_ncpus == 0) {
 		/*
 		 * No CPUs were found, so this must be a UP system.  Setup
 		 * the variables to represent a system with a single CPU
 		 * with an id of 0.
 		 */
 		mp_ncpus = 1;
 		return (0);
 	}
 
 	/* At least one CPU was found. */
 	if (mp_ncpus == 1) {
 		/*
 		 * One CPU was found, so this must be a UP system with
 		 * an I/O APIC.
 		 */
 		mp_maxid = 0;
 		return (0);
 	}
 
 	/* At least two CPUs were found. */
 	return (1);
 }
 
 /*
  * Initialize the IPI handlers and start up the AP's.
  */
 void
 cpu_mp_start(void)
 {
 	int i;
 
 	/* Initialize the logical ID to APIC ID table. */
 	for (i = 0; i < MAXCPU; i++) {
 		cpu_apic_ids[i] = -1;
 		cpu_ipi_pending[i] = 0;
 	}
 
 	/* Install an inter-CPU IPI for TLB invalidation */
 	setidt(IPI_INVLTLB, IDTVEC(invltlb),
 	       SDT_SYS386IGT, SEL_KPL, GSEL(GCODE_SEL, SEL_KPL));
 	setidt(IPI_INVLPG, IDTVEC(invlpg),
 	       SDT_SYS386IGT, SEL_KPL, GSEL(GCODE_SEL, SEL_KPL));
 	setidt(IPI_INVLRNG, IDTVEC(invlrng),
 	       SDT_SYS386IGT, SEL_KPL, GSEL(GCODE_SEL, SEL_KPL));
 
 	/* Install an inter-CPU IPI for cache invalidation. */
 	setidt(IPI_INVLCACHE, IDTVEC(invlcache),
 	       SDT_SYS386IGT, SEL_KPL, GSEL(GCODE_SEL, SEL_KPL));
 
 	/* Install an inter-CPU IPI for lazy pmap release */
 	setidt(IPI_LAZYPMAP, IDTVEC(lazypmap),
 	       SDT_SYS386IGT, SEL_KPL, GSEL(GCODE_SEL, SEL_KPL));
 
 	/* Install an inter-CPU IPI for all-CPU rendezvous */
 	setidt(IPI_RENDEZVOUS, IDTVEC(rendezvous),
 	       SDT_SYS386IGT, SEL_KPL, GSEL(GCODE_SEL, SEL_KPL));
 
 	/* Install generic inter-CPU IPI handler */
 	setidt(IPI_BITMAP_VECTOR, IDTVEC(ipi_intr_bitmap_handler),
 	       SDT_SYS386IGT, SEL_KPL, GSEL(GCODE_SEL, SEL_KPL));
 
 	/* Install an inter-CPU IPI for CPU stop/restart */
 	setidt(IPI_STOP, IDTVEC(cpustop),
 	       SDT_SYS386IGT, SEL_KPL, GSEL(GCODE_SEL, SEL_KPL));
 
 
 	/* Set boot_cpu_id if needed. */
 	if (boot_cpu_id == -1) {
 		boot_cpu_id = PCPU_GET(apic_id);
 		cpu_info[boot_cpu_id].cpu_bsp = 1;
 	} else
 		KASSERT(boot_cpu_id == PCPU_GET(apic_id),
 		    ("BSP's APIC ID doesn't match boot_cpu_id"));
 
 	/* Probe logical/physical core configuration. */
 	topo_probe();
 
 	assign_cpu_ids();
 
 	/* Start each Application Processor */
 	start_all_aps();
 
 	set_interrupt_apic_ids();
 }
 
 
 /*
  * Print various information about the SMP system hardware and setup.
  */
 void
 cpu_mp_announce(void)
 {
 	const char *hyperthread;
 	int i;
 
 	printf("FreeBSD/SMP: %d package(s) x %d core(s)",
 	    mp_ncpus / (cpu_cores * cpu_logical), cpu_cores);
 	if (hyperthreading_cpus > 1)
 	    printf(" x %d HTT threads", cpu_logical);
 	else if (cpu_logical > 1)
 	    printf(" x %d SMT threads", cpu_logical);
 	printf("\n");
 
 	/* List active CPUs first. */
 	printf(" cpu0 (BSP): APIC ID: %2d\n", boot_cpu_id);
 	for (i = 1; i < mp_ncpus; i++) {
 		if (cpu_info[cpu_apic_ids[i]].cpu_hyperthread)
 			hyperthread = "/HT";
 		else
 			hyperthread = "";
 		printf(" cpu%d (AP%s): APIC ID: %2d\n", i, hyperthread,
 		    cpu_apic_ids[i]);
 	}
 
 	/* List disabled CPUs last. */
 	for (i = 0; i <= MAX_APIC_ID; i++) {
 		if (!cpu_info[i].cpu_present || !cpu_info[i].cpu_disabled)
 			continue;
 		if (cpu_info[i].cpu_hyperthread)
 			hyperthread = "/HT";
 		else
 			hyperthread = "";
 		printf("  cpu (AP%s): APIC ID: %2d (disabled)\n", hyperthread,
 		    i);
 	}
 }
 
 /*
  * AP CPU's call this to initialize themselves.
  */
 void
 init_secondary(void)
 {
+	cpuset_t tcpuset, tallcpus;
 	struct pcpu *pc;
 	vm_offset_t addr;
 	int	gsel_tss;
 	int	x, myid;
 	u_int	cr0;
 
 	/* bootAP is set in start_ap() to our ID. */
 	myid = bootAP;
 
 	/* Get per-cpu data */
 	pc = &__pcpu[myid];
 
 	/* prime data page for it to use */
 	pcpu_init(pc, myid, sizeof(struct pcpu));
 	dpcpu_init(dpcpu, myid);
 	pc->pc_apic_id = cpu_apic_ids[myid];
 	pc->pc_prvspace = pc;
 	pc->pc_curthread = 0;
 
 	gdt_segs[GPRIV_SEL].ssd_base = (int) pc;
 	gdt_segs[GPROC0_SEL].ssd_base = (int) &pc->pc_common_tss;
 
 	for (x = 0; x < NGDT; x++) {
 		ssdtosd(&gdt_segs[x], &gdt[myid * NGDT + x].sd);
 	}
 
 	r_gdt.rd_limit = NGDT * sizeof(gdt[0]) - 1;
 	r_gdt.rd_base = (int) &gdt[myid * NGDT];
 	lgdt(&r_gdt);			/* does magic intra-segment return */
 
 	lidt(&r_idt);
 
 	lldt(_default_ldt);
 	PCPU_SET(currentldt, _default_ldt);
 
 	gsel_tss = GSEL(GPROC0_SEL, SEL_KPL);
 	gdt[myid * NGDT + GPROC0_SEL].sd.sd_type = SDT_SYS386TSS;
 	PCPU_SET(common_tss.tss_esp0, 0); /* not used until after switch */
 	PCPU_SET(common_tss.tss_ss0, GSEL(GDATA_SEL, SEL_KPL));
 	PCPU_SET(common_tss.tss_ioopt, (sizeof (struct i386tss)) << 16);
 	PCPU_SET(tss_gdt, &gdt[myid * NGDT + GPROC0_SEL].sd);
 	PCPU_SET(common_tssd, *PCPU_GET(tss_gdt));
 	ltr(gsel_tss);
 
 	PCPU_SET(fsgs_gdt, &gdt[myid * NGDT + GUFS_SEL].sd);
 
 	/*
 	 * Set to a known state:
 	 * Set by mpboot.s: CR0_PG, CR0_PE
 	 * Set by cpu_setregs: CR0_NE, CR0_MP, CR0_TS, CR0_WP, CR0_AM
 	 */
 	cr0 = rcr0();
 	cr0 &= ~(CR0_CD | CR0_NW | CR0_EM);
 	load_cr0(cr0);
 	CHECK_WRITE(0x38, 5);
 	
 	/* Disable local APIC just to be sure. */
 	lapic_disable();
 
 	/* signal our startup to the BSP. */
 	mp_naps++;
 	CHECK_WRITE(0x39, 6);
 
 	/* Spin until the BSP releases the AP's. */
 	while (!aps_ready)
 		ia32_pause();
 
 	/* BSP may have changed PTD while we were waiting */
 	invltlb();
 	for (addr = 0; addr < NKPT * NBPDR - 1; addr += PAGE_SIZE)
 		invlpg(addr);
 
 #if defined(I586_CPU) && !defined(NO_F00F_HACK)
 	lidt(&r_idt);
 #endif
 
 	/* Initialize the PAT MSR if present. */
 	pmap_init_pat();
 
 	/* set up CPU registers and state */
 	cpu_setregs();
 
 	/* set up FPU state on the AP */
 	npxinit();
 
 	/* set up SSE registers */
 	enable_sse();
 
 #ifdef PAE
 	/* Enable the PTE no-execute bit. */
 	if ((amd_feature & AMDID_NX) != 0) {
 		uint64_t msr;
 
 		msr = rdmsr(MSR_EFER) | EFER_NXE;
 		wrmsr(MSR_EFER, msr);
 	}
 #endif
 
 	/* A quick check from sanity claus */
 	if (PCPU_GET(apic_id) != lapic_id()) {
 		printf("SMP: cpuid = %d\n", PCPU_GET(cpuid));
 		printf("SMP: actual apic_id = %d\n", lapic_id());
 		printf("SMP: correct apic_id = %d\n", PCPU_GET(apic_id));
 		panic("cpuid mismatch! boom!!");
 	}
 
 	/* Initialize curthread. */
 	KASSERT(PCPU_GET(idlethread) != NULL, ("no idle thread"));
 	PCPU_SET(curthread, PCPU_GET(idlethread));
 
 	mca_init();
 
 	mtx_lock_spin(&ap_boot_mtx);
 
 	/* Init local apic for irq's */
 	lapic_setup(1);
 
 	/* Set memory range attributes for this CPU to match the BSP */
 	mem_range_AP_init();
 
 	smp_cpus++;
 
 	CTR1(KTR_SMP, "SMP: AP CPU #%d Launched", PCPU_GET(cpuid));
 	printf("SMP: AP CPU #%d Launched!\n", PCPU_GET(cpuid));
+	tcpuset = PCPU_GET(cpumask);
 
 	/* Determine if we are a logical CPU. */
 	/* XXX Calculation depends on cpu_logical being a power of 2, e.g. 2 */
 	if (cpu_logical > 1 && PCPU_GET(apic_id) % cpu_logical != 0)
-		logical_cpus_mask |= PCPU_GET(cpumask);
+		CPU_OR(&logical_cpus_mask, &tcpuset);
 	
 	/* Determine if we are a hyperthread. */
 	if (hyperthreading_cpus > 1 &&
 	    PCPU_GET(apic_id) % hyperthreading_cpus != 0)
-		hyperthreading_cpus_mask |= PCPU_GET(cpumask);
+		CPU_OR(&hyperthreading_cpus_mask, &tcpuset);
 
 	/* Build our map of 'other' CPUs. */
-	PCPU_SET(other_cpus, all_cpus & ~PCPU_GET(cpumask));
+	tallcpus = all_cpus;
+	CPU_NAND(&tallcpus, &tcpuset);
+	PCPU_SET(other_cpus, tallcpus);
 
 	if (bootverbose)
 		lapic_dump("AP");
 
 	if (smp_cpus == mp_ncpus) {
 		/* enable IPI's, tlb shootdown, freezes etc */
 		atomic_store_rel_int(&smp_started, 1);
 		smp_active = 1;	 /* historic */
 	}
 
 	mtx_unlock_spin(&ap_boot_mtx);
 
 	/* Wait until all the AP's are up. */
 	while (smp_started == 0)
 		ia32_pause();
 
 	/* Start per-CPU event timers. */
 	cpu_initclocks_ap();
 
 	/* Enter the scheduler. */
 	sched_throw(NULL);
 
 	panic("scheduler returned us to %s", __func__);
 	/* NOTREACHED */
 }
 
 /*******************************************************************
  * local functions and data
  */
 
 /*
  * We tell the I/O APIC code about all the CPUs we want to receive
  * interrupts.  If we don't want certain CPUs to receive IRQs we
  * can simply not tell the I/O APIC code about them in this function.
  * We also do not tell it about the BSP since it tells itself about
  * the BSP internally to work with UP kernels and on UP machines.
  */
 static void
 set_interrupt_apic_ids(void)
 {
 	u_int i, apic_id;
 
 	for (i = 0; i < MAXCPU; i++) {
 		apic_id = cpu_apic_ids[i];
 		if (apic_id == -1)
 			continue;
 		if (cpu_info[apic_id].cpu_bsp)
 			continue;
 		if (cpu_info[apic_id].cpu_disabled)
 			continue;
 
 		/* Don't let hyperthreads service interrupts. */
 		if (hyperthreading_cpus > 1 &&
 		    apic_id % hyperthreading_cpus != 0)
 			continue;
 
 		intr_add_cpu(i);
 	}
 }
 
 /*
  * Assign logical CPU IDs to local APICs.
  */
 static void
 assign_cpu_ids(void)
 {
 	u_int i;
 
 	TUNABLE_INT_FETCH("machdep.hyperthreading_allowed",
 	    &hyperthreading_allowed);
 
 	/* Check for explicitly disabled CPUs. */
 	for (i = 0; i <= MAX_APIC_ID; i++) {
 		if (!cpu_info[i].cpu_present || cpu_info[i].cpu_bsp)
 			continue;
 
 		if (hyperthreading_cpus > 1 && i % hyperthreading_cpus != 0) {
 			cpu_info[i].cpu_hyperthread = 1;
 #if defined(SCHED_ULE)
 			/*
 			 * Don't use HT CPU if it has been disabled by a
 			 * tunable.
 			 */
 			if (hyperthreading_allowed == 0) {
 				cpu_info[i].cpu_disabled = 1;
 				continue;
 			}
 #endif
 		}
 
 		/* Don't use this CPU if it has been disabled by a tunable. */
 		if (resource_disabled("lapic", i)) {
 			cpu_info[i].cpu_disabled = 1;
 			continue;
 		}
 	}
 
 	/*
 	 * Assign CPU IDs to local APIC IDs and disable any CPUs
 	 * beyond MAXCPU.  CPU 0 is always assigned to the BSP.
 	 *
 	 * To minimize confusion for userland, we attempt to number
 	 * CPUs such that all threads and cores in a package are
 	 * grouped together.  For now we assume that the BSP is always
 	 * the first thread in a package and just start adding APs
 	 * starting with the BSP's APIC ID.
 	 */
 	mp_ncpus = 1;
 	cpu_apic_ids[0] = boot_cpu_id;
 	apic_cpuids[boot_cpu_id] = 0;
 	for (i = boot_cpu_id + 1; i != boot_cpu_id;
 	     i == MAX_APIC_ID ? i = 0 : i++) {
 		if (!cpu_info[i].cpu_present || cpu_info[i].cpu_bsp ||
 		    cpu_info[i].cpu_disabled)
 			continue;
 
 		if (mp_ncpus < MAXCPU) {
 			cpu_apic_ids[mp_ncpus] = i;
 			apic_cpuids[i] = mp_ncpus;
 			mp_ncpus++;
 		} else
 			cpu_info[i].cpu_disabled = 1;
 	}
 	KASSERT(mp_maxid >= mp_ncpus - 1,
 	    ("%s: counters out of sync: max %d, count %d", __func__, mp_maxid,
 	    mp_ncpus));		
 }
 
 /*
  * start each AP in our list
  */
 /* Lowest 1MB is already mapped: don't touch*/
 #define TMPMAP_START 1
 static int
 start_all_aps(void)
 {
+	cpuset_t tallcpus;
 #ifndef PC98
 	u_char mpbiosreason;
 #endif
 	uintptr_t kptbase;
 	u_int32_t mpbioswarmvec;
 	int apic_id, cpu, i;
 
 	mtx_init(&ap_boot_mtx, "ap boot", NULL, MTX_SPIN);
 
 	/* install the AP 1st level boot code */
 	install_ap_tramp();
 
 	/* save the current value of the warm-start vector */
 	mpbioswarmvec = *((u_int32_t *) WARMBOOT_OFF);
 #ifndef PC98
 	outb(CMOS_REG, BIOS_RESET);
 	mpbiosreason = inb(CMOS_DATA);
 #endif
 
 	/* set up temporary P==V mapping for AP boot */
 	/* XXX this is a hack, we should boot the AP on its own stack/PTD */
 
 	kptbase = (uintptr_t)(void *)KPTphys;
 	for (i = TMPMAP_START; i < NKPT; i++)
 		PTD[i] = (pd_entry_t)(PG_V | PG_RW |
 		    ((kptbase + i * PAGE_SIZE) & PG_FRAME));
 	invltlb();
 
 	/* start each AP */
 	for (cpu = 1; cpu < mp_ncpus; cpu++) {
 		apic_id = cpu_apic_ids[cpu];
 
 		/* allocate and set up a boot stack data page */
 		bootstacks[cpu] =
 		    (char *)kmem_alloc(kernel_map, KSTACK_PAGES * PAGE_SIZE);
 		dpcpu = (void *)kmem_alloc(kernel_map, DPCPU_SIZE);
 		/* setup a vector to our boot code */
 		*((volatile u_short *) WARMBOOT_OFF) = WARMBOOT_TARGET;
 		*((volatile u_short *) WARMBOOT_SEG) = (boot_address >> 4);
 #ifndef PC98
 		outb(CMOS_REG, BIOS_RESET);
 		outb(CMOS_DATA, BIOS_WARM);	/* 'warm-start' */
 #endif
 
 		bootSTK = (char *)bootstacks[cpu] + KSTACK_PAGES * PAGE_SIZE - 4;
 		bootAP = cpu;
 
 		/* attempt to start the Application Processor */
 		CHECK_INIT(99);	/* setup checkpoints */
 		if (!start_ap(apic_id)) {
 			printf("AP #%d (PHY# %d) failed!\n", cpu, apic_id);
 			CHECK_PRINT("trace");	/* show checkpoints */
 			/* better panic as the AP may be running loose */
 			printf("panic y/n? [y] ");
 			if (cngetc() != 'n')
 				panic("bye-bye");
 		}
 		CHECK_PRINT("trace");		/* show checkpoints */
 
-		all_cpus |= (1 << cpu);		/* record AP in CPU map */
+		CPU_SET(cpu, &all_cpus);	/* record AP in CPU map */
 	}
 
 	/* build our map of 'other' CPUs */
-	PCPU_SET(other_cpus, all_cpus & ~PCPU_GET(cpumask));
+	tallcpus = all_cpus;
+	CPU_NAND(&tallcpus, PCPU_PTR(cpumask));
+	PCPU_SET(other_cpus, tallcpus);
 
 	/* restore the warmstart vector */
 	*(u_int32_t *) WARMBOOT_OFF = mpbioswarmvec;
 
 #ifndef PC98
 	outb(CMOS_REG, BIOS_RESET);
 	outb(CMOS_DATA, mpbiosreason);
 #endif
 
 	/* Undo V==P hack from above */
 	for (i = TMPMAP_START; i < NKPT; i++)
 		PTD[i] = 0;
 	pmap_invalidate_range(kernel_pmap, 0, NKPT * NBPDR - 1);
 
 	/* number of APs actually started */
 	return mp_naps;
 }
 
 /*
  * load the 1st level AP boot code into base memory.
  */
 
 /* targets for relocation */
 extern void bigJump(void);
 extern void bootCodeSeg(void);
 extern void bootDataSeg(void);
 extern void MPentry(void);
 extern u_int MP_GDT;
 extern u_int mp_gdtbase;
 
 static void
 install_ap_tramp(void)
 {
 	int     x;
 	int     size = *(int *) ((u_long) & bootMP_size);
 	vm_offset_t va = boot_address + KERNBASE;
 	u_char *src = (u_char *) ((u_long) bootMP);
 	u_char *dst = (u_char *) va;
 	u_int   boot_base = (u_int) bootMP;
 	u_int8_t *dst8;
 	u_int16_t *dst16;
 	u_int32_t *dst32;
 
 	KASSERT (size <= PAGE_SIZE,
 	    ("'size' do not fit into PAGE_SIZE, as expected."));
 	pmap_kenter(va, boot_address);
 	pmap_invalidate_page (kernel_pmap, va);
 	for (x = 0; x < size; ++x)
 		*dst++ = *src++;
 
 	/*
 	 * modify addresses in code we just moved to basemem. unfortunately we
 	 * need fairly detailed info about mpboot.s for this to work.  changes
 	 * to mpboot.s might require changes here.
 	 */
 
 	/* boot code is located in KERNEL space */
 	dst = (u_char *) va;
 
 	/* modify the lgdt arg */
 	dst32 = (u_int32_t *) (dst + ((u_int) & mp_gdtbase - boot_base));
 	*dst32 = boot_address + ((u_int) & MP_GDT - boot_base);
 
 	/* modify the ljmp target for MPentry() */
 	dst32 = (u_int32_t *) (dst + ((u_int) bigJump - boot_base) + 1);
 	*dst32 = ((u_int) MPentry - KERNBASE);
 
 	/* modify the target for boot code segment */
 	dst16 = (u_int16_t *) (dst + ((u_int) bootCodeSeg - boot_base));
 	dst8 = (u_int8_t *) (dst16 + 1);
 	*dst16 = (u_int) boot_address & 0xffff;
 	*dst8 = ((u_int) boot_address >> 16) & 0xff;
 
 	/* modify the target for boot data segment */
 	dst16 = (u_int16_t *) (dst + ((u_int) bootDataSeg - boot_base));
 	dst8 = (u_int8_t *) (dst16 + 1);
 	*dst16 = (u_int) boot_address & 0xffff;
 	*dst8 = ((u_int) boot_address >> 16) & 0xff;
 }
 
 /*
  * This function starts the AP (application processor) identified
  * by the APIC ID 'physicalCpu'.  It does quite a "song and dance"
  * to accomplish this.  This is necessary because of the nuances
  * of the different hardware we might encounter.  It isn't pretty,
  * but it seems to work.
  */
 static int
 start_ap(int apic_id)
 {
 	int vector, ms;
 	int cpus;
 
 	/* calculate the vector */
 	vector = (boot_address >> 12) & 0xff;
 
 	/* used as a watchpoint to signal AP startup */
 	cpus = mp_naps;
 
 	/*
 	 * first we do an INIT/RESET IPI this INIT IPI might be run, reseting
 	 * and running the target CPU. OR this INIT IPI might be latched (P5
 	 * bug), CPU waiting for STARTUP IPI. OR this INIT IPI might be
 	 * ignored.
 	 */
 
 	/* do an INIT IPI: assert RESET */
 	lapic_ipi_raw(APIC_DEST_DESTFLD | APIC_TRIGMOD_EDGE |
 	    APIC_LEVEL_ASSERT | APIC_DESTMODE_PHY | APIC_DELMODE_INIT, apic_id);
 
 	/* wait for pending status end */
 	lapic_ipi_wait(-1);
 
 	/* do an INIT IPI: deassert RESET */
 	lapic_ipi_raw(APIC_DEST_ALLESELF | APIC_TRIGMOD_LEVEL |
 	    APIC_LEVEL_DEASSERT | APIC_DESTMODE_PHY | APIC_DELMODE_INIT, 0);
 
 	/* wait for pending status end */
 	DELAY(10000);		/* wait ~10mS */
 	lapic_ipi_wait(-1);
 
 	/*
 	 * next we do a STARTUP IPI: the previous INIT IPI might still be
 	 * latched, (P5 bug) this 1st STARTUP would then terminate
 	 * immediately, and the previously started INIT IPI would continue. OR
 	 * the previous INIT IPI has already run. and this STARTUP IPI will
 	 * run. OR the previous INIT IPI was ignored. and this STARTUP IPI
 	 * will run.
 	 */
 
 	/* do a STARTUP IPI */
 	lapic_ipi_raw(APIC_DEST_DESTFLD | APIC_TRIGMOD_EDGE |
 	    APIC_LEVEL_DEASSERT | APIC_DESTMODE_PHY | APIC_DELMODE_STARTUP |
 	    vector, apic_id);
 	lapic_ipi_wait(-1);
 	DELAY(200);		/* wait ~200uS */
 
 	/*
 	 * finally we do a 2nd STARTUP IPI: this 2nd STARTUP IPI should run IF
 	 * the previous STARTUP IPI was cancelled by a latched INIT IPI. OR
 	 * this STARTUP IPI will be ignored, as only ONE STARTUP IPI is
 	 * recognized after hardware RESET or INIT IPI.
 	 */
 
 	lapic_ipi_raw(APIC_DEST_DESTFLD | APIC_TRIGMOD_EDGE |
 	    APIC_LEVEL_DEASSERT | APIC_DESTMODE_PHY | APIC_DELMODE_STARTUP |
 	    vector, apic_id);
 	lapic_ipi_wait(-1);
 	DELAY(200);		/* wait ~200uS */
 
 	/* Wait up to 5 seconds for it to start. */
 	for (ms = 0; ms < 5000; ms++) {
 		if (mp_naps > cpus)
 			return 1;	/* return SUCCESS */
 		DELAY(1000);
 	}
 	return 0;		/* return FAILURE */
 }
 
 #ifdef COUNT_XINVLTLB_HITS
 u_int xhits_gbl[MAXCPU];
 u_int xhits_pg[MAXCPU];
 u_int xhits_rng[MAXCPU];
 SYSCTL_NODE(_debug, OID_AUTO, xhits, CTLFLAG_RW, 0, "");
 SYSCTL_OPAQUE(_debug_xhits, OID_AUTO, global, CTLFLAG_RW, &xhits_gbl,
     sizeof(xhits_gbl), "IU", "");
 SYSCTL_OPAQUE(_debug_xhits, OID_AUTO, page, CTLFLAG_RW, &xhits_pg,
     sizeof(xhits_pg), "IU", "");
 SYSCTL_OPAQUE(_debug_xhits, OID_AUTO, range, CTLFLAG_RW, &xhits_rng,
     sizeof(xhits_rng), "IU", "");
 
 u_int ipi_global;
 u_int ipi_page;
 u_int ipi_range;
 u_int ipi_range_size;
 SYSCTL_INT(_debug_xhits, OID_AUTO, ipi_global, CTLFLAG_RW, &ipi_global, 0, "");
 SYSCTL_INT(_debug_xhits, OID_AUTO, ipi_page, CTLFLAG_RW, &ipi_page, 0, "");
 SYSCTL_INT(_debug_xhits, OID_AUTO, ipi_range, CTLFLAG_RW, &ipi_range, 0, "");
 SYSCTL_INT(_debug_xhits, OID_AUTO, ipi_range_size, CTLFLAG_RW, &ipi_range_size,
     0, "");
 
 u_int ipi_masked_global;
 u_int ipi_masked_page;
 u_int ipi_masked_range;
 u_int ipi_masked_range_size;
 SYSCTL_INT(_debug_xhits, OID_AUTO, ipi_masked_global, CTLFLAG_RW,
     &ipi_masked_global, 0, "");
 SYSCTL_INT(_debug_xhits, OID_AUTO, ipi_masked_page, CTLFLAG_RW,
     &ipi_masked_page, 0, "");
 SYSCTL_INT(_debug_xhits, OID_AUTO, ipi_masked_range, CTLFLAG_RW,
     &ipi_masked_range, 0, "");
 SYSCTL_INT(_debug_xhits, OID_AUTO, ipi_masked_range_size, CTLFLAG_RW,
     &ipi_masked_range_size, 0, "");
 #endif /* COUNT_XINVLTLB_HITS */
 
 /*
+ * Send an IPI to the specified CPU, handling the bitmap logic.
+ */
+static void
+ipi_send_cpu(int cpu, u_int ipi)
+{
+	u_int bitmap, old_pending, new_pending;
+
+	KASSERT(cpu_apic_ids[cpu] != -1, ("IPI to non-existent CPU %d", cpu));
+
+	if (IPI_IS_BITMAPED(ipi)) {
+		bitmap = 1 << ipi;
+		ipi = IPI_BITMAP_VECTOR;
+		do {
+			old_pending = cpu_ipi_pending[cpu];
+			new_pending = old_pending | bitmap;
+		} while (!atomic_cmpset_int(&cpu_ipi_pending[cpu],
+		    old_pending, new_pending));
+		if (old_pending)
+			return;
+	}
+	lapic_ipi_vectored(ipi, cpu_apic_ids[cpu]);
+}
+
+/*
  * Flush the TLB on all other CPU's
  */
 static void
 smp_tlb_shootdown(u_int vector, vm_offset_t addr1, vm_offset_t addr2)
 {
 	u_int ncpu;
 
 	ncpu = mp_ncpus - 1;	/* does not shootdown self */
 	if (ncpu < 1)
 		return;		/* no other cpus */
 	if (!(read_eflags() & PSL_I))
 		panic("%s: interrupts disabled", __func__);
 	mtx_lock_spin(&smp_ipi_mtx);
 	smp_tlb_addr1 = addr1;
 	smp_tlb_addr2 = addr2;
 	atomic_store_rel_int(&smp_tlb_wait, 0);
 	ipi_all_but_self(vector);
 	while (smp_tlb_wait < ncpu)
 		ia32_pause();
 	mtx_unlock_spin(&smp_ipi_mtx);
 }
 
 static void
-smp_targeted_tlb_shootdown(cpumask_t mask, u_int vector, vm_offset_t addr1, vm_offset_t addr2)
+smp_targeted_tlb_shootdown(cpuset_t mask, u_int vector, vm_offset_t addr1, vm_offset_t addr2)
 {
-	int ncpu, othercpus;
+	int cpu, ncpu, othercpus;
 
 	othercpus = mp_ncpus - 1;
-	if (mask == (u_int)-1) {
-		ncpu = othercpus;
-		if (ncpu < 1)
+	if (CPU_ISFULLSET(&mask)) {
+		if (othercpus < 1)
 			return;
 	} else {
-		mask &= ~PCPU_GET(cpumask);
-		if (mask == 0)
+		sched_pin();
+		CPU_NAND(&mask, PCPU_PTR(cpumask));
+		sched_unpin();
+		if (CPU_EMPTY(&mask))
 			return;
-		ncpu = bitcount32(mask);
-		if (ncpu > othercpus) {
-			/* XXX this should be a panic offence */
-			printf("SMP: tlb shootdown to %d other cpus (only have %d)\n",
-			    ncpu, othercpus);
-			ncpu = othercpus;
-		}
-		/* XXX should be a panic, implied by mask == 0 above */
-		if (ncpu < 1)
-			return;
 	}
 	if (!(read_eflags() & PSL_I))
 		panic("%s: interrupts disabled", __func__);
 	mtx_lock_spin(&smp_ipi_mtx);
 	smp_tlb_addr1 = addr1;
 	smp_tlb_addr2 = addr2;
 	atomic_store_rel_int(&smp_tlb_wait, 0);
-	if (mask == (u_int)-1)
+	if (CPU_ISFULLSET(&mask)) {
+		ncpu = othercpus;
 		ipi_all_but_self(vector);
-	else
-		ipi_selected(mask, vector);
+	} else {
+		ncpu = 0;
+		while ((cpu = cpusetobj_ffs(&mask)) != 0) {
+			cpu--;
+			CPU_CLR(cpu, &mask);
+			CTR3(KTR_SMP, "%s: cpu: %d ipi: %x", __func__, cpu,
+			    vector);
+			ipi_send_cpu(cpu, vector);
+			ncpu++;
+		}
+	}
 	while (smp_tlb_wait < ncpu)
 		ia32_pause();
 	mtx_unlock_spin(&smp_ipi_mtx);
 }
 
-/*
- * Send an IPI to specified CPU handling the bitmap logic.
- */
-static void
-ipi_send_cpu(int cpu, u_int ipi)
-{
-	u_int bitmap, old_pending, new_pending;
-
-	KASSERT(cpu_apic_ids[cpu] != -1, ("IPI to non-existent CPU %d", cpu));
-
-	if (IPI_IS_BITMAPED(ipi)) {
-		bitmap = 1 << ipi;
-		ipi = IPI_BITMAP_VECTOR;
-		do {
-			old_pending = cpu_ipi_pending[cpu];
-			new_pending = old_pending | bitmap;
-		} while  (!atomic_cmpset_int(&cpu_ipi_pending[cpu],
-		    old_pending, new_pending));	
-		if (old_pending)
-			return;
-	}
-	lapic_ipi_vectored(ipi, cpu_apic_ids[cpu]);
-}
-
 void
 smp_cache_flush(void)
 {
 
 	if (smp_started)
 		smp_tlb_shootdown(IPI_INVLCACHE, 0, 0);
 }
 
 void
 smp_invltlb(void)
 {
 
 	if (smp_started) {
 		smp_tlb_shootdown(IPI_INVLTLB, 0, 0);
 #ifdef COUNT_XINVLTLB_HITS
 		ipi_global++;
 #endif
 	}
 }
 
 void
 smp_invlpg(vm_offset_t addr)
 {
 
 	if (smp_started) {
 		smp_tlb_shootdown(IPI_INVLPG, addr, 0);
 #ifdef COUNT_XINVLTLB_HITS
 		ipi_page++;
 #endif
 	}
 }
 
 void
 smp_invlpg_range(vm_offset_t addr1, vm_offset_t addr2)
 {
 
 	if (smp_started) {
 		smp_tlb_shootdown(IPI_INVLRNG, addr1, addr2);
 #ifdef COUNT_XINVLTLB_HITS
 		ipi_range++;
 		ipi_range_size += (addr2 - addr1) / PAGE_SIZE;
 #endif
 	}
 }
 
 void
-smp_masked_invltlb(cpumask_t mask)
+smp_masked_invltlb(cpuset_t mask)
 {
 
 	if (smp_started) {
 		smp_targeted_tlb_shootdown(mask, IPI_INVLTLB, 0, 0);
 #ifdef COUNT_XINVLTLB_HITS
 		ipi_masked_global++;
 #endif
 	}
 }
 
 void
-smp_masked_invlpg(cpumask_t mask, vm_offset_t addr)
+smp_masked_invlpg(cpuset_t mask, vm_offset_t addr)
 {
 
 	if (smp_started) {
 		smp_targeted_tlb_shootdown(mask, IPI_INVLPG, addr, 0);
 #ifdef COUNT_XINVLTLB_HITS
 		ipi_masked_page++;
 #endif
 	}
 }
 
 void
-smp_masked_invlpg_range(cpumask_t mask, vm_offset_t addr1, vm_offset_t addr2)
+smp_masked_invlpg_range(cpuset_t mask, vm_offset_t addr1, vm_offset_t addr2)
 {
 
 	if (smp_started) {
 		smp_targeted_tlb_shootdown(mask, IPI_INVLRNG, addr1, addr2);
 #ifdef COUNT_XINVLTLB_HITS
 		ipi_masked_range++;
 		ipi_masked_range_size += (addr2 - addr1) / PAGE_SIZE;
 #endif
 	}
 }
 
 void
 ipi_bitmap_handler(struct trapframe frame)
 {
 	struct trapframe *oldframe;
 	struct thread *td;
 	int cpu = PCPU_GET(cpuid);
 	u_int ipi_bitmap;
 
 	critical_enter();
 	td = curthread;
 	td->td_intr_nesting_level++;
 	oldframe = td->td_intr_frame;
 	td->td_intr_frame = &frame;
 	ipi_bitmap = atomic_readandclear_int(&cpu_ipi_pending[cpu]);
 	if (ipi_bitmap & (1 << IPI_PREEMPT)) {
 #ifdef COUNT_IPIS
 		(*ipi_preempt_counts[cpu])++;
 #endif
 		sched_preempt(td);
 	}
 	if (ipi_bitmap & (1 << IPI_AST)) {
 #ifdef COUNT_IPIS
 		(*ipi_ast_counts[cpu])++;
 #endif
 		/* Nothing to do for AST */
 	}
 	if (ipi_bitmap & (1 << IPI_HARDCLOCK)) {
 #ifdef COUNT_IPIS
 		(*ipi_hardclock_counts[cpu])++;
 #endif
 		hardclockintr();
 	}
 	td->td_intr_frame = oldframe;
 	td->td_intr_nesting_level--;
 	critical_exit();
 }
 
 /*
  * send an IPI to a set of cpus.
  */
 void
-ipi_selected(cpumask_t cpus, u_int ipi)
+ipi_selected(cpuset_t cpus, u_int ipi)
 {
 	int cpu;
 
 	/*
 	 * IPI_STOP_HARD maps to a NMI and the trap handler needs a bit
 	 * of help in order to understand what is the source.
 	 * Set the mask of receiving CPUs for this purpose.
 	 */
 	if (ipi == IPI_STOP_HARD)
-		atomic_set_int(&ipi_nmi_pending, cpus);
+		CPU_OR_ATOMIC(&ipi_nmi_pending, &cpus);
 
-	CTR3(KTR_SMP, "%s: cpus: %x ipi: %x", __func__, cpus, ipi);
-	while ((cpu = ffs(cpus)) != 0) {
+	while ((cpu = cpusetobj_ffs(&cpus)) != 0) {
 		cpu--;
-		cpus &= ~(1 << cpu);
+		CPU_CLR(cpu, &cpus);
+		CTR3(KTR_SMP, "%s: cpu: %d ipi: %x", __func__, cpu, ipi);
 		ipi_send_cpu(cpu, ipi);
 	}
 }
 
 /*
  * send an IPI to a specific CPU.
  */
 void
 ipi_cpu(int cpu, u_int ipi)
 {
 
 	/*
 	 * IPI_STOP_HARD maps to a NMI and the trap handler needs a bit
 	 * of help in order to understand what is the source.
 	 * Set the mask of receiving CPUs for this purpose.
 	 */
 	if (ipi == IPI_STOP_HARD)
-		atomic_set_int(&ipi_nmi_pending, 1 << cpu);
+		CPU_SET_ATOMIC(cpu, &ipi_nmi_pending);
 
 	CTR3(KTR_SMP, "%s: cpu: %d ipi: %x", __func__, cpu, ipi);
 	ipi_send_cpu(cpu, ipi);
 }
 
 /*
  * send an IPI to all CPUs EXCEPT myself
  */
 void
 ipi_all_but_self(u_int ipi)
 {
 
+	sched_pin();
 	if (IPI_IS_BITMAPED(ipi)) {
 		ipi_selected(PCPU_GET(other_cpus), ipi);
+		sched_unpin();
 		return;
 	}
 
 	/*
 	 * IPI_STOP_HARD maps to a NMI and the trap handler needs a bit
 	 * of help in order to understand what is the source.
 	 * Set the mask of receiving CPUs for this purpose.
 	 */
 	if (ipi == IPI_STOP_HARD)
-		atomic_set_int(&ipi_nmi_pending, PCPU_GET(other_cpus));
+		CPU_OR_ATOMIC(&ipi_nmi_pending, PCPU_PTR(other_cpus));
+	sched_unpin();
+
 	CTR2(KTR_SMP, "%s: ipi: %x", __func__, ipi);
 	lapic_ipi_vectored(ipi, APIC_IPI_DEST_OTHERS);
 }
 
 int
 ipi_nmi_handler()
 {
-	cpumask_t cpumask;
+	cpuset_t cpumask;
 
 	/*
 	 * As long as there is not a simple way to know about a NMI's
 	 * source, if the bitmask for the current CPU is present in
 	 * the global pending bitword an IPI_STOP_HARD has been issued
 	 * and should be handled.
 	 */
+	sched_pin();
 	cpumask = PCPU_GET(cpumask);
-	if ((ipi_nmi_pending & cpumask) == 0)
+	sched_unpin();
+	if (!CPU_OVERLAP(&ipi_nmi_pending, &cpumask))
 		return (1);
 
-	atomic_clear_int(&ipi_nmi_pending, cpumask);
+	CPU_NAND_ATOMIC(&ipi_nmi_pending, &cpumask);
 	cpustop_handler();
 	return (0);
 }
 
 /*
  * Handle an IPI_STOP by saving our current context and spinning until we
  * are resumed.
  */
 void
 cpustop_handler(void)
 {
-	cpumask_t cpumask;
+	cpuset_t cpumask;
 	u_int cpu;
 
+	sched_pin();
 	cpu = PCPU_GET(cpuid);
 	cpumask = PCPU_GET(cpumask);
+	sched_unpin();
 
 	savectx(&stoppcbs[cpu]);
 
 	/* Indicate that we are stopped */
-	atomic_set_int(&stopped_cpus, cpumask);
+	CPU_OR_ATOMIC(&stopped_cpus, &cpumask);
 
 	/* Wait for restart */
-	while (!(started_cpus & cpumask))
+	while (!CPU_OVERLAP(&started_cpus, &cpumask))
 	    ia32_pause();
 
-	atomic_clear_int(&started_cpus, cpumask);
-	atomic_clear_int(&stopped_cpus, cpumask);
+	CPU_NAND_ATOMIC(&started_cpus, &cpumask);
+	CPU_NAND_ATOMIC(&stopped_cpus, &cpumask);
 
 	if (cpu == 0 && cpustop_restartfunc != NULL) {
 		cpustop_restartfunc();
 		cpustop_restartfunc = NULL;
 	}
 }
 
 /*
  * This is called once the rest of the system is up and running and we're
  * ready to let the AP's out of the pen.
  */
 static void
 release_aps(void *dummy __unused)
 {
 
 	if (mp_ncpus == 1) 
 		return;
 	atomic_store_rel_int(&aps_ready, 1);
 	while (smp_started == 0)
 		ia32_pause();
 }
 SYSINIT(start_aps, SI_SUB_SMP, SI_ORDER_FIRST, release_aps, NULL);
 
 static int
 sysctl_hlt_cpus(SYSCTL_HANDLER_ARGS)
 {
-	cpumask_t mask;
+	cpuset_t mask;
 	int error;
 
 	mask = hlt_cpus_mask;
-	error = sysctl_handle_int(oidp, &mask, 0, req);
+	error = sysctl_handle_opaque(oidp, &mask, sizeof(mask), req);
 	if (error || !req->newptr)
 		return (error);
 
-	if (logical_cpus_mask != 0 &&
-	    (mask & logical_cpus_mask) == logical_cpus_mask)
+	if (!CPU_EMPTY(&logical_cpus_mask) &&
+	    CPU_SUBSET(&mask, &logical_cpus_mask))
 		hlt_logical_cpus = 1;
 	else
 		hlt_logical_cpus = 0;
 
 	if (! hyperthreading_allowed)
-		mask |= hyperthreading_cpus_mask;
+		CPU_OR(&mask, &hyperthreading_cpus_mask);
 
-	if ((mask & all_cpus) == all_cpus)
-		mask &= ~(1<<0);
+	if (CPU_SUBSET(&mask, &all_cpus))
+		CPU_CLR(0, &mask);
 	hlt_cpus_mask = mask;
 	return (error);
 }
-SYSCTL_PROC(_machdep, OID_AUTO, hlt_cpus, CTLTYPE_INT|CTLFLAG_RW,
-    0, 0, sysctl_hlt_cpus, "IU",
+SYSCTL_PROC(_machdep, OID_AUTO, hlt_cpus,
+    CTLTYPE_STRUCT | CTLFLAG_RW | CTLFLAG_MPSAFE, 0, 0, sysctl_hlt_cpus, "S",
     "Bitmap of CPUs to halt.  101 (binary) will halt CPUs 0 and 2.");
 
 static int
 sysctl_hlt_logical_cpus(SYSCTL_HANDLER_ARGS)
 {
 	int disable, error;
 
 	disable = hlt_logical_cpus;
 	error = sysctl_handle_int(oidp, &disable, 0, req);
 	if (error || !req->newptr)
 		return (error);
 
 	if (disable)
-		hlt_cpus_mask |= logical_cpus_mask;
+		CPU_OR(&hlt_cpus_mask, &logical_cpus_mask);
 	else
-		hlt_cpus_mask &= ~logical_cpus_mask;
+		CPU_NAND(&hlt_cpus_mask, &logical_cpus_mask);
 
 	if (! hyperthreading_allowed)
-		hlt_cpus_mask |= hyperthreading_cpus_mask;
+		CPU_OR(&hlt_cpus_mask, &hyperthreading_cpus_mask);
 
-	if ((hlt_cpus_mask & all_cpus) == all_cpus)
-		hlt_cpus_mask &= ~(1<<0);
+	if (CPU_SUBSET(&hlt_cpus_mask, &all_cpus))
+		CPU_CLR(0, &hlt_cpus_mask);
 
 	hlt_logical_cpus = disable;
 	return (error);
 }
 
 static int
 sysctl_hyperthreading_allowed(SYSCTL_HANDLER_ARGS)
 {
 	int allowed, error;
 
 	allowed = hyperthreading_allowed;
 	error = sysctl_handle_int(oidp, &allowed, 0, req);
 	if (error || !req->newptr)
 		return (error);
 
 #ifdef SCHED_ULE
 	/*
 	 * SCHED_ULE doesn't allow enabling/disabling HT cores at
 	 * run-time.
 	 */
 	if (allowed != hyperthreading_allowed)
 		return (ENOTSUP);
 	return (error);
 #endif
 
 	if (allowed)
-		hlt_cpus_mask &= ~hyperthreading_cpus_mask;
+		CPU_NAND(&hlt_cpus_mask, &hyperthreading_cpus_mask);
 	else
-		hlt_cpus_mask |= hyperthreading_cpus_mask;
+		CPU_OR(&hlt_cpus_mask, &hyperthreading_cpus_mask);
 
-	if (logical_cpus_mask != 0 &&
-	    (hlt_cpus_mask & logical_cpus_mask) == logical_cpus_mask)
+	if (!CPU_EMPTY(&logical_cpus_mask) &&
+	    CPU_SUBSET(&hlt_cpus_mask, &logical_cpus_mask))
 		hlt_logical_cpus = 1;
 	else
 		hlt_logical_cpus = 0;
 
-	if ((hlt_cpus_mask & all_cpus) == all_cpus)
-		hlt_cpus_mask &= ~(1<<0);
+	if (CPU_SUBSET(&hlt_cpus_mask, &all_cpus))
+		CPU_CLR(0, &hlt_cpus_mask);
 
 	hyperthreading_allowed = allowed;
 	return (error);
 }
 
 static void
 cpu_hlt_setup(void *dummy __unused)
 {
 
-	if (logical_cpus_mask != 0) {
+	if (!CPU_EMPTY(&logical_cpus_mask)) {
 		TUNABLE_INT_FETCH("machdep.hlt_logical_cpus",
 		    &hlt_logical_cpus);
 		sysctl_ctx_init(&logical_cpu_clist);
 		SYSCTL_ADD_PROC(&logical_cpu_clist,
 		    SYSCTL_STATIC_CHILDREN(_machdep), OID_AUTO,
 		    "hlt_logical_cpus", CTLTYPE_INT|CTLFLAG_RW, 0, 0,
 		    sysctl_hlt_logical_cpus, "IU", "");
 		SYSCTL_ADD_UINT(&logical_cpu_clist,
 		    SYSCTL_STATIC_CHILDREN(_machdep), OID_AUTO,
 		    "logical_cpus_mask", CTLTYPE_INT|CTLFLAG_RD,
 		    &logical_cpus_mask, 0, "");
 
 		if (hlt_logical_cpus)
-			hlt_cpus_mask |= logical_cpus_mask;
+			CPU_OR(&hlt_cpus_mask, &logical_cpus_mask);
 
 		/*
 		 * If necessary for security purposes, force
 		 * hyperthreading off, regardless of the value
 		 * of hlt_logical_cpus.
 		 */
-		if (hyperthreading_cpus_mask) {
+		if (!CPU_EMPTY(&hyperthreading_cpus_mask)) {
 			SYSCTL_ADD_PROC(&logical_cpu_clist,
 			    SYSCTL_STATIC_CHILDREN(_machdep), OID_AUTO,
 			    "hyperthreading_allowed", CTLTYPE_INT|CTLFLAG_RW,
 			    0, 0, sysctl_hyperthreading_allowed, "IU", "");
 			if (! hyperthreading_allowed)
-				hlt_cpus_mask |= hyperthreading_cpus_mask;
+				CPU_OR(&hlt_cpus_mask,
+				    &hyperthreading_cpus_mask);
 		}
 	}
 }
 SYSINIT(cpu_hlt, SI_SUB_SMP, SI_ORDER_ANY, cpu_hlt_setup, NULL);
 
 int
 mp_grab_cpu_hlt(void)
 {
-	cpumask_t mask;
+	cpuset_t mask;
 #ifdef MP_WATCHDOG
 	u_int cpuid;
 #endif
 	int retval;
 
 	mask = PCPU_GET(cpumask);
 #ifdef MP_WATCHDOG
 	cpuid = PCPU_GET(cpuid);
 	ap_watchdog(cpuid);
 #endif
 
 	retval = 0;
-	while (mask & hlt_cpus_mask) {
+	while (CPU_OVERLAP(&mask, &hlt_cpus_mask)) {
 		retval = 1;
 		__asm __volatile("sti; hlt" : : : "memory");
 	}
 	return (retval);
 }
 
 #ifdef COUNT_IPIS
 /*
  * Setup interrupt counters for IPI handlers.
  */
 static void
 mp_ipi_intrcnt(void *dummy)
 {
 	char buf[64];
 	int i;
 
 	CPU_FOREACH(i) {
 		snprintf(buf, sizeof(buf), "cpu%d:invltlb", i);
 		intrcnt_add(buf, &ipi_invltlb_counts[i]);
 		snprintf(buf, sizeof(buf), "cpu%d:invlrng", i);
 		intrcnt_add(buf, &ipi_invlrng_counts[i]);
 		snprintf(buf, sizeof(buf), "cpu%d:invlpg", i);
 		intrcnt_add(buf, &ipi_invlpg_counts[i]);
 		snprintf(buf, sizeof(buf), "cpu%d:preempt", i);
 		intrcnt_add(buf, &ipi_preempt_counts[i]);
 		snprintf(buf, sizeof(buf), "cpu%d:ast", i);
 		intrcnt_add(buf, &ipi_ast_counts[i]);
 		snprintf(buf, sizeof(buf), "cpu%d:rendezvous", i);
 		intrcnt_add(buf, &ipi_rendezvous_counts[i]);
 		snprintf(buf, sizeof(buf), "cpu%d:lazypmap", i);
 		intrcnt_add(buf, &ipi_lazypmap_counts[i]);
 		snprintf(buf, sizeof(buf), "cpu%d:hardclock", i);
 		intrcnt_add(buf, &ipi_hardclock_counts[i]);
 	}		
 }
 SYSINIT(mp_ipi_intrcnt, SI_SUB_INTR, SI_ORDER_MIDDLE, mp_ipi_intrcnt, NULL);
 #endif
Index: head/sys/i386/i386/pmap.c
===================================================================
--- head/sys/i386/i386/pmap.c	(revision 222812)
+++ head/sys/i386/i386/pmap.c	(revision 222813)
@@ -1,5238 +1,5264 @@
 /*-
  * Copyright (c) 1991 Regents of the University of California.
  * All rights reserved.
  * Copyright (c) 1994 John S. Dyson
  * All rights reserved.
  * Copyright (c) 1994 David Greenman
  * All rights reserved.
  * Copyright (c) 2005-2010 Alan L. Cox <alc@cs.rice.edu>
  * All rights reserved.
  *
  * This code is derived from software contributed to Berkeley by
  * the Systems Programming Group of the University of Utah Computer
  * Science Department and William Jolitz of UUNET Technologies Inc.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
  * 1. Redistributions of source code must retain the above copyright
  *    notice, this list of conditions and the following disclaimer.
  * 2. Redistributions in binary form must reproduce the above copyright
  *    notice, this list of conditions and the following disclaimer in the
  *    documentation and/or other materials provided with the distribution.
  * 3. All advertising materials mentioning features or use of this software
  *    must display the following acknowledgement:
  *	This product includes software developed by the University of
  *	California, Berkeley and its contributors.
  * 4. Neither the name of the University nor the names of its contributors
  *    may be used to endorse or promote products derived from this software
  *    without specific prior written permission.
  *
  * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
  * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
  * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
  * ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
  * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
  * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
  * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
  * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
  * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
  * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
  * SUCH DAMAGE.
  *
  *	from:	@(#)pmap.c	7.7 (Berkeley)	5/12/91
  */
 /*-
  * Copyright (c) 2003 Networks Associates Technology, Inc.
  * All rights reserved.
  *
  * This software was developed for the FreeBSD Project by Jake Burkholder,
  * Safeport Network Services, and Network Associates Laboratories, the
  * Security Research Division of Network Associates, Inc. under
  * DARPA/SPAWAR contract N66001-01-C-8035 ("CBOSS"), as part of the DARPA
  * CHATS research program.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
  * 1. Redistributions of source code must retain the above copyright
  *    notice, this list of conditions and the following disclaimer.
  * 2. Redistributions in binary form must reproduce the above copyright
  *    notice, this list of conditions and the following disclaimer in the
  *    documentation and/or other materials provided with the distribution.
  *
  * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
  * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
  * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
  * ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
  * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
  * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
  * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
  * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
  * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
  * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
  * SUCH DAMAGE.
  */
 
 #include <sys/cdefs.h>
 __FBSDID("$FreeBSD$");
 
 /*
  *	Manages physical address maps.
  *
  *	In addition to hardware address maps, this
  *	module is called upon to provide software-use-only
  *	maps which may or may not be stored in the same
  *	form as hardware maps.  These pseudo-maps are
  *	used to store intermediate results from copy
  *	operations to and from address spaces.
  *
  *	Since the information managed by this module is
  *	also stored by the logical address mapping module,
  *	this module may throw away valid virtual-to-physical
  *	mappings at almost any time.  However, invalidations
  *	of virtual-to-physical mappings must be done as
  *	requested.
  *
  *	In order to cope with hardware architectures which
  *	make virtual-to-physical map invalidates expensive,
  *	this module may delay invalidate or reduced protection
  *	operations until such time as they are actually
  *	necessary.  This module is given full information as
  *	to which processors are currently using which maps,
  *	and to when physical maps must be made correct.
  */
 
 #include "opt_cpu.h"
 #include "opt_pmap.h"
 #include "opt_smp.h"
 #include "opt_xbox.h"
 
 #include <sys/param.h>
 #include <sys/systm.h>
 #include <sys/kernel.h>
 #include <sys/ktr.h>
 #include <sys/lock.h>
 #include <sys/malloc.h>
 #include <sys/mman.h>
 #include <sys/msgbuf.h>
 #include <sys/mutex.h>
 #include <sys/proc.h>
 #include <sys/sf_buf.h>
 #include <sys/sx.h>
 #include <sys/vmmeter.h>
 #include <sys/sched.h>
 #include <sys/sysctl.h>
 #ifdef SMP
 #include <sys/smp.h>
+#else
+#include <sys/cpuset.h>
 #endif
 
 #include <vm/vm.h>
 #include <vm/vm_param.h>
 #include <vm/vm_kern.h>
 #include <vm/vm_page.h>
 #include <vm/vm_map.h>
 #include <vm/vm_object.h>
 #include <vm/vm_extern.h>
 #include <vm/vm_pageout.h>
 #include <vm/vm_pager.h>
 #include <vm/vm_reserv.h>
 #include <vm/uma.h>
 
 #include <machine/cpu.h>
 #include <machine/cputypes.h>
 #include <machine/md_var.h>
 #include <machine/pcb.h>
 #include <machine/specialreg.h>
 #ifdef SMP
 #include <machine/smp.h>
 #endif
 
 #ifdef XBOX
 #include <machine/xbox.h>
 #endif
 
 #if !defined(CPU_DISABLE_SSE) && defined(I686_CPU)
 #define CPU_ENABLE_SSE
 #endif
 
 #ifndef PMAP_SHPGPERPROC
 #define PMAP_SHPGPERPROC 200
 #endif
 
 #if !defined(DIAGNOSTIC)
 #ifdef __GNUC_GNU_INLINE__
 #define PMAP_INLINE	__attribute__((__gnu_inline__)) inline
 #else
 #define PMAP_INLINE	extern inline
 #endif
 #else
 #define PMAP_INLINE
 #endif
 
 #define PV_STATS
 #ifdef PV_STATS
 #define PV_STAT(x)	do { x ; } while (0)
 #else
 #define PV_STAT(x)	do { } while (0)
 #endif
 
 #define	pa_index(pa)	((pa) >> PDRSHIFT)
 #define	pa_to_pvh(pa)	(&pv_table[pa_index(pa)])
 
 /*
  * Get PDEs and PTEs for user/kernel address space
  */
 #define	pmap_pde(m, v)	(&((m)->pm_pdir[(vm_offset_t)(v) >> PDRSHIFT]))
 #define pdir_pde(m, v) (m[(vm_offset_t)(v) >> PDRSHIFT])
 
 #define pmap_pde_v(pte)		((*(int *)pte & PG_V) != 0)
 #define pmap_pte_w(pte)		((*(int *)pte & PG_W) != 0)
 #define pmap_pte_m(pte)		((*(int *)pte & PG_M) != 0)
 #define pmap_pte_u(pte)		((*(int *)pte & PG_A) != 0)
 #define pmap_pte_v(pte)		((*(int *)pte & PG_V) != 0)
 
 #define pmap_pte_set_w(pte, v)	((v) ? atomic_set_int((u_int *)(pte), PG_W) : \
     atomic_clear_int((u_int *)(pte), PG_W))
 #define pmap_pte_set_prot(pte, v) ((*(int *)pte &= ~PG_PROT), (*(int *)pte |= (v)))
 
 struct pmap kernel_pmap_store;
 LIST_HEAD(pmaplist, pmap);
 static struct pmaplist allpmaps;
 static struct mtx allpmaps_lock;
 
 vm_offset_t virtual_avail;	/* VA of first avail page (after kernel bss) */
 vm_offset_t virtual_end;	/* VA of last avail page (end of kernel AS) */
 int pgeflag = 0;		/* PG_G or-in */
 int pseflag = 0;		/* PG_PS or-in */
 
 static int nkpt = NKPT;
 vm_offset_t kernel_vm_end = KERNBASE + NKPT * NBPDR;
 extern u_int32_t KERNend;
 extern u_int32_t KPTphys;
 
 #ifdef PAE
 pt_entry_t pg_nx;
 static uma_zone_t pdptzone;
 #endif
 
 SYSCTL_NODE(_vm, OID_AUTO, pmap, CTLFLAG_RD, 0, "VM/pmap parameters");
 
 static int pat_works = 1;
 SYSCTL_INT(_vm_pmap, OID_AUTO, pat_works, CTLFLAG_RD, &pat_works, 1,
     "Is page attribute table fully functional?");
 
 static int pg_ps_enabled = 1;
 SYSCTL_INT(_vm_pmap, OID_AUTO, pg_ps_enabled, CTLFLAG_RDTUN, &pg_ps_enabled, 0,
     "Are large page mappings enabled?");
 
 #define	PAT_INDEX_SIZE	8
 static int pat_index[PAT_INDEX_SIZE];	/* cache mode to PAT index conversion */
 
 /*
  * Data for the pv entry allocation mechanism
  */
 static int pv_entry_count = 0, pv_entry_max = 0, pv_entry_high_water = 0;
 static struct md_page *pv_table;
 static int shpgperproc = PMAP_SHPGPERPROC;
 
 struct pv_chunk *pv_chunkbase;		/* KVA block for pv_chunks */
 int pv_maxchunks;			/* How many chunks we have KVA for */
 vm_offset_t pv_vafree;			/* freelist stored in the PTE */
 
 /*
  * All those kernel PT submaps that BSD is so fond of
  */
 struct sysmaps {
 	struct	mtx lock;
 	pt_entry_t *CMAP1;
 	pt_entry_t *CMAP2;
 	caddr_t	CADDR1;
 	caddr_t	CADDR2;
 };
 static struct sysmaps sysmaps_pcpu[MAXCPU];
 pt_entry_t *CMAP1 = 0;
 static pt_entry_t *CMAP3;
 static pd_entry_t *KPTD;
 caddr_t CADDR1 = 0, ptvmmap = 0;
 static caddr_t CADDR3;
 struct msgbuf *msgbufp = 0;
 
 /*
  * Crashdump maps.
  */
 static caddr_t crashdumpmap;
 
 static pt_entry_t *PMAP1 = 0, *PMAP2;
 static pt_entry_t *PADDR1 = 0, *PADDR2;
 #ifdef SMP
 static int PMAP1cpu;
 static int PMAP1changedcpu;
 SYSCTL_INT(_debug, OID_AUTO, PMAP1changedcpu, CTLFLAG_RD, 
 	   &PMAP1changedcpu, 0,
 	   "Number of times pmap_pte_quick changed CPU with same PMAP1");
 #endif
 static int PMAP1changed;
 SYSCTL_INT(_debug, OID_AUTO, PMAP1changed, CTLFLAG_RD, 
 	   &PMAP1changed, 0,
 	   "Number of times pmap_pte_quick changed PMAP1");
 static int PMAP1unchanged;
 SYSCTL_INT(_debug, OID_AUTO, PMAP1unchanged, CTLFLAG_RD, 
 	   &PMAP1unchanged, 0,
 	   "Number of times pmap_pte_quick didn't change PMAP1");
 static struct mtx PMAP2mutex;
 
 static void	free_pv_entry(pmap_t pmap, pv_entry_t pv);
 static pv_entry_t get_pv_entry(pmap_t locked_pmap, int try);
 static void	pmap_pv_demote_pde(pmap_t pmap, vm_offset_t va, vm_paddr_t pa);
 static boolean_t pmap_pv_insert_pde(pmap_t pmap, vm_offset_t va, vm_paddr_t pa);
 static void	pmap_pv_promote_pde(pmap_t pmap, vm_offset_t va, vm_paddr_t pa);
 static void	pmap_pvh_free(struct md_page *pvh, pmap_t pmap, vm_offset_t va);
 static pv_entry_t pmap_pvh_remove(struct md_page *pvh, pmap_t pmap,
 		    vm_offset_t va);
 static int	pmap_pvh_wired_mappings(struct md_page *pvh, int count);
 
 static boolean_t pmap_demote_pde(pmap_t pmap, pd_entry_t *pde, vm_offset_t va);
 static boolean_t pmap_enter_pde(pmap_t pmap, vm_offset_t va, vm_page_t m,
     vm_prot_t prot);
 static vm_page_t pmap_enter_quick_locked(pmap_t pmap, vm_offset_t va,
     vm_page_t m, vm_prot_t prot, vm_page_t mpte);
 static void pmap_flush_page(vm_page_t m);
 static void pmap_insert_pt_page(pmap_t pmap, vm_page_t mpte);
 static void pmap_fill_ptp(pt_entry_t *firstpte, pt_entry_t newpte);
 static boolean_t pmap_is_modified_pvh(struct md_page *pvh);
 static boolean_t pmap_is_referenced_pvh(struct md_page *pvh);
 static void pmap_kenter_attr(vm_offset_t va, vm_paddr_t pa, int mode);
 static void pmap_kenter_pde(vm_offset_t va, pd_entry_t newpde);
 static vm_page_t pmap_lookup_pt_page(pmap_t pmap, vm_offset_t va);
 static void pmap_pde_attr(pd_entry_t *pde, int cache_bits);
 static void pmap_promote_pde(pmap_t pmap, pd_entry_t *pde, vm_offset_t va);
 static boolean_t pmap_protect_pde(pmap_t pmap, pd_entry_t *pde, vm_offset_t sva,
     vm_prot_t prot);
 static void pmap_pte_attr(pt_entry_t *pte, int cache_bits);
 static void pmap_remove_pde(pmap_t pmap, pd_entry_t *pdq, vm_offset_t sva,
     vm_page_t *free);
 static int pmap_remove_pte(pmap_t pmap, pt_entry_t *ptq, vm_offset_t sva,
     vm_page_t *free);
 static void pmap_remove_pt_page(pmap_t pmap, vm_page_t mpte);
 static void pmap_remove_page(struct pmap *pmap, vm_offset_t va,
     vm_page_t *free);
 static void pmap_remove_entry(struct pmap *pmap, vm_page_t m,
 					vm_offset_t va);
 static void pmap_insert_entry(pmap_t pmap, vm_offset_t va, vm_page_t m);
 static boolean_t pmap_try_insert_pv_entry(pmap_t pmap, vm_offset_t va,
     vm_page_t m);
 static void pmap_update_pde(pmap_t pmap, vm_offset_t va, pd_entry_t *pde,
     pd_entry_t newpde);
 static void pmap_update_pde_invalidate(vm_offset_t va, pd_entry_t newpde);
 
 static vm_page_t pmap_allocpte(pmap_t pmap, vm_offset_t va, int flags);
 
 static vm_page_t _pmap_allocpte(pmap_t pmap, unsigned ptepindex, int flags);
 static int _pmap_unwire_pte_hold(pmap_t pmap, vm_page_t m, vm_page_t *free);
 static pt_entry_t *pmap_pte_quick(pmap_t pmap, vm_offset_t va);
 static void pmap_pte_release(pt_entry_t *pte);
 static int pmap_unuse_pt(pmap_t, vm_offset_t, vm_page_t *);
 #ifdef PAE
 static void *pmap_pdpt_allocf(uma_zone_t zone, int bytes, u_int8_t *flags, int wait);
 #endif
 static void pmap_set_pg(void);
 
 CTASSERT(1 << PDESHIFT == sizeof(pd_entry_t));
 CTASSERT(1 << PTESHIFT == sizeof(pt_entry_t));
 
 /*
  * If you get an error here, then you set KVA_PAGES wrong! See the
  * description of KVA_PAGES in sys/i386/include/pmap.h. It must be
  * multiple of 4 for a normal kernel, or a multiple of 8 for a PAE.
  */
 CTASSERT(KERNBASE % (1 << 24) == 0);
 
 /*
  *	Bootstrap the system enough to run with virtual memory.
  *
  *	On the i386 this is called after mapping has already been enabled
  *	and just syncs the pmap module with what has already been done.
  *	[We can't call it easily with mapping off since the kernel is not
  *	mapped with PA == VA, hence we would have to relocate every address
  *	from the linked base (virtual) address "KERNBASE" to the actual
  *	(physical) address starting relative to 0]
  */
 void
 pmap_bootstrap(vm_paddr_t firstaddr)
 {
 	vm_offset_t va;
 	pt_entry_t *pte, *unused;
 	struct sysmaps *sysmaps;
 	int i;
 
 	/*
 	 * Initialize the first available kernel virtual address.  However,
 	 * using "firstaddr" may waste a few pages of the kernel virtual
 	 * address space, because locore may not have mapped every physical
 	 * page that it allocated.  Preferably, locore would provide a first
 	 * unused virtual address in addition to "firstaddr".
 	 */
 	virtual_avail = (vm_offset_t) KERNBASE + firstaddr;
 
 	virtual_end = VM_MAX_KERNEL_ADDRESS;
 
 	/*
 	 * Initialize the kernel pmap (which is statically allocated).
 	 */
 	PMAP_LOCK_INIT(kernel_pmap);
 	kernel_pmap->pm_pdir = (pd_entry_t *) (KERNBASE + (u_int)IdlePTD);
 #ifdef PAE
 	kernel_pmap->pm_pdpt = (pdpt_entry_t *) (KERNBASE + (u_int)IdlePDPT);
 #endif
 	kernel_pmap->pm_root = NULL;
-	kernel_pmap->pm_active = -1;	/* don't allow deactivation */
+	CPU_FILL(&kernel_pmap->pm_active);	/* don't allow deactivation */
 	TAILQ_INIT(&kernel_pmap->pm_pvchunk);
 	LIST_INIT(&allpmaps);
 
 	/*
 	 * Request a spin mutex so that changes to allpmaps cannot be
 	 * preempted by smp_rendezvous_cpus().  Otherwise,
 	 * pmap_update_pde_kernel() could access allpmaps while it is
 	 * being changed.
 	 */
 	mtx_init(&allpmaps_lock, "allpmaps", NULL, MTX_SPIN);
 	mtx_lock_spin(&allpmaps_lock);
 	LIST_INSERT_HEAD(&allpmaps, kernel_pmap, pm_list);
 	mtx_unlock_spin(&allpmaps_lock);
 
 	/*
 	 * Reserve some special page table entries/VA space for temporary
 	 * mapping of pages.
 	 */
 #define	SYSMAP(c, p, v, n)	\
 	v = (c)va; va += ((n)*PAGE_SIZE); p = pte; pte += (n);
 
 	va = virtual_avail;
 	pte = vtopte(va);
 
 	/*
 	 * CMAP1/CMAP2 are used for zeroing and copying pages.
 	 * CMAP3 is used for the idle process page zeroing.
 	 */
 	for (i = 0; i < MAXCPU; i++) {
 		sysmaps = &sysmaps_pcpu[i];
 		mtx_init(&sysmaps->lock, "SYSMAPS", NULL, MTX_DEF);
 		SYSMAP(caddr_t, sysmaps->CMAP1, sysmaps->CADDR1, 1)
 		SYSMAP(caddr_t, sysmaps->CMAP2, sysmaps->CADDR2, 1)
 	}
 	SYSMAP(caddr_t, CMAP1, CADDR1, 1)
 	SYSMAP(caddr_t, CMAP3, CADDR3, 1)
 
 	/*
 	 * Crashdump maps.
 	 */
 	SYSMAP(caddr_t, unused, crashdumpmap, MAXDUMPPGS)
 
 	/*
 	 * ptvmmap is used for reading arbitrary physical pages via /dev/mem.
 	 */
 	SYSMAP(caddr_t, unused, ptvmmap, 1)
 
 	/*
 	 * msgbufp is used to map the system message buffer.
 	 */
 	SYSMAP(struct msgbuf *, unused, msgbufp, atop(round_page(msgbufsize)))
 
 	/*
 	 * KPTmap is used by pmap_kextract().
 	 *
 	 * KPTmap is first initialized by locore.  However, that initial
 	 * KPTmap can only support NKPT page table pages.  Here, a larger
 	 * KPTmap is created that can support KVA_PAGES page table pages.
 	 */
 	SYSMAP(pt_entry_t *, KPTD, KPTmap, KVA_PAGES)
 
 	for (i = 0; i < NKPT; i++)
 		KPTD[i] = (KPTphys + (i << PAGE_SHIFT)) | pgeflag | PG_RW | PG_V;
 
 	/*
 	 * Adjust the start of the KPTD and KPTmap so that the implementation
 	 * of pmap_kextract() and pmap_growkernel() can be made simpler.
 	 */
 	KPTD -= KPTDI;
 	KPTmap -= i386_btop(KPTDI << PDRSHIFT);
 
 	/*
 	 * ptemap is used for pmap_pte_quick
 	 */
 	SYSMAP(pt_entry_t *, PMAP1, PADDR1, 1)
 	SYSMAP(pt_entry_t *, PMAP2, PADDR2, 1)
 
 	mtx_init(&PMAP2mutex, "PMAP2", NULL, MTX_DEF);
 
 	virtual_avail = va;
 
 	/*
 	 * Leave in place an identity mapping (virt == phys) for the low 1 MB
 	 * physical memory region that is used by the ACPI wakeup code.  This
 	 * mapping must not have PG_G set. 
 	 */
 #ifdef XBOX
 	/* FIXME: This is gross, but needed for the XBOX. Since we are in such
 	 * an early stadium, we cannot yet neatly map video memory ... :-(
 	 * Better fixes are very welcome! */
 	if (!arch_i386_is_xbox)
 #endif
 	for (i = 1; i < NKPT; i++)
 		PTD[i] = 0;
 
 	/* Initialize the PAT MSR if present. */
 	pmap_init_pat();
 
 	/* Turn on PG_G on kernel page(s) */
 	pmap_set_pg();
 }
 
 /*
  * Setup the PAT MSR.
  */
 void
 pmap_init_pat(void)
 {
 	int pat_table[PAT_INDEX_SIZE];
 	uint64_t pat_msr;
 	u_long cr0, cr4;
 	int i;
 
 	/* Set default PAT index table. */
 	for (i = 0; i < PAT_INDEX_SIZE; i++)
 		pat_table[i] = -1;
 	pat_table[PAT_WRITE_BACK] = 0;
 	pat_table[PAT_WRITE_THROUGH] = 1;
 	pat_table[PAT_UNCACHEABLE] = 3;
 	pat_table[PAT_WRITE_COMBINING] = 3;
 	pat_table[PAT_WRITE_PROTECTED] = 3;
 	pat_table[PAT_UNCACHED] = 3;
 
 	/* Bail if this CPU doesn't implement PAT. */
 	if ((cpu_feature & CPUID_PAT) == 0) {
 		for (i = 0; i < PAT_INDEX_SIZE; i++)
 			pat_index[i] = pat_table[i];
 		pat_works = 0;
 		return;
 	}
 
 	/*
 	 * Due to some Intel errata, we can only safely use the lower 4
 	 * PAT entries.
 	 *
 	 *   Intel Pentium III Processor Specification Update
 	 * Errata E.27 (Upper Four PAT Entries Not Usable With Mode B
 	 * or Mode C Paging)
 	 *
 	 *   Intel Pentium IV  Processor Specification Update
 	 * Errata N46 (PAT Index MSB May Be Calculated Incorrectly)
 	 */
 	if (cpu_vendor_id == CPU_VENDOR_INTEL &&
 	    !(CPUID_TO_FAMILY(cpu_id) == 6 && CPUID_TO_MODEL(cpu_id) >= 0xe))
 		pat_works = 0;
 
 	/* Initialize default PAT entries. */
 	pat_msr = PAT_VALUE(0, PAT_WRITE_BACK) |
 	    PAT_VALUE(1, PAT_WRITE_THROUGH) |
 	    PAT_VALUE(2, PAT_UNCACHED) |
 	    PAT_VALUE(3, PAT_UNCACHEABLE) |
 	    PAT_VALUE(4, PAT_WRITE_BACK) |
 	    PAT_VALUE(5, PAT_WRITE_THROUGH) |
 	    PAT_VALUE(6, PAT_UNCACHED) |
 	    PAT_VALUE(7, PAT_UNCACHEABLE);
 
 	if (pat_works) {
 		/*
 		 * Leave the indices 0-3 at the default of WB, WT, UC-, and UC.
 		 * Program 5 and 6 as WP and WC.
 		 * Leave 4 and 7 as WB and UC.
 		 */
 		pat_msr &= ~(PAT_MASK(5) | PAT_MASK(6));
 		pat_msr |= PAT_VALUE(5, PAT_WRITE_PROTECTED) |
 		    PAT_VALUE(6, PAT_WRITE_COMBINING);
 		pat_table[PAT_UNCACHED] = 2;
 		pat_table[PAT_WRITE_PROTECTED] = 5;
 		pat_table[PAT_WRITE_COMBINING] = 6;
 	} else {
 		/*
 		 * Just replace PAT Index 2 with WC instead of UC-.
 		 */
 		pat_msr &= ~PAT_MASK(2);
 		pat_msr |= PAT_VALUE(2, PAT_WRITE_COMBINING);
 		pat_table[PAT_WRITE_COMBINING] = 2;
 	}
 
 	/* Disable PGE. */
 	cr4 = rcr4();
 	load_cr4(cr4 & ~CR4_PGE);
 
 	/* Disable caches (CD = 1, NW = 0). */
 	cr0 = rcr0();
 	load_cr0((cr0 & ~CR0_NW) | CR0_CD);
 
 	/* Flushes caches and TLBs. */
 	wbinvd();
 	invltlb();
 
 	/* Update PAT and index table. */
 	wrmsr(MSR_PAT, pat_msr);
 	for (i = 0; i < PAT_INDEX_SIZE; i++)
 		pat_index[i] = pat_table[i];
 
 	/* Flush caches and TLBs again. */
 	wbinvd();
 	invltlb();
 
 	/* Restore caches and PGE. */
 	load_cr0(cr0);
 	load_cr4(cr4);
 }
 
 /*
  * Set PG_G on kernel pages.  Only the BSP calls this when SMP is turned on.
  */
 static void
 pmap_set_pg(void)
 {
 	pt_entry_t *pte;
 	vm_offset_t va, endva;
 
 	if (pgeflag == 0)
 		return;
 
 	endva = KERNBASE + KERNend;
 
 	if (pseflag) {
 		va = KERNBASE + KERNLOAD;
 		while (va  < endva) {
 			pdir_pde(PTD, va) |= pgeflag;
 			invltlb();	/* Play it safe, invltlb() every time */
 			va += NBPDR;
 		}
 	} else {
 		va = (vm_offset_t)btext;
 		while (va < endva) {
 			pte = vtopte(va);
 			if (*pte)
 				*pte |= pgeflag;
 			invltlb();	/* Play it safe, invltlb() every time */
 			va += PAGE_SIZE;
 		}
 	}
 }
 
 /*
  * Initialize a vm_page's machine-dependent fields.
  */
 void
 pmap_page_init(vm_page_t m)
 {
 
 	TAILQ_INIT(&m->md.pv_list);
 	m->md.pat_mode = PAT_WRITE_BACK;
 }
 
 #ifdef PAE
 static void *
 pmap_pdpt_allocf(uma_zone_t zone, int bytes, u_int8_t *flags, int wait)
 {
 
 	/* Inform UMA that this allocator uses kernel_map/object. */
 	*flags = UMA_SLAB_KERNEL;
 	return ((void *)kmem_alloc_contig(kernel_map, bytes, wait, 0x0ULL,
 	    0xffffffffULL, 1, 0, VM_MEMATTR_DEFAULT));
 }
 #endif
 
 /*
  * ABuse the pte nodes for unmapped kva to thread a kva freelist through.
  * Requirements:
  *  - Must deal with pages in order to ensure that none of the PG_* bits
  *    are ever set, PG_V in particular.
  *  - Assumes we can write to ptes without pte_store() atomic ops, even
  *    on PAE systems.  This should be ok.
  *  - Assumes nothing will ever test these addresses for 0 to indicate
  *    no mapping instead of correctly checking PG_V.
  *  - Assumes a vm_offset_t will fit in a pte (true for i386).
  * Because PG_V is never set, there can be no mappings to invalidate.
  */
 static vm_offset_t
 pmap_ptelist_alloc(vm_offset_t *head)
 {
 	pt_entry_t *pte;
 	vm_offset_t va;
 
 	va = *head;
 	if (va == 0)
 		return (va);	/* Out of memory */
 	pte = vtopte(va);
 	*head = *pte;
 	if (*head & PG_V)
 		panic("pmap_ptelist_alloc: va with PG_V set!");
 	*pte = 0;
 	return (va);
 }
 
 static void
 pmap_ptelist_free(vm_offset_t *head, vm_offset_t va)
 {
 	pt_entry_t *pte;
 
 	if (va & PG_V)
 		panic("pmap_ptelist_free: freeing va with PG_V set!");
 	pte = vtopte(va);
 	*pte = *head;		/* virtual! PG_V is 0 though */
 	*head = va;
 }
 
 static void
 pmap_ptelist_init(vm_offset_t *head, void *base, int npages)
 {
 	int i;
 	vm_offset_t va;
 
 	*head = 0;
 	for (i = npages - 1; i >= 0; i--) {
 		va = (vm_offset_t)base + i * PAGE_SIZE;
 		pmap_ptelist_free(head, va);
 	}
 }
 
 
 /*
  *	Initialize the pmap module.
  *	Called by vm_init, to initialize any structures that the pmap
  *	system needs to map virtual memory.
  */
 void
 pmap_init(void)
 {
 	vm_page_t mpte;
 	vm_size_t s;
 	int i, pv_npg;
 
 	/*
 	 * Initialize the vm page array entries for the kernel pmap's
 	 * page table pages.
 	 */ 
 	for (i = 0; i < NKPT; i++) {
 		mpte = PHYS_TO_VM_PAGE(KPTphys + (i << PAGE_SHIFT));
 		KASSERT(mpte >= vm_page_array &&
 		    mpte < &vm_page_array[vm_page_array_size],
 		    ("pmap_init: page table page is out of range"));
 		mpte->pindex = i + KPTDI;
 		mpte->phys_addr = KPTphys + (i << PAGE_SHIFT);
 	}
 
 	/*
 	 * Initialize the address space (zone) for the pv entries.  Set a
 	 * high water mark so that the system can recover from excessive
 	 * numbers of pv entries.
 	 */
 	TUNABLE_INT_FETCH("vm.pmap.shpgperproc", &shpgperproc);
 	pv_entry_max = shpgperproc * maxproc + cnt.v_page_count;
 	TUNABLE_INT_FETCH("vm.pmap.pv_entries", &pv_entry_max);
 	pv_entry_max = roundup(pv_entry_max, _NPCPV);
 	pv_entry_high_water = 9 * (pv_entry_max / 10);
 
 	/*
 	 * If the kernel is running in a virtual machine on an AMD Family 10h
 	 * processor, then it must assume that MCA is enabled by the virtual
 	 * machine monitor.
 	 */
 	if (vm_guest == VM_GUEST_VM && cpu_vendor_id == CPU_VENDOR_AMD &&
 	    CPUID_TO_FAMILY(cpu_id) == 0x10)
 		workaround_erratum383 = 1;
 
 	/*
 	 * Are large page mappings supported and enabled?
 	 */
 	TUNABLE_INT_FETCH("vm.pmap.pg_ps_enabled", &pg_ps_enabled);
 	if (pseflag == 0)
 		pg_ps_enabled = 0;
 	else if (pg_ps_enabled) {
 		KASSERT(MAXPAGESIZES > 1 && pagesizes[1] == 0,
 		    ("pmap_init: can't assign to pagesizes[1]"));
 		pagesizes[1] = NBPDR;
 	}
 
 	/*
 	 * Calculate the size of the pv head table for superpages.
 	 */
 	for (i = 0; phys_avail[i + 1]; i += 2);
 	pv_npg = round_4mpage(phys_avail[(i - 2) + 1]) / NBPDR;
 
 	/*
 	 * Allocate memory for the pv head table for superpages.
 	 */
 	s = (vm_size_t)(pv_npg * sizeof(struct md_page));
 	s = round_page(s);
 	pv_table = (struct md_page *)kmem_alloc(kernel_map, s);
 	for (i = 0; i < pv_npg; i++)
 		TAILQ_INIT(&pv_table[i].pv_list);
 
 	pv_maxchunks = MAX(pv_entry_max / _NPCPV, maxproc);
 	pv_chunkbase = (struct pv_chunk *)kmem_alloc_nofault(kernel_map,
 	    PAGE_SIZE * pv_maxchunks);
 	if (pv_chunkbase == NULL)
 		panic("pmap_init: not enough kvm for pv chunks");
 	pmap_ptelist_init(&pv_vafree, pv_chunkbase, pv_maxchunks);
 #ifdef PAE
 	pdptzone = uma_zcreate("PDPT", NPGPTD * sizeof(pdpt_entry_t), NULL,
 	    NULL, NULL, NULL, (NPGPTD * sizeof(pdpt_entry_t)) - 1,
 	    UMA_ZONE_VM | UMA_ZONE_NOFREE);
 	uma_zone_set_allocf(pdptzone, pmap_pdpt_allocf);
 #endif
 }
 
 
 SYSCTL_INT(_vm_pmap, OID_AUTO, pv_entry_max, CTLFLAG_RD, &pv_entry_max, 0,
 	"Max number of PV entries");
 SYSCTL_INT(_vm_pmap, OID_AUTO, shpgperproc, CTLFLAG_RD, &shpgperproc, 0,
 	"Page share factor per proc");
 
 SYSCTL_NODE(_vm_pmap, OID_AUTO, pde, CTLFLAG_RD, 0,
     "2/4MB page mapping counters");
 
 static u_long pmap_pde_demotions;
 SYSCTL_ULONG(_vm_pmap_pde, OID_AUTO, demotions, CTLFLAG_RD,
     &pmap_pde_demotions, 0, "2/4MB page demotions");
 
 static u_long pmap_pde_mappings;
 SYSCTL_ULONG(_vm_pmap_pde, OID_AUTO, mappings, CTLFLAG_RD,
     &pmap_pde_mappings, 0, "2/4MB page mappings");
 
 static u_long pmap_pde_p_failures;
 SYSCTL_ULONG(_vm_pmap_pde, OID_AUTO, p_failures, CTLFLAG_RD,
     &pmap_pde_p_failures, 0, "2/4MB page promotion failures");
 
 static u_long pmap_pde_promotions;
 SYSCTL_ULONG(_vm_pmap_pde, OID_AUTO, promotions, CTLFLAG_RD,
     &pmap_pde_promotions, 0, "2/4MB page promotions");
 
 /***************************************************
  * Low level helper routines.....
  ***************************************************/
 
 /*
  * Determine the appropriate bits to set in a PTE or PDE for a specified
  * caching mode.
  */
 int
 pmap_cache_bits(int mode, boolean_t is_pde)
 {
 	int cache_bits, pat_flag, pat_idx;
 
 	if (mode < 0 || mode >= PAT_INDEX_SIZE || pat_index[mode] < 0)
 		panic("Unknown caching mode %d\n", mode);
 
 	/* The PAT bit is different for PTE's and PDE's. */
 	pat_flag = is_pde ? PG_PDE_PAT : PG_PTE_PAT;
 
 	/* Map the caching mode to a PAT index. */
 	pat_idx = pat_index[mode];
 
 	/* Map the 3-bit index value into the PAT, PCD, and PWT bits. */
 	cache_bits = 0;
 	if (pat_idx & 0x4)
 		cache_bits |= pat_flag;
 	if (pat_idx & 0x2)
 		cache_bits |= PG_NC_PCD;
 	if (pat_idx & 0x1)
 		cache_bits |= PG_NC_PWT;
 	return (cache_bits);
 }
 
 /*
  * The caller is responsible for maintaining TLB consistency.
  */
 static void
 pmap_kenter_pde(vm_offset_t va, pd_entry_t newpde)
 {
 	pd_entry_t *pde;
 	pmap_t pmap;
 	boolean_t PTD_updated;
 
 	PTD_updated = FALSE;
 	mtx_lock_spin(&allpmaps_lock);
 	LIST_FOREACH(pmap, &allpmaps, pm_list) {
 		if ((pmap->pm_pdir[PTDPTDI] & PG_FRAME) == (PTDpde[0] &
 		    PG_FRAME))
 			PTD_updated = TRUE;
 		pde = pmap_pde(pmap, va);
 		pde_store(pde, newpde);
 	}
 	mtx_unlock_spin(&allpmaps_lock);
 	KASSERT(PTD_updated,
 	    ("pmap_kenter_pde: current page table is not in allpmaps"));
 }
 
 /*
  * After changing the page size for the specified virtual address in the page
  * table, flush the corresponding entries from the processor's TLB.  Only the
  * calling processor's TLB is affected.
  *
  * The calling thread must be pinned to a processor.
  */
 static void
 pmap_update_pde_invalidate(vm_offset_t va, pd_entry_t newpde)
 {
 	u_long cr4;
 
 	if ((newpde & PG_PS) == 0)
 		/* Demotion: flush a specific 2MB page mapping. */
 		invlpg(va);
 	else if ((newpde & PG_G) == 0)
 		/*
 		 * Promotion: flush every 4KB page mapping from the TLB
 		 * because there are too many to flush individually.
 		 */
 		invltlb();
 	else {
 		/*
 		 * Promotion: flush every 4KB page mapping from the TLB,
 		 * including any global (PG_G) mappings.
 		 */
 		cr4 = rcr4();
 		load_cr4(cr4 & ~CR4_PGE);
 		/*
 		 * Although preemption at this point could be detrimental to
 		 * performance, it would not lead to an error.  PG_G is simply
 		 * ignored if CR4.PGE is clear.  Moreover, in case this block
 		 * is re-entered, the load_cr4() either above or below will
 		 * modify CR4.PGE flushing the TLB.
 		 */
 		load_cr4(cr4 | CR4_PGE);
 	}
 }
 #ifdef SMP
 /*
  * For SMP, these functions have to use the IPI mechanism for coherence.
  *
  * N.B.: Before calling any of the following TLB invalidation functions,
  * the calling processor must ensure that all stores updating a non-
  * kernel page table are globally performed.  Otherwise, another
  * processor could cache an old, pre-update entry without being
  * invalidated.  This can happen one of two ways: (1) The pmap becomes
  * active on another processor after its pm_active field is checked by
  * one of the following functions but before a store updating the page
  * table is globally performed. (2) The pmap becomes active on another
  * processor before its pm_active field is checked but due to
  * speculative loads one of the following functions stills reads the
  * pmap as inactive on the other processor.
  * 
  * The kernel page table is exempt because its pm_active field is
  * immutable.  The kernel page table is always active on every
  * processor.
  */
 void
 pmap_invalidate_page(pmap_t pmap, vm_offset_t va)
 {
-	cpumask_t cpumask, other_cpus;
+	cpuset_t cpumask, other_cpus;
 
 	sched_pin();
-	if (pmap == kernel_pmap || pmap->pm_active == all_cpus) {
+	if (pmap == kernel_pmap || !CPU_CMP(&pmap->pm_active, &all_cpus)) {
 		invlpg(va);
 		smp_invlpg(va);
 	} else {
 		cpumask = PCPU_GET(cpumask);
 		other_cpus = PCPU_GET(other_cpus);
-		if (pmap->pm_active & cpumask)
+		if (CPU_OVERLAP(&pmap->pm_active, &cpumask))
 			invlpg(va);
-		if (pmap->pm_active & other_cpus)
-			smp_masked_invlpg(pmap->pm_active & other_cpus, va);
+		CPU_AND(&other_cpus, &pmap->pm_active);
+		if (!CPU_EMPTY(&other_cpus))
+			smp_masked_invlpg(other_cpus, va);
 	}
 	sched_unpin();
 }
 
 void
 pmap_invalidate_range(pmap_t pmap, vm_offset_t sva, vm_offset_t eva)
 {
-	cpumask_t cpumask, other_cpus;
+	cpuset_t cpumask, other_cpus;
 	vm_offset_t addr;
 
 	sched_pin();
-	if (pmap == kernel_pmap || pmap->pm_active == all_cpus) {
+	if (pmap == kernel_pmap || !CPU_CMP(&pmap->pm_active, &all_cpus)) {
 		for (addr = sva; addr < eva; addr += PAGE_SIZE)
 			invlpg(addr);
 		smp_invlpg_range(sva, eva);
 	} else {
 		cpumask = PCPU_GET(cpumask);
 		other_cpus = PCPU_GET(other_cpus);
-		if (pmap->pm_active & cpumask)
+		if (CPU_OVERLAP(&pmap->pm_active, &cpumask))
 			for (addr = sva; addr < eva; addr += PAGE_SIZE)
 				invlpg(addr);
-		if (pmap->pm_active & other_cpus)
-			smp_masked_invlpg_range(pmap->pm_active & other_cpus,
-			    sva, eva);
+		CPU_AND(&other_cpus, &pmap->pm_active);
+		if (!CPU_EMPTY(&other_cpus))
+			smp_masked_invlpg_range(other_cpus, sva, eva);
 	}
 	sched_unpin();
 }
 
 void
 pmap_invalidate_all(pmap_t pmap)
 {
-	cpumask_t cpumask, other_cpus;
+	cpuset_t cpumask, other_cpus;
 
 	sched_pin();
-	if (pmap == kernel_pmap || pmap->pm_active == all_cpus) {
+	if (pmap == kernel_pmap || !CPU_CMP(&pmap->pm_active, &all_cpus)) {
 		invltlb();
 		smp_invltlb();
 	} else {
 		cpumask = PCPU_GET(cpumask);
 		other_cpus = PCPU_GET(other_cpus);
-		if (pmap->pm_active & cpumask)
+		if (CPU_OVERLAP(&pmap->pm_active, &cpumask))
 			invltlb();
-		if (pmap->pm_active & other_cpus)
-			smp_masked_invltlb(pmap->pm_active & other_cpus);
+		CPU_AND(&other_cpus, &pmap->pm_active);
+		if (!CPU_EMPTY(&other_cpus))
+			smp_masked_invltlb(other_cpus);
 	}
 	sched_unpin();
 }
 
 void
 pmap_invalidate_cache(void)
 {
 
 	sched_pin();
 	wbinvd();
 	smp_cache_flush();
 	sched_unpin();
 }
 
 struct pde_action {
-	cpumask_t store;	/* processor that updates the PDE */
-	cpumask_t invalidate;	/* processors that invalidate their TLB */
+	cpuset_t store;		/* processor that updates the PDE */
+	cpuset_t invalidate;	/* processors that invalidate their TLB */
 	vm_offset_t va;
 	pd_entry_t *pde;
 	pd_entry_t newpde;
 };
 
 static void
 pmap_update_pde_kernel(void *arg)
 {
 	struct pde_action *act = arg;
 	pd_entry_t *pde;
 	pmap_t pmap;
 
-	if (act->store == PCPU_GET(cpumask))
+	sched_pin();
+	if (!CPU_CMP(&act->store, PCPU_PTR(cpumask))) {
+		sched_unpin();
+
 		/*
 		 * Elsewhere, this operation requires allpmaps_lock for
 		 * synchronization.  Here, it does not because it is being
 		 * performed in the context of an all_cpus rendezvous.
 		 */
 		LIST_FOREACH(pmap, &allpmaps, pm_list) {
 			pde = pmap_pde(pmap, act->va);
 			pde_store(pde, act->newpde);
 		}
+	} else
+		sched_unpin();
 }
 
 static void
 pmap_update_pde_user(void *arg)
 {
 	struct pde_action *act = arg;
 
-	if (act->store == PCPU_GET(cpumask))
+	sched_pin();
+	if (!CPU_CMP(&act->store, PCPU_PTR(cpumask))) {
+		sched_unpin();
 		pde_store(act->pde, act->newpde);
+	} else
+		sched_unpin();
 }
 
 static void
 pmap_update_pde_teardown(void *arg)
 {
 	struct pde_action *act = arg;
 
-	if ((act->invalidate & PCPU_GET(cpumask)) != 0)
+	sched_pin();
+	if (CPU_OVERLAP(&act->invalidate, PCPU_PTR(cpumask))) {
+		sched_unpin();
 		pmap_update_pde_invalidate(act->va, act->newpde);
+	} else
+		sched_unpin();
 }
 
 /*
  * Change the page size for the specified virtual address in a way that
  * prevents any possibility of the TLB ever having two entries that map the
  * same virtual address using different page sizes.  This is the recommended
  * workaround for Erratum 383 on AMD Family 10h processors.  It prevents a
  * machine check exception for a TLB state that is improperly diagnosed as a
  * hardware error.
  */
 static void
 pmap_update_pde(pmap_t pmap, vm_offset_t va, pd_entry_t *pde, pd_entry_t newpde)
 {
 	struct pde_action act;
-	cpumask_t active, cpumask;
+	cpuset_t active, cpumask, other_cpus;
 
 	sched_pin();
 	cpumask = PCPU_GET(cpumask);
+	other_cpus = PCPU_GET(other_cpus);
 	if (pmap == kernel_pmap)
 		active = all_cpus;
 	else
 		active = pmap->pm_active;
-	if ((active & PCPU_GET(other_cpus)) != 0) {
+	if (CPU_OVERLAP(&active, &other_cpus)) {
 		act.store = cpumask;
 		act.invalidate = active;
 		act.va = va;
 		act.pde = pde;
 		act.newpde = newpde;
-		smp_rendezvous_cpus(cpumask | active,
+		CPU_OR(&cpumask, &active);
+		smp_rendezvous_cpus(cpumask,
 		    smp_no_rendevous_barrier, pmap == kernel_pmap ?
 		    pmap_update_pde_kernel : pmap_update_pde_user,
 		    pmap_update_pde_teardown, &act);
 	} else {
 		if (pmap == kernel_pmap)
 			pmap_kenter_pde(va, newpde);
 		else
 			pde_store(pde, newpde);
-		if ((active & cpumask) != 0)
+		if (CPU_OVERLAP(&active, &cpumask))
 			pmap_update_pde_invalidate(va, newpde);
 	}
 	sched_unpin();
 }
 #else /* !SMP */
 /*
  * Normal, non-SMP, 486+ invalidation functions.
  * We inline these within pmap.c for speed.
  */
 PMAP_INLINE void
 pmap_invalidate_page(pmap_t pmap, vm_offset_t va)
 {
 
-	if (pmap == kernel_pmap || pmap->pm_active)
+	if (pmap == kernel_pmap || !CPU_EMPTY(&pmap->pm_active))
 		invlpg(va);
 }
 
 PMAP_INLINE void
 pmap_invalidate_range(pmap_t pmap, vm_offset_t sva, vm_offset_t eva)
 {
 	vm_offset_t addr;
 
-	if (pmap == kernel_pmap || pmap->pm_active)
+	if (pmap == kernel_pmap || !CPU_EMPTY(&pmap->pm_active))
 		for (addr = sva; addr < eva; addr += PAGE_SIZE)
 			invlpg(addr);
 }
 
 PMAP_INLINE void
 pmap_invalidate_all(pmap_t pmap)
 {
 
-	if (pmap == kernel_pmap || pmap->pm_active)
+	if (pmap == kernel_pmap || !CPU_EMPTY(&pmap->pm_active))
 		invltlb();
 }
 
 PMAP_INLINE void
 pmap_invalidate_cache(void)
 {
 
 	wbinvd();
 }
 
 static void
 pmap_update_pde(pmap_t pmap, vm_offset_t va, pd_entry_t *pde, pd_entry_t newpde)
 {
 
 	if (pmap == kernel_pmap)
 		pmap_kenter_pde(va, newpde);
 	else
 		pde_store(pde, newpde);
-	if (pmap == kernel_pmap || pmap->pm_active)
+	if (pmap == kernel_pmap || !CPU_EMPTY(&pmap->pm_active))
 		pmap_update_pde_invalidate(va, newpde);
 }
 #endif /* !SMP */
 
 #define	PMAP_CLFLUSH_THRESHOLD	(2 * 1024 * 1024)
 
 void
 pmap_invalidate_cache_range(vm_offset_t sva, vm_offset_t eva)
 {
 
 	KASSERT((sva & PAGE_MASK) == 0,
 	    ("pmap_invalidate_cache_range: sva not page-aligned"));
 	KASSERT((eva & PAGE_MASK) == 0,
 	    ("pmap_invalidate_cache_range: eva not page-aligned"));
 
 	if (cpu_feature & CPUID_SS)
 		; /* If "Self Snoop" is supported, do nothing. */
 	else if ((cpu_feature & CPUID_CLFSH) != 0 &&
 	    eva - sva < PMAP_CLFLUSH_THRESHOLD) {
 
 		/*
 		 * Otherwise, do per-cache line flush.  Use the mfence
 		 * instruction to insure that previous stores are
 		 * included in the write-back.  The processor
 		 * propagates flush to other processors in the cache
 		 * coherence domain.
 		 */
 		mfence();
 		for (; sva < eva; sva += cpu_clflush_line_size)
 			clflush(sva);
 		mfence();
 	} else {
 
 		/*
 		 * No targeted cache flush methods are supported by CPU,
 		 * or the supplied range is bigger than 2MB.
 		 * Globally invalidate cache.
 		 */
 		pmap_invalidate_cache();
 	}
 }
 
 void
 pmap_invalidate_cache_pages(vm_page_t *pages, int count)
 {
 	int i;
 
 	if (count >= PMAP_CLFLUSH_THRESHOLD / PAGE_SIZE ||
 	    (cpu_feature & CPUID_CLFSH) == 0) {
 		pmap_invalidate_cache();
 	} else {
 		for (i = 0; i < count; i++)
 			pmap_flush_page(pages[i]);
 	}
 }
 
 /*
  * Are we current address space or kernel?  N.B. We return FALSE when
  * a pmap's page table is in use because a kernel thread is borrowing
  * it.  The borrowed page table can change spontaneously, making any
  * dependence on its continued use subject to a race condition.
  */
 static __inline int
 pmap_is_current(pmap_t pmap)
 {
 
 	return (pmap == kernel_pmap ||
 		(pmap == vmspace_pmap(curthread->td_proc->p_vmspace) &&
 	    (pmap->pm_pdir[PTDPTDI] & PG_FRAME) == (PTDpde[0] & PG_FRAME)));
 }
 
 /*
  * If the given pmap is not the current or kernel pmap, the returned pte must
  * be released by passing it to pmap_pte_release().
  */
 pt_entry_t *
 pmap_pte(pmap_t pmap, vm_offset_t va)
 {
 	pd_entry_t newpf;
 	pd_entry_t *pde;
 
 	pde = pmap_pde(pmap, va);
 	if (*pde & PG_PS)
 		return (pde);
 	if (*pde != 0) {
 		/* are we current address space or kernel? */
 		if (pmap_is_current(pmap))
 			return (vtopte(va));
 		mtx_lock(&PMAP2mutex);
 		newpf = *pde & PG_FRAME;
 		if ((*PMAP2 & PG_FRAME) != newpf) {
 			*PMAP2 = newpf | PG_RW | PG_V | PG_A | PG_M;
 			pmap_invalidate_page(kernel_pmap, (vm_offset_t)PADDR2);
 		}
 		return (PADDR2 + (i386_btop(va) & (NPTEPG - 1)));
 	}
 	return (NULL);
 }
 
 /*
  * Releases a pte that was obtained from pmap_pte().  Be prepared for the pte
  * being NULL.
  */
 static __inline void
 pmap_pte_release(pt_entry_t *pte)
 {
 
 	if ((pt_entry_t *)((vm_offset_t)pte & ~PAGE_MASK) == PADDR2)
 		mtx_unlock(&PMAP2mutex);
 }
 
 static __inline void
 invlcaddr(void *caddr)
 {
 
 	invlpg((u_int)caddr);
 }
 
 /*
  * Super fast pmap_pte routine best used when scanning
  * the pv lists.  This eliminates many coarse-grained
  * invltlb calls.  Note that many of the pv list
  * scans are across different pmaps.  It is very wasteful
  * to do an entire invltlb for checking a single mapping.
  *
  * If the given pmap is not the current pmap, vm_page_queue_mtx
  * must be held and curthread pinned to a CPU.
  */
 static pt_entry_t *
 pmap_pte_quick(pmap_t pmap, vm_offset_t va)
 {
 	pd_entry_t newpf;
 	pd_entry_t *pde;
 
 	pde = pmap_pde(pmap, va);
 	if (*pde & PG_PS)
 		return (pde);
 	if (*pde != 0) {
 		/* are we current address space or kernel? */
 		if (pmap_is_current(pmap))
 			return (vtopte(va));
 		mtx_assert(&vm_page_queue_mtx, MA_OWNED);
 		KASSERT(curthread->td_pinned > 0, ("curthread not pinned"));
 		newpf = *pde & PG_FRAME;
 		if ((*PMAP1 & PG_FRAME) != newpf) {
 			*PMAP1 = newpf | PG_RW | PG_V | PG_A | PG_M;
 #ifdef SMP
 			PMAP1cpu = PCPU_GET(cpuid);
 #endif
 			invlcaddr(PADDR1);
 			PMAP1changed++;
 		} else
 #ifdef SMP
 		if (PMAP1cpu != PCPU_GET(cpuid)) {
 			PMAP1cpu = PCPU_GET(cpuid);
 			invlcaddr(PADDR1);
 			PMAP1changedcpu++;
 		} else
 #endif
 			PMAP1unchanged++;
 		return (PADDR1 + (i386_btop(va) & (NPTEPG - 1)));
 	}
 	return (0);
 }
 
 /*
  *	Routine:	pmap_extract
  *	Function:
  *		Extract the physical page address associated
  *		with the given map/virtual_address pair.
  */
 vm_paddr_t 
 pmap_extract(pmap_t pmap, vm_offset_t va)
 {
 	vm_paddr_t rtval;
 	pt_entry_t *pte;
 	pd_entry_t pde;
 
 	rtval = 0;
 	PMAP_LOCK(pmap);
 	pde = pmap->pm_pdir[va >> PDRSHIFT];
 	if (pde != 0) {
 		if ((pde & PG_PS) != 0)
 			rtval = (pde & PG_PS_FRAME) | (va & PDRMASK);
 		else {
 			pte = pmap_pte(pmap, va);
 			rtval = (*pte & PG_FRAME) | (va & PAGE_MASK);
 			pmap_pte_release(pte);
 		}
 	}
 	PMAP_UNLOCK(pmap);
 	return (rtval);
 }
 
 /*
  *	Routine:	pmap_extract_and_hold
  *	Function:
  *		Atomically extract and hold the physical page
  *		with the given pmap and virtual address pair
  *		if that mapping permits the given protection.
  */
 vm_page_t
 pmap_extract_and_hold(pmap_t pmap, vm_offset_t va, vm_prot_t prot)
 {
 	pd_entry_t pde;
 	pt_entry_t pte, *ptep;
 	vm_page_t m;
 	vm_paddr_t pa;
 
 	pa = 0;
 	m = NULL;
 	PMAP_LOCK(pmap);
 retry:
 	pde = *pmap_pde(pmap, va);
 	if (pde != 0) {
 		if (pde & PG_PS) {
 			if ((pde & PG_RW) || (prot & VM_PROT_WRITE) == 0) {
 				if (vm_page_pa_tryrelock(pmap, (pde & PG_PS_FRAME) |
 				       (va & PDRMASK), &pa))
 					goto retry;
 				m = PHYS_TO_VM_PAGE((pde & PG_PS_FRAME) |
 				    (va & PDRMASK));
 				vm_page_hold(m);
 			}
 		} else {
 			ptep = pmap_pte(pmap, va);
 			pte = *ptep;
 			pmap_pte_release(ptep);
 			if (pte != 0 &&
 			    ((pte & PG_RW) || (prot & VM_PROT_WRITE) == 0)) {
 				if (vm_page_pa_tryrelock(pmap, pte & PG_FRAME, &pa))
 					goto retry;
 				m = PHYS_TO_VM_PAGE(pte & PG_FRAME);
 				vm_page_hold(m);
 			}
 		}
 	}
 	PA_UNLOCK_COND(pa);
 	PMAP_UNLOCK(pmap);
 	return (m);
 }
 
 /***************************************************
  * Low level mapping routines.....
  ***************************************************/
 
 /*
  * Add a wired page to the kva.
  * Note: not SMP coherent.
  *
  * This function may be used before pmap_bootstrap() is called.
  */
 PMAP_INLINE void 
 pmap_kenter(vm_offset_t va, vm_paddr_t pa)
 {
 	pt_entry_t *pte;
 
 	pte = vtopte(va);
 	pte_store(pte, pa | PG_RW | PG_V | pgeflag);
 }
 
 static __inline void
 pmap_kenter_attr(vm_offset_t va, vm_paddr_t pa, int mode)
 {
 	pt_entry_t *pte;
 
 	pte = vtopte(va);
 	pte_store(pte, pa | PG_RW | PG_V | pgeflag | pmap_cache_bits(mode, 0));
 }
 
 /*
  * Remove a page from the kernel pagetables.
  * Note: not SMP coherent.
  *
  * This function may be used before pmap_bootstrap() is called.
  */
 PMAP_INLINE void
 pmap_kremove(vm_offset_t va)
 {
 	pt_entry_t *pte;
 
 	pte = vtopte(va);
 	pte_clear(pte);
 }
 
 /*
  *	Used to map a range of physical addresses into kernel
  *	virtual address space.
  *
  *	The value passed in '*virt' is a suggested virtual address for
  *	the mapping. Architectures which can support a direct-mapped
  *	physical to virtual region can return the appropriate address
  *	within that region, leaving '*virt' unchanged. Other
  *	architectures should map the pages starting at '*virt' and
  *	update '*virt' with the first usable address after the mapped
  *	region.
  */
 vm_offset_t
 pmap_map(vm_offset_t *virt, vm_paddr_t start, vm_paddr_t end, int prot)
 {
 	vm_offset_t va, sva;
 
 	va = sva = *virt;
 	while (start < end) {
 		pmap_kenter(va, start);
 		va += PAGE_SIZE;
 		start += PAGE_SIZE;
 	}
 	pmap_invalidate_range(kernel_pmap, sva, va);
 	*virt = va;
 	return (sva);
 }
 
 
 /*
  * Add a list of wired pages to the kva
  * this routine is only used for temporary
  * kernel mappings that do not need to have
  * page modification or references recorded.
  * Note that old mappings are simply written
  * over.  The page *must* be wired.
  * Note: SMP coherent.  Uses a ranged shootdown IPI.
  */
 void
 pmap_qenter(vm_offset_t sva, vm_page_t *ma, int count)
 {
 	pt_entry_t *endpte, oldpte, pa, *pte;
 	vm_page_t m;
 
 	oldpte = 0;
 	pte = vtopte(sva);
 	endpte = pte + count;
 	while (pte < endpte) {
 		m = *ma++;
 		pa = VM_PAGE_TO_PHYS(m) | pmap_cache_bits(m->md.pat_mode, 0);
 		if ((*pte & (PG_FRAME | PG_PTE_CACHE)) != pa) {
 			oldpte |= *pte;
 			pte_store(pte, pa | pgeflag | PG_RW | PG_V);
 		}
 		pte++;
 	}
 	if (__predict_false((oldpte & PG_V) != 0))
 		pmap_invalidate_range(kernel_pmap, sva, sva + count *
 		    PAGE_SIZE);
 }
 
 /*
  * This routine tears out page mappings from the
  * kernel -- it is meant only for temporary mappings.
  * Note: SMP coherent.  Uses a ranged shootdown IPI.
  */
 void
 pmap_qremove(vm_offset_t sva, int count)
 {
 	vm_offset_t va;
 
 	va = sva;
 	while (count-- > 0) {
 		pmap_kremove(va);
 		va += PAGE_SIZE;
 	}
 	pmap_invalidate_range(kernel_pmap, sva, va);
 }
 
 /***************************************************
  * Page table page management routines.....
  ***************************************************/
 static __inline void
 pmap_free_zero_pages(vm_page_t free)
 {
 	vm_page_t m;
 
 	while (free != NULL) {
 		m = free;
 		free = m->right;
 		/* Preserve the page's PG_ZERO setting. */
 		vm_page_free_toq(m);
 	}
 }
 
 /*
  * Schedule the specified unused page table page to be freed.  Specifically,
  * add the page to the specified list of pages that will be released to the
  * physical memory manager after the TLB has been updated.
  */
 static __inline void
 pmap_add_delayed_free_list(vm_page_t m, vm_page_t *free, boolean_t set_PG_ZERO)
 {
 
 	if (set_PG_ZERO)
 		m->flags |= PG_ZERO;
 	else
 		m->flags &= ~PG_ZERO;
 	m->right = *free;
 	*free = m;
 }
 
 /*
  * Inserts the specified page table page into the specified pmap's collection
  * of idle page table pages.  Each of a pmap's page table pages is responsible
  * for mapping a distinct range of virtual addresses.  The pmap's collection is
  * ordered by this virtual address range.
  */
 static void
 pmap_insert_pt_page(pmap_t pmap, vm_page_t mpte)
 {
 	vm_page_t root;
 
 	PMAP_LOCK_ASSERT(pmap, MA_OWNED);
 	root = pmap->pm_root;
 	if (root == NULL) {
 		mpte->left = NULL;
 		mpte->right = NULL;
 	} else {
 		root = vm_page_splay(mpte->pindex, root);
 		if (mpte->pindex < root->pindex) {
 			mpte->left = root->left;
 			mpte->right = root;
 			root->left = NULL;
 		} else if (mpte->pindex == root->pindex)
 			panic("pmap_insert_pt_page: pindex already inserted");
 		else {
 			mpte->right = root->right;
 			mpte->left = root;
 			root->right = NULL;
 		}
 	}
 	pmap->pm_root = mpte;
 }
 
 /*
  * Looks for a page table page mapping the specified virtual address in the
  * specified pmap's collection of idle page table pages.  Returns NULL if there
  * is no page table page corresponding to the specified virtual address.
  */
 static vm_page_t
 pmap_lookup_pt_page(pmap_t pmap, vm_offset_t va)
 {
 	vm_page_t mpte;
 	vm_pindex_t pindex = va >> PDRSHIFT;
 
 	PMAP_LOCK_ASSERT(pmap, MA_OWNED);
 	if ((mpte = pmap->pm_root) != NULL && mpte->pindex != pindex) {
 		mpte = vm_page_splay(pindex, mpte);
 		if ((pmap->pm_root = mpte)->pindex != pindex)
 			mpte = NULL;
 	}
 	return (mpte);
 }
 
 /*
  * Removes the specified page table page from the specified pmap's collection
  * of idle page table pages.  The specified page table page must be a member of
  * the pmap's collection.
  */
 static void
 pmap_remove_pt_page(pmap_t pmap, vm_page_t mpte)
 {
 	vm_page_t root;
 
 	PMAP_LOCK_ASSERT(pmap, MA_OWNED);
 	if (mpte != pmap->pm_root)
 		vm_page_splay(mpte->pindex, pmap->pm_root);
 	if (mpte->left == NULL)
 		root = mpte->right;
 	else {
 		root = vm_page_splay(mpte->pindex, mpte->left);
 		root->right = mpte->right;
 	}
 	pmap->pm_root = root;
 }
 
 /*
  * This routine unholds page table pages, and if the hold count
  * drops to zero, then it decrements the wire count.
  */
 static __inline int
 pmap_unwire_pte_hold(pmap_t pmap, vm_page_t m, vm_page_t *free)
 {
 
 	--m->wire_count;
 	if (m->wire_count == 0)
 		return (_pmap_unwire_pte_hold(pmap, m, free));
 	else
 		return (0);
 }
 
 static int 
 _pmap_unwire_pte_hold(pmap_t pmap, vm_page_t m, vm_page_t *free)
 {
 	vm_offset_t pteva;
 
 	/*
 	 * unmap the page table page
 	 */
 	pmap->pm_pdir[m->pindex] = 0;
 	--pmap->pm_stats.resident_count;
 
 	/*
 	 * This is a release store so that the ordinary store unmapping
 	 * the page table page is globally performed before TLB shoot-
 	 * down is begun.
 	 */
 	atomic_subtract_rel_int(&cnt.v_wire_count, 1);
 
 	/*
 	 * Do an invltlb to make the invalidated mapping
 	 * take effect immediately.
 	 */
 	pteva = VM_MAXUSER_ADDRESS + i386_ptob(m->pindex);
 	pmap_invalidate_page(pmap, pteva);
 
 	/* 
 	 * Put page on a list so that it is released after
 	 * *ALL* TLB shootdown is done
 	 */
 	pmap_add_delayed_free_list(m, free, TRUE);
 
 	return (1);
 }
 
 /*
  * After removing a page table entry, this routine is used to
  * conditionally free the page, and manage the hold/wire counts.
  */
 static int
 pmap_unuse_pt(pmap_t pmap, vm_offset_t va, vm_page_t *free)
 {
 	pd_entry_t ptepde;
 	vm_page_t mpte;
 
 	if (va >= VM_MAXUSER_ADDRESS)
 		return (0);
 	ptepde = *pmap_pde(pmap, va);
 	mpte = PHYS_TO_VM_PAGE(ptepde & PG_FRAME);
 	return (pmap_unwire_pte_hold(pmap, mpte, free));
 }
 
 /*
  * Initialize the pmap for the swapper process.
  */
 void
 pmap_pinit0(pmap_t pmap)
 {
 
 	PMAP_LOCK_INIT(pmap);
 	/*
 	 * Since the page table directory is shared with the kernel pmap,
 	 * which is already included in the list "allpmaps", this pmap does
 	 * not need to be inserted into that list.
 	 */
 	pmap->pm_pdir = (pd_entry_t *)(KERNBASE + (vm_offset_t)IdlePTD);
 #ifdef PAE
 	pmap->pm_pdpt = (pdpt_entry_t *)(KERNBASE + (vm_offset_t)IdlePDPT);
 #endif
 	pmap->pm_root = NULL;
-	pmap->pm_active = 0;
+	CPU_ZERO(&pmap->pm_active);
 	PCPU_SET(curpmap, pmap);
 	TAILQ_INIT(&pmap->pm_pvchunk);
 	bzero(&pmap->pm_stats, sizeof pmap->pm_stats);
 }
 
 /*
  * Initialize a preallocated and zeroed pmap structure,
  * such as one in a vmspace structure.
  */
 int
 pmap_pinit(pmap_t pmap)
 {
 	vm_page_t m, ptdpg[NPGPTD];
 	vm_paddr_t pa;
 	static int color;
 	int i;
 
 	PMAP_LOCK_INIT(pmap);
 
 	/*
 	 * No need to allocate page table space yet but we do need a valid
 	 * page directory table.
 	 */
 	if (pmap->pm_pdir == NULL) {
 		pmap->pm_pdir = (pd_entry_t *)kmem_alloc_nofault(kernel_map,
 		    NBPTD);
 
 		if (pmap->pm_pdir == NULL) {
 			PMAP_LOCK_DESTROY(pmap);
 			return (0);
 		}
 #ifdef PAE
 		pmap->pm_pdpt = uma_zalloc(pdptzone, M_WAITOK | M_ZERO);
 		KASSERT(((vm_offset_t)pmap->pm_pdpt &
 		    ((NPGPTD * sizeof(pdpt_entry_t)) - 1)) == 0,
 		    ("pmap_pinit: pdpt misaligned"));
 		KASSERT(pmap_kextract((vm_offset_t)pmap->pm_pdpt) < (4ULL<<30),
 		    ("pmap_pinit: pdpt above 4g"));
 #endif
 		pmap->pm_root = NULL;
 	}
 	KASSERT(pmap->pm_root == NULL,
 	    ("pmap_pinit: pmap has reserved page table page(s)"));
 
 	/*
 	 * allocate the page directory page(s)
 	 */
 	for (i = 0; i < NPGPTD;) {
 		m = vm_page_alloc(NULL, color++,
 		    VM_ALLOC_NORMAL | VM_ALLOC_NOOBJ | VM_ALLOC_WIRED |
 		    VM_ALLOC_ZERO);
 		if (m == NULL)
 			VM_WAIT;
 		else {
 			ptdpg[i++] = m;
 		}
 	}
 
 	pmap_qenter((vm_offset_t)pmap->pm_pdir, ptdpg, NPGPTD);
 
 	for (i = 0; i < NPGPTD; i++) {
 		if ((ptdpg[i]->flags & PG_ZERO) == 0)
 			bzero(pmap->pm_pdir + (i * NPDEPG), PAGE_SIZE);
 	}
 
 	mtx_lock_spin(&allpmaps_lock);
 	LIST_INSERT_HEAD(&allpmaps, pmap, pm_list);
 	/* Copy the kernel page table directory entries. */
 	bcopy(PTD + KPTDI, pmap->pm_pdir + KPTDI, nkpt * sizeof(pd_entry_t));
 	mtx_unlock_spin(&allpmaps_lock);
 
 	/* install self-referential address mapping entry(s) */
 	for (i = 0; i < NPGPTD; i++) {
 		pa = VM_PAGE_TO_PHYS(ptdpg[i]);
 		pmap->pm_pdir[PTDPTDI + i] = pa | PG_V | PG_RW | PG_A | PG_M;
 #ifdef PAE
 		pmap->pm_pdpt[i] = pa | PG_V;
 #endif
 	}
 
-	pmap->pm_active = 0;
+	CPU_ZERO(&pmap->pm_active);
 	TAILQ_INIT(&pmap->pm_pvchunk);
 	bzero(&pmap->pm_stats, sizeof pmap->pm_stats);
 
 	return (1);
 }
 
 /*
  * this routine is called if the page table page is not
  * mapped correctly.
  */
 static vm_page_t
 _pmap_allocpte(pmap_t pmap, unsigned ptepindex, int flags)
 {
 	vm_paddr_t ptepa;
 	vm_page_t m;
 
 	KASSERT((flags & (M_NOWAIT | M_WAITOK)) == M_NOWAIT ||
 	    (flags & (M_NOWAIT | M_WAITOK)) == M_WAITOK,
 	    ("_pmap_allocpte: flags is neither M_NOWAIT nor M_WAITOK"));
 
 	/*
 	 * Allocate a page table page.
 	 */
 	if ((m = vm_page_alloc(NULL, ptepindex, VM_ALLOC_NOOBJ |
 	    VM_ALLOC_WIRED | VM_ALLOC_ZERO)) == NULL) {
 		if (flags & M_WAITOK) {
 			PMAP_UNLOCK(pmap);
 			vm_page_unlock_queues();
 			VM_WAIT;
 			vm_page_lock_queues();
 			PMAP_LOCK(pmap);
 		}
 
 		/*
 		 * Indicate the need to retry.  While waiting, the page table
 		 * page may have been allocated.
 		 */
 		return (NULL);
 	}
 	if ((m->flags & PG_ZERO) == 0)
 		pmap_zero_page(m);
 
 	/*
 	 * Map the pagetable page into the process address space, if
 	 * it isn't already there.
 	 */
 
 	pmap->pm_stats.resident_count++;
 
 	ptepa = VM_PAGE_TO_PHYS(m);
 	pmap->pm_pdir[ptepindex] =
 		(pd_entry_t) (ptepa | PG_U | PG_RW | PG_V | PG_A | PG_M);
 
 	return (m);
 }
 
 static vm_page_t
 pmap_allocpte(pmap_t pmap, vm_offset_t va, int flags)
 {
 	unsigned ptepindex;
 	pd_entry_t ptepa;
 	vm_page_t m;
 
 	KASSERT((flags & (M_NOWAIT | M_WAITOK)) == M_NOWAIT ||
 	    (flags & (M_NOWAIT | M_WAITOK)) == M_WAITOK,
 	    ("pmap_allocpte: flags is neither M_NOWAIT nor M_WAITOK"));
 
 	/*
 	 * Calculate pagetable page index
 	 */
 	ptepindex = va >> PDRSHIFT;
 retry:
 	/*
 	 * Get the page directory entry
 	 */
 	ptepa = pmap->pm_pdir[ptepindex];
 
 	/*
 	 * This supports switching from a 4MB page to a
 	 * normal 4K page.
 	 */
 	if (ptepa & PG_PS) {
 		(void)pmap_demote_pde(pmap, &pmap->pm_pdir[ptepindex], va);
 		ptepa = pmap->pm_pdir[ptepindex];
 	}
 
 	/*
 	 * If the page table page is mapped, we just increment the
 	 * hold count, and activate it.
 	 */
 	if (ptepa) {
 		m = PHYS_TO_VM_PAGE(ptepa & PG_FRAME);
 		m->wire_count++;
 	} else {
 		/*
 		 * Here if the pte page isn't mapped, or if it has
 		 * been deallocated. 
 		 */
 		m = _pmap_allocpte(pmap, ptepindex, flags);
 		if (m == NULL && (flags & M_WAITOK))
 			goto retry;
 	}
 	return (m);
 }
 
 
 /***************************************************
 * Pmap allocation/deallocation routines.
  ***************************************************/
 
 #ifdef SMP
 /*
  * Deal with a SMP shootdown of other users of the pmap that we are
  * trying to dispose of.  This can be a bit hairy.
  */
-static cpumask_t *lazymask;
+static cpuset_t *lazymask;
 static u_int lazyptd;
 static volatile u_int lazywait;
 
 void pmap_lazyfix_action(void);
 
 void
 pmap_lazyfix_action(void)
 {
-	cpumask_t mymask = PCPU_GET(cpumask);
 
 #ifdef COUNT_IPIS
 	(*ipi_lazypmap_counts[PCPU_GET(cpuid)])++;
 #endif
 	if (rcr3() == lazyptd)
 		load_cr3(PCPU_GET(curpcb)->pcb_cr3);
-	atomic_clear_int(lazymask, mymask);
+	CPU_CLR_ATOMIC(PCPU_GET(cpuid), lazymask);
 	atomic_store_rel_int(&lazywait, 1);
 }
 
 static void
-pmap_lazyfix_self(cpumask_t mymask)
+pmap_lazyfix_self(cpuset_t mymask)
 {
 
 	if (rcr3() == lazyptd)
 		load_cr3(PCPU_GET(curpcb)->pcb_cr3);
-	atomic_clear_int(lazymask, mymask);
+	CPU_NAND_ATOMIC(lazymask, &mymask);
 }
 
 
 static void
 pmap_lazyfix(pmap_t pmap)
 {
-	cpumask_t mymask, mask;
+	cpuset_t mymask, mask;
 	u_int spins;
+	int lsb;
 
-	while ((mask = pmap->pm_active) != 0) {
+	mask = pmap->pm_active;
+	while (!CPU_EMPTY(&mask)) {
 		spins = 50000000;
-		mask = mask & -mask;	/* Find least significant set bit */
+
+		/* Find least significant set bit. */
+		lsb = cpusetobj_ffs(&mask);
+		MPASS(lsb != 0);
+		lsb--;
+		CPU_SETOF(lsb, &mask);
 		mtx_lock_spin(&smp_ipi_mtx);
 #ifdef PAE
 		lazyptd = vtophys(pmap->pm_pdpt);
 #else
 		lazyptd = vtophys(pmap->pm_pdir);
 #endif
 		mymask = PCPU_GET(cpumask);
-		if (mask == mymask) {
+		if (!CPU_CMP(&mask, &mymask)) {
 			lazymask = &pmap->pm_active;
 			pmap_lazyfix_self(mymask);
 		} else {
 			atomic_store_rel_int((u_int *)&lazymask,
 			    (u_int)&pmap->pm_active);
 			atomic_store_rel_int(&lazywait, 0);
 			ipi_selected(mask, IPI_LAZYPMAP);
 			while (lazywait == 0) {
 				ia32_pause();
 				if (--spins == 0)
 					break;
 			}
 		}
 		mtx_unlock_spin(&smp_ipi_mtx);
 		if (spins == 0)
 			printf("pmap_lazyfix: spun for 50000000\n");
+		mask = pmap->pm_active;
 	}
 }
 
 #else	/* SMP */
 
 /*
  * Cleaning up on uniprocessor is easy.  For various reasons, we're
  * unlikely to have to even execute this code, including the fact
  * that the cleanup is deferred until the parent does a wait(2), which
  * means that another userland process has run.
  */
 static void
 pmap_lazyfix(pmap_t pmap)
 {
 	u_int cr3;
 
 	cr3 = vtophys(pmap->pm_pdir);
 	if (cr3 == rcr3()) {
 		load_cr3(PCPU_GET(curpcb)->pcb_cr3);
-		pmap->pm_active &= ~(PCPU_GET(cpumask));
+		CPU_CLR(PCPU_GET(cpuid), &pmap->pm_active);
 	}
 }
 #endif	/* SMP */
 
 /*
  * Release any resources held by the given physical map.
  * Called when a pmap initialized by pmap_pinit is being released.
  * Should only be called if the map contains no valid mappings.
  */
 void
 pmap_release(pmap_t pmap)
 {
 	vm_page_t m, ptdpg[NPGPTD];
 	int i;
 
 	KASSERT(pmap->pm_stats.resident_count == 0,
 	    ("pmap_release: pmap resident count %ld != 0",
 	    pmap->pm_stats.resident_count));
 	KASSERT(pmap->pm_root == NULL,
 	    ("pmap_release: pmap has reserved page table page(s)"));
 
 	pmap_lazyfix(pmap);
 	mtx_lock_spin(&allpmaps_lock);
 	LIST_REMOVE(pmap, pm_list);
 	mtx_unlock_spin(&allpmaps_lock);
 
 	for (i = 0; i < NPGPTD; i++)
 		ptdpg[i] = PHYS_TO_VM_PAGE(pmap->pm_pdir[PTDPTDI + i] &
 		    PG_FRAME);
 
 	bzero(pmap->pm_pdir + PTDPTDI, (nkpt + NPGPTD) *
 	    sizeof(*pmap->pm_pdir));
 
 	pmap_qremove((vm_offset_t)pmap->pm_pdir, NPGPTD);
 
 	for (i = 0; i < NPGPTD; i++) {
 		m = ptdpg[i];
 #ifdef PAE
 		KASSERT(VM_PAGE_TO_PHYS(m) == (pmap->pm_pdpt[i] & PG_FRAME),
 		    ("pmap_release: got wrong ptd page"));
 #endif
 		m->wire_count--;
 		atomic_subtract_int(&cnt.v_wire_count, 1);
 		vm_page_free_zero(m);
 	}
 	PMAP_LOCK_DESTROY(pmap);
 }
 
 static int
 kvm_size(SYSCTL_HANDLER_ARGS)
 {
 	unsigned long ksize = VM_MAX_KERNEL_ADDRESS - KERNBASE;
 
 	return (sysctl_handle_long(oidp, &ksize, 0, req));
 }
 SYSCTL_PROC(_vm, OID_AUTO, kvm_size, CTLTYPE_LONG|CTLFLAG_RD, 
     0, 0, kvm_size, "IU", "Size of KVM");
 
 static int
 kvm_free(SYSCTL_HANDLER_ARGS)
 {
 	unsigned long kfree = VM_MAX_KERNEL_ADDRESS - kernel_vm_end;
 
 	return (sysctl_handle_long(oidp, &kfree, 0, req));
 }
 SYSCTL_PROC(_vm, OID_AUTO, kvm_free, CTLTYPE_LONG|CTLFLAG_RD, 
     0, 0, kvm_free, "IU", "Amount of KVM free");
 
 /*
  * grow the number of kernel page table entries, if needed
  */
 void
 pmap_growkernel(vm_offset_t addr)
 {
 	vm_paddr_t ptppaddr;
 	vm_page_t nkpg;
 	pd_entry_t newpdir;
 
 	mtx_assert(&kernel_map->system_mtx, MA_OWNED);
 	addr = roundup2(addr, NBPDR);
 	if (addr - 1 >= kernel_map->max_offset)
 		addr = kernel_map->max_offset;
 	while (kernel_vm_end < addr) {
 		if (pdir_pde(PTD, kernel_vm_end)) {
 			kernel_vm_end = (kernel_vm_end + NBPDR) & ~PDRMASK;
 			if (kernel_vm_end - 1 >= kernel_map->max_offset) {
 				kernel_vm_end = kernel_map->max_offset;
 				break;
 			}
 			continue;
 		}
 
 		nkpg = vm_page_alloc(NULL, kernel_vm_end >> PDRSHIFT,
 		    VM_ALLOC_INTERRUPT | VM_ALLOC_NOOBJ | VM_ALLOC_WIRED |
 		    VM_ALLOC_ZERO);
 		if (nkpg == NULL)
 			panic("pmap_growkernel: no memory to grow kernel");
 
 		nkpt++;
 
 		if ((nkpg->flags & PG_ZERO) == 0)
 			pmap_zero_page(nkpg);
 		ptppaddr = VM_PAGE_TO_PHYS(nkpg);
 		newpdir = (pd_entry_t) (ptppaddr | PG_V | PG_RW | PG_A | PG_M);
 		pdir_pde(KPTD, kernel_vm_end) = pgeflag | newpdir;
 
 		pmap_kenter_pde(kernel_vm_end, newpdir);
 		kernel_vm_end = (kernel_vm_end + NBPDR) & ~PDRMASK;
 		if (kernel_vm_end - 1 >= kernel_map->max_offset) {
 			kernel_vm_end = kernel_map->max_offset;
 			break;
 		}
 	}
 }
 
 
 /***************************************************
  * page management routines.
  ***************************************************/
 
 CTASSERT(sizeof(struct pv_chunk) == PAGE_SIZE);
 CTASSERT(_NPCM == 11);
 
 static __inline struct pv_chunk *
 pv_to_chunk(pv_entry_t pv)
 {
 
 	return ((struct pv_chunk *)((uintptr_t)pv & ~(uintptr_t)PAGE_MASK));
 }
 
 #define PV_PMAP(pv) (pv_to_chunk(pv)->pc_pmap)
 
 #define	PC_FREE0_9	0xfffffffful	/* Free values for index 0 through 9 */
 #define	PC_FREE10	0x0000fffful	/* Free values for index 10 */
 
 static uint32_t pc_freemask[11] = {
 	PC_FREE0_9, PC_FREE0_9, PC_FREE0_9,
 	PC_FREE0_9, PC_FREE0_9, PC_FREE0_9,
 	PC_FREE0_9, PC_FREE0_9, PC_FREE0_9,
 	PC_FREE0_9, PC_FREE10
 };
 
 SYSCTL_INT(_vm_pmap, OID_AUTO, pv_entry_count, CTLFLAG_RD, &pv_entry_count, 0,
 	"Current number of pv entries");
 
 #ifdef PV_STATS
 static int pc_chunk_count, pc_chunk_allocs, pc_chunk_frees, pc_chunk_tryfail;
 
 SYSCTL_INT(_vm_pmap, OID_AUTO, pc_chunk_count, CTLFLAG_RD, &pc_chunk_count, 0,
 	"Current number of pv entry chunks");
 SYSCTL_INT(_vm_pmap, OID_AUTO, pc_chunk_allocs, CTLFLAG_RD, &pc_chunk_allocs, 0,
 	"Current number of pv entry chunks allocated");
 SYSCTL_INT(_vm_pmap, OID_AUTO, pc_chunk_frees, CTLFLAG_RD, &pc_chunk_frees, 0,
 	"Current number of pv entry chunks frees");
 SYSCTL_INT(_vm_pmap, OID_AUTO, pc_chunk_tryfail, CTLFLAG_RD, &pc_chunk_tryfail, 0,
 	"Number of times tried to get a chunk page but failed.");
 
 static long pv_entry_frees, pv_entry_allocs;
 static int pv_entry_spare;
 
 SYSCTL_LONG(_vm_pmap, OID_AUTO, pv_entry_frees, CTLFLAG_RD, &pv_entry_frees, 0,
 	"Current number of pv entry frees");
 SYSCTL_LONG(_vm_pmap, OID_AUTO, pv_entry_allocs, CTLFLAG_RD, &pv_entry_allocs, 0,
 	"Current number of pv entry allocs");
 SYSCTL_INT(_vm_pmap, OID_AUTO, pv_entry_spare, CTLFLAG_RD, &pv_entry_spare, 0,
 	"Current number of spare pv entries");
 
 static int pmap_collect_inactive, pmap_collect_active;
 
 SYSCTL_INT(_vm_pmap, OID_AUTO, pmap_collect_inactive, CTLFLAG_RD, &pmap_collect_inactive, 0,
 	"Current number times pmap_collect called on inactive queue");
 SYSCTL_INT(_vm_pmap, OID_AUTO, pmap_collect_active, CTLFLAG_RD, &pmap_collect_active, 0,
 	"Current number times pmap_collect called on active queue");
 #endif
 
 /*
  * We are in a serious low memory condition.  Resort to
  * drastic measures to free some pages so we can allocate
  * another pv entry chunk.  This is normally called to
  * unmap inactive pages, and if necessary, active pages.
  */
 static void
 pmap_collect(pmap_t locked_pmap, struct vpgqueues *vpq)
 {
 	pd_entry_t *pde;
 	pmap_t pmap;
 	pt_entry_t *pte, tpte;
 	pv_entry_t next_pv, pv;
 	vm_offset_t va;
 	vm_page_t m, free;
 
 	sched_pin();
 	TAILQ_FOREACH(m, &vpq->pl, pageq) {
 		if (m->hold_count || m->busy)
 			continue;
 		TAILQ_FOREACH_SAFE(pv, &m->md.pv_list, pv_list, next_pv) {
 			va = pv->pv_va;
 			pmap = PV_PMAP(pv);
 			/* Avoid deadlock and lock recursion. */
 			if (pmap > locked_pmap)
 				PMAP_LOCK(pmap);
 			else if (pmap != locked_pmap && !PMAP_TRYLOCK(pmap))
 				continue;
 			pmap->pm_stats.resident_count--;
 			pde = pmap_pde(pmap, va);
 			KASSERT((*pde & PG_PS) == 0, ("pmap_collect: found"
 			    " a 4mpage in page %p's pv list", m));
 			pte = pmap_pte_quick(pmap, va);
 			tpte = pte_load_clear(pte);
 			KASSERT((tpte & PG_W) == 0,
 			    ("pmap_collect: wired pte %#jx", (uintmax_t)tpte));
 			if (tpte & PG_A)
 				vm_page_flag_set(m, PG_REFERENCED);
 			if ((tpte & (PG_M | PG_RW)) == (PG_M | PG_RW))
 				vm_page_dirty(m);
 			free = NULL;
 			pmap_unuse_pt(pmap, va, &free);
 			pmap_invalidate_page(pmap, va);
 			pmap_free_zero_pages(free);
 			TAILQ_REMOVE(&m->md.pv_list, pv, pv_list);
 			free_pv_entry(pmap, pv);
 			if (pmap != locked_pmap)
 				PMAP_UNLOCK(pmap);
 		}
 		if (TAILQ_EMPTY(&m->md.pv_list) &&
 		    TAILQ_EMPTY(&pa_to_pvh(VM_PAGE_TO_PHYS(m))->pv_list))
 			vm_page_flag_clear(m, PG_WRITEABLE);
 	}
 	sched_unpin();
 }
 
 
 /*
  * free the pv_entry back to the free list
  */
 static void
 free_pv_entry(pmap_t pmap, pv_entry_t pv)
 {
 	vm_page_t m;
 	struct pv_chunk *pc;
 	int idx, field, bit;
 
 	mtx_assert(&vm_page_queue_mtx, MA_OWNED);
 	PMAP_LOCK_ASSERT(pmap, MA_OWNED);
 	PV_STAT(pv_entry_frees++);
 	PV_STAT(pv_entry_spare++);
 	pv_entry_count--;
 	pc = pv_to_chunk(pv);
 	idx = pv - &pc->pc_pventry[0];
 	field = idx / 32;
 	bit = idx % 32;
 	pc->pc_map[field] |= 1ul << bit;
 	/* move to head of list */
 	TAILQ_REMOVE(&pmap->pm_pvchunk, pc, pc_list);
 	for (idx = 0; idx < _NPCM; idx++)
 		if (pc->pc_map[idx] != pc_freemask[idx]) {
 			TAILQ_INSERT_HEAD(&pmap->pm_pvchunk, pc, pc_list);
 			return;
 		}
 	PV_STAT(pv_entry_spare -= _NPCPV);
 	PV_STAT(pc_chunk_count--);
 	PV_STAT(pc_chunk_frees++);
 	/* entire chunk is free, return it */
 	m = PHYS_TO_VM_PAGE(pmap_kextract((vm_offset_t)pc));
 	pmap_qremove((vm_offset_t)pc, 1);
 	vm_page_unwire(m, 0);
 	vm_page_free(m);
 	pmap_ptelist_free(&pv_vafree, (vm_offset_t)pc);
 }
 
 /*
  * get a new pv_entry, allocating a block from the system
  * when needed.
  */
 static pv_entry_t
 get_pv_entry(pmap_t pmap, int try)
 {
 	static const struct timeval printinterval = { 60, 0 };
 	static struct timeval lastprint;
 	static vm_pindex_t colour;
 	struct vpgqueues *pq;
 	int bit, field;
 	pv_entry_t pv;
 	struct pv_chunk *pc;
 	vm_page_t m;
 
 	PMAP_LOCK_ASSERT(pmap, MA_OWNED);
 	mtx_assert(&vm_page_queue_mtx, MA_OWNED);
 	PV_STAT(pv_entry_allocs++);
 	pv_entry_count++;
 	if (pv_entry_count > pv_entry_high_water)
 		if (ratecheck(&lastprint, &printinterval))
 			printf("Approaching the limit on PV entries, consider "
 			    "increasing either the vm.pmap.shpgperproc or the "
 			    "vm.pmap.pv_entry_max tunable.\n");
 	pq = NULL;
 retry:
 	pc = TAILQ_FIRST(&pmap->pm_pvchunk);
 	if (pc != NULL) {
 		for (field = 0; field < _NPCM; field++) {
 			if (pc->pc_map[field]) {
 				bit = bsfl(pc->pc_map[field]);
 				break;
 			}
 		}
 		if (field < _NPCM) {
 			pv = &pc->pc_pventry[field * 32 + bit];
 			pc->pc_map[field] &= ~(1ul << bit);
 			/* If this was the last item, move it to tail */
 			for (field = 0; field < _NPCM; field++)
 				if (pc->pc_map[field] != 0) {
 					PV_STAT(pv_entry_spare--);
 					return (pv);	/* not full, return */
 				}
 			TAILQ_REMOVE(&pmap->pm_pvchunk, pc, pc_list);
 			TAILQ_INSERT_TAIL(&pmap->pm_pvchunk, pc, pc_list);
 			PV_STAT(pv_entry_spare--);
 			return (pv);
 		}
 	}
 	/*
 	 * Access to the ptelist "pv_vafree" is synchronized by the page
 	 * queues lock.  If "pv_vafree" is currently non-empty, it will
 	 * remain non-empty until pmap_ptelist_alloc() completes.
 	 */
 	if (pv_vafree == 0 || (m = vm_page_alloc(NULL, colour, (pq ==
 	    &vm_page_queues[PQ_ACTIVE] ? VM_ALLOC_SYSTEM : VM_ALLOC_NORMAL) |
 	    VM_ALLOC_NOOBJ | VM_ALLOC_WIRED)) == NULL) {
 		if (try) {
 			pv_entry_count--;
 			PV_STAT(pc_chunk_tryfail++);
 			return (NULL);
 		}
 		/*
 		 * Reclaim pv entries: At first, destroy mappings to
 		 * inactive pages.  After that, if a pv chunk entry
 		 * is still needed, destroy mappings to active pages.
 		 */
 		if (pq == NULL) {
 			PV_STAT(pmap_collect_inactive++);
 			pq = &vm_page_queues[PQ_INACTIVE];
 		} else if (pq == &vm_page_queues[PQ_INACTIVE]) {
 			PV_STAT(pmap_collect_active++);
 			pq = &vm_page_queues[PQ_ACTIVE];
 		} else
 			panic("get_pv_entry: increase vm.pmap.shpgperproc");
 		pmap_collect(pmap, pq);
 		goto retry;
 	}
 	PV_STAT(pc_chunk_count++);
 	PV_STAT(pc_chunk_allocs++);
 	colour++;
 	pc = (struct pv_chunk *)pmap_ptelist_alloc(&pv_vafree);
 	pmap_qenter((vm_offset_t)pc, &m, 1);
 	pc->pc_pmap = pmap;
 	pc->pc_map[0] = pc_freemask[0] & ~1ul;	/* preallocated bit 0 */
 	for (field = 1; field < _NPCM; field++)
 		pc->pc_map[field] = pc_freemask[field];
 	pv = &pc->pc_pventry[0];
 	TAILQ_INSERT_HEAD(&pmap->pm_pvchunk, pc, pc_list);
 	PV_STAT(pv_entry_spare += _NPCPV - 1);
 	return (pv);
 }
 
 static __inline pv_entry_t
 pmap_pvh_remove(struct md_page *pvh, pmap_t pmap, vm_offset_t va)
 {
 	pv_entry_t pv;
 
 	mtx_assert(&vm_page_queue_mtx, MA_OWNED);
 	TAILQ_FOREACH(pv, &pvh->pv_list, pv_list) {
 		if (pmap == PV_PMAP(pv) && va == pv->pv_va) {
 			TAILQ_REMOVE(&pvh->pv_list, pv, pv_list);
 			break;
 		}
 	}
 	return (pv);
 }
 
 static void
 pmap_pv_demote_pde(pmap_t pmap, vm_offset_t va, vm_paddr_t pa)
 {
 	struct md_page *pvh;
 	pv_entry_t pv;
 	vm_offset_t va_last;
 	vm_page_t m;
 
 	mtx_assert(&vm_page_queue_mtx, MA_OWNED);
 	KASSERT((pa & PDRMASK) == 0,
 	    ("pmap_pv_demote_pde: pa is not 4mpage aligned"));
 
 	/*
 	 * Transfer the 4mpage's pv entry for this mapping to the first
 	 * page's pv list.
 	 */
 	pvh = pa_to_pvh(pa);
 	va = trunc_4mpage(va);
 	pv = pmap_pvh_remove(pvh, pmap, va);
 	KASSERT(pv != NULL, ("pmap_pv_demote_pde: pv not found"));
 	m = PHYS_TO_VM_PAGE(pa);
 	TAILQ_INSERT_TAIL(&m->md.pv_list, pv, pv_list);
 	/* Instantiate the remaining NPTEPG - 1 pv entries. */
 	va_last = va + NBPDR - PAGE_SIZE;
 	do {
 		m++;
 		KASSERT((m->flags & (PG_FICTITIOUS | PG_UNMANAGED)) == 0,
 		    ("pmap_pv_demote_pde: page %p is not managed", m));
 		va += PAGE_SIZE;
 		pmap_insert_entry(pmap, va, m);
 	} while (va < va_last);
 }
 
 static void
 pmap_pv_promote_pde(pmap_t pmap, vm_offset_t va, vm_paddr_t pa)
 {
 	struct md_page *pvh;
 	pv_entry_t pv;
 	vm_offset_t va_last;
 	vm_page_t m;
 
 	mtx_assert(&vm_page_queue_mtx, MA_OWNED);
 	KASSERT((pa & PDRMASK) == 0,
 	    ("pmap_pv_promote_pde: pa is not 4mpage aligned"));
 
 	/*
 	 * Transfer the first page's pv entry for this mapping to the
 	 * 4mpage's pv list.  Aside from avoiding the cost of a call
 	 * to get_pv_entry(), a transfer avoids the possibility that
 	 * get_pv_entry() calls pmap_collect() and that pmap_collect()
 	 * removes one of the mappings that is being promoted.
 	 */
 	m = PHYS_TO_VM_PAGE(pa);
 	va = trunc_4mpage(va);
 	pv = pmap_pvh_remove(&m->md, pmap, va);
 	KASSERT(pv != NULL, ("pmap_pv_promote_pde: pv not found"));
 	pvh = pa_to_pvh(pa);
 	TAILQ_INSERT_TAIL(&pvh->pv_list, pv, pv_list);
 	/* Free the remaining NPTEPG - 1 pv entries. */
 	va_last = va + NBPDR - PAGE_SIZE;
 	do {
 		m++;
 		va += PAGE_SIZE;
 		pmap_pvh_free(&m->md, pmap, va);
 	} while (va < va_last);
 }
 
 static void
 pmap_pvh_free(struct md_page *pvh, pmap_t pmap, vm_offset_t va)
 {
 	pv_entry_t pv;
 
 	pv = pmap_pvh_remove(pvh, pmap, va);
 	KASSERT(pv != NULL, ("pmap_pvh_free: pv not found"));
 	free_pv_entry(pmap, pv);
 }
 
 static void
 pmap_remove_entry(pmap_t pmap, vm_page_t m, vm_offset_t va)
 {
 	struct md_page *pvh;
 
 	mtx_assert(&vm_page_queue_mtx, MA_OWNED);
 	pmap_pvh_free(&m->md, pmap, va);
 	if (TAILQ_EMPTY(&m->md.pv_list)) {
 		pvh = pa_to_pvh(VM_PAGE_TO_PHYS(m));
 		if (TAILQ_EMPTY(&pvh->pv_list))
 			vm_page_flag_clear(m, PG_WRITEABLE);
 	}
 }
 
 /*
  * Create a pv entry for page at pa for
  * (pmap, va).
  */
 static void
 pmap_insert_entry(pmap_t pmap, vm_offset_t va, vm_page_t m)
 {
 	pv_entry_t pv;
 
 	PMAP_LOCK_ASSERT(pmap, MA_OWNED);
 	mtx_assert(&vm_page_queue_mtx, MA_OWNED);
 	pv = get_pv_entry(pmap, FALSE);
 	pv->pv_va = va;
 	TAILQ_INSERT_TAIL(&m->md.pv_list, pv, pv_list);
 }
 
 /*
  * Conditionally create a pv entry.
  */
 static boolean_t
 pmap_try_insert_pv_entry(pmap_t pmap, vm_offset_t va, vm_page_t m)
 {
 	pv_entry_t pv;
 
 	PMAP_LOCK_ASSERT(pmap, MA_OWNED);
 	mtx_assert(&vm_page_queue_mtx, MA_OWNED);
 	if (pv_entry_count < pv_entry_high_water && 
 	    (pv = get_pv_entry(pmap, TRUE)) != NULL) {
 		pv->pv_va = va;
 		TAILQ_INSERT_TAIL(&m->md.pv_list, pv, pv_list);
 		return (TRUE);
 	} else
 		return (FALSE);
 }
 
 /*
  * Create the pv entries for each of the pages within a superpage.
  */
 static boolean_t
 pmap_pv_insert_pde(pmap_t pmap, vm_offset_t va, vm_paddr_t pa)
 {
 	struct md_page *pvh;
 	pv_entry_t pv;
 
 	mtx_assert(&vm_page_queue_mtx, MA_OWNED);
 	if (pv_entry_count < pv_entry_high_water && 
 	    (pv = get_pv_entry(pmap, TRUE)) != NULL) {
 		pv->pv_va = va;
 		pvh = pa_to_pvh(pa);
 		TAILQ_INSERT_TAIL(&pvh->pv_list, pv, pv_list);
 		return (TRUE);
 	} else
 		return (FALSE);
 }
 
 /*
  * Fills a page table page with mappings to consecutive physical pages.
  */
 static void
 pmap_fill_ptp(pt_entry_t *firstpte, pt_entry_t newpte)
 {
 	pt_entry_t *pte;
 
 	for (pte = firstpte; pte < firstpte + NPTEPG; pte++) {
 		*pte = newpte;	
 		newpte += PAGE_SIZE;
 	}
 }
 
 /*
  * Tries to demote a 2- or 4MB page mapping.  If demotion fails, the
  * 2- or 4MB page mapping is invalidated.
  */
 static boolean_t
 pmap_demote_pde(pmap_t pmap, pd_entry_t *pde, vm_offset_t va)
 {
 	pd_entry_t newpde, oldpde;
 	pt_entry_t *firstpte, newpte;
 	vm_paddr_t mptepa;
 	vm_page_t free, mpte;
 
 	PMAP_LOCK_ASSERT(pmap, MA_OWNED);
 	oldpde = *pde;
 	KASSERT((oldpde & (PG_PS | PG_V)) == (PG_PS | PG_V),
 	    ("pmap_demote_pde: oldpde is missing PG_PS and/or PG_V"));
 	mpte = pmap_lookup_pt_page(pmap, va);
 	if (mpte != NULL)
 		pmap_remove_pt_page(pmap, mpte);
 	else {
 		KASSERT((oldpde & PG_W) == 0,
 		    ("pmap_demote_pde: page table page for a wired mapping"
 		    " is missing"));
 
 		/*
 		 * Invalidate the 2- or 4MB page mapping and return
 		 * "failure" if the mapping was never accessed or the
 		 * allocation of the new page table page fails.
 		 */
 		if ((oldpde & PG_A) == 0 || (mpte = vm_page_alloc(NULL,
 		    va >> PDRSHIFT, VM_ALLOC_NOOBJ | VM_ALLOC_NORMAL |
 		    VM_ALLOC_WIRED)) == NULL) {
 			free = NULL;
 			pmap_remove_pde(pmap, pde, trunc_4mpage(va), &free);
 			pmap_invalidate_page(pmap, trunc_4mpage(va));
 			pmap_free_zero_pages(free);
 			CTR2(KTR_PMAP, "pmap_demote_pde: failure for va %#x"
 			    " in pmap %p", va, pmap);
 			return (FALSE);
 		}
 		if (va < VM_MAXUSER_ADDRESS)
 			pmap->pm_stats.resident_count++;
 	}
 	mptepa = VM_PAGE_TO_PHYS(mpte);
 
 	/*
 	 * If the page mapping is in the kernel's address space, then the
 	 * KPTmap can provide access to the page table page.  Otherwise,
 	 * temporarily map the page table page (mpte) into the kernel's
 	 * address space at either PADDR1 or PADDR2. 
 	 */
 	if (va >= KERNBASE)
 		firstpte = &KPTmap[i386_btop(trunc_4mpage(va))];
 	else if (curthread->td_pinned > 0 && mtx_owned(&vm_page_queue_mtx)) {
 		if ((*PMAP1 & PG_FRAME) != mptepa) {
 			*PMAP1 = mptepa | PG_RW | PG_V | PG_A | PG_M;
 #ifdef SMP
 			PMAP1cpu = PCPU_GET(cpuid);
 #endif
 			invlcaddr(PADDR1);
 			PMAP1changed++;
 		} else
 #ifdef SMP
 		if (PMAP1cpu != PCPU_GET(cpuid)) {
 			PMAP1cpu = PCPU_GET(cpuid);
 			invlcaddr(PADDR1);
 			PMAP1changedcpu++;
 		} else
 #endif
 			PMAP1unchanged++;
 		firstpte = PADDR1;
 	} else {
 		mtx_lock(&PMAP2mutex);
 		if ((*PMAP2 & PG_FRAME) != mptepa) {
 			*PMAP2 = mptepa | PG_RW | PG_V | PG_A | PG_M;
 			pmap_invalidate_page(kernel_pmap, (vm_offset_t)PADDR2);
 		}
 		firstpte = PADDR2;
 	}
 	newpde = mptepa | PG_M | PG_A | (oldpde & PG_U) | PG_RW | PG_V;
 	KASSERT((oldpde & PG_A) != 0,
 	    ("pmap_demote_pde: oldpde is missing PG_A"));
 	KASSERT((oldpde & (PG_M | PG_RW)) != PG_RW,
 	    ("pmap_demote_pde: oldpde is missing PG_M"));
 	newpte = oldpde & ~PG_PS;
 	if ((newpte & PG_PDE_PAT) != 0)
 		newpte ^= PG_PDE_PAT | PG_PTE_PAT;
 
 	/*
 	 * If the page table page is new, initialize it.
 	 */
 	if (mpte->wire_count == 1) {
 		mpte->wire_count = NPTEPG;
 		pmap_fill_ptp(firstpte, newpte);
 	}
 	KASSERT((*firstpte & PG_FRAME) == (newpte & PG_FRAME),
 	    ("pmap_demote_pde: firstpte and newpte map different physical"
 	    " addresses"));
 
 	/*
 	 * If the mapping has changed attributes, update the page table
 	 * entries.
 	 */ 
 	if ((*firstpte & PG_PTE_PROMOTE) != (newpte & PG_PTE_PROMOTE))
 		pmap_fill_ptp(firstpte, newpte);
 	
 	/*
 	 * Demote the mapping.  This pmap is locked.  The old PDE has
 	 * PG_A set.  If the old PDE has PG_RW set, it also has PG_M
 	 * set.  Thus, there is no danger of a race with another
 	 * processor changing the setting of PG_A and/or PG_M between
 	 * the read above and the store below. 
 	 */
 	if (workaround_erratum383)
 		pmap_update_pde(pmap, va, pde, newpde);
 	else if (pmap == kernel_pmap)
 		pmap_kenter_pde(va, newpde);
 	else
 		pde_store(pde, newpde);	
 	if (firstpte == PADDR2)
 		mtx_unlock(&PMAP2mutex);
 
 	/*
 	 * Invalidate the recursive mapping of the page table page.
 	 */
 	pmap_invalidate_page(pmap, (vm_offset_t)vtopte(va));
 
 	/*
 	 * Demote the pv entry.  This depends on the earlier demotion
 	 * of the mapping.  Specifically, the (re)creation of a per-
 	 * page pv entry might trigger the execution of pmap_collect(),
 	 * which might reclaim a newly (re)created per-page pv entry
 	 * and destroy the associated mapping.  In order to destroy
 	 * the mapping, the PDE must have already changed from mapping
 	 * the 2mpage to referencing the page table page.
 	 */
 	if ((oldpde & PG_MANAGED) != 0)
 		pmap_pv_demote_pde(pmap, va, oldpde & PG_PS_FRAME);
 
 	pmap_pde_demotions++;
 	CTR2(KTR_PMAP, "pmap_demote_pde: success for va %#x"
 	    " in pmap %p", va, pmap);
 	return (TRUE);
 }
 
 /*
  * pmap_remove_pde: do the things to unmap a superpage in a process
  */
 static void
 pmap_remove_pde(pmap_t pmap, pd_entry_t *pdq, vm_offset_t sva,
     vm_page_t *free)
 {
 	struct md_page *pvh;
 	pd_entry_t oldpde;
 	vm_offset_t eva, va;
 	vm_page_t m, mpte;
 
 	PMAP_LOCK_ASSERT(pmap, MA_OWNED);
 	KASSERT((sva & PDRMASK) == 0,
 	    ("pmap_remove_pde: sva is not 4mpage aligned"));
 	oldpde = pte_load_clear(pdq);
 	if (oldpde & PG_W)
 		pmap->pm_stats.wired_count -= NBPDR / PAGE_SIZE;
 
 	/*
 	 * Machines that don't support invlpg, also don't support
 	 * PG_G.
 	 */
 	if (oldpde & PG_G)
 		pmap_invalidate_page(kernel_pmap, sva);
 	pmap->pm_stats.resident_count -= NBPDR / PAGE_SIZE;
 	if (oldpde & PG_MANAGED) {
 		pvh = pa_to_pvh(oldpde & PG_PS_FRAME);
 		pmap_pvh_free(pvh, pmap, sva);
 		eva = sva + NBPDR;
 		for (va = sva, m = PHYS_TO_VM_PAGE(oldpde & PG_PS_FRAME);
 		    va < eva; va += PAGE_SIZE, m++) {
 			if ((oldpde & (PG_M | PG_RW)) == (PG_M | PG_RW))
 				vm_page_dirty(m);
 			if (oldpde & PG_A)
 				vm_page_flag_set(m, PG_REFERENCED);
 			if (TAILQ_EMPTY(&m->md.pv_list) &&
 			    TAILQ_EMPTY(&pvh->pv_list))
 				vm_page_flag_clear(m, PG_WRITEABLE);
 		}
 	}
 	if (pmap == kernel_pmap) {
 		if (!pmap_demote_pde(pmap, pdq, sva))
 			panic("pmap_remove_pde: failed demotion");
 	} else {
 		mpte = pmap_lookup_pt_page(pmap, sva);
 		if (mpte != NULL) {
 			pmap_remove_pt_page(pmap, mpte);
 			pmap->pm_stats.resident_count--;
 			KASSERT(mpte->wire_count == NPTEPG,
 			    ("pmap_remove_pde: pte page wire count error"));
 			mpte->wire_count = 0;
 			pmap_add_delayed_free_list(mpte, free, FALSE);
 			atomic_subtract_int(&cnt.v_wire_count, 1);
 		}
 	}
 }
 
 /*
  * pmap_remove_pte: do the things to unmap a page in a process
  */
 static int
 pmap_remove_pte(pmap_t pmap, pt_entry_t *ptq, vm_offset_t va, vm_page_t *free)
 {
 	pt_entry_t oldpte;
 	vm_page_t m;
 
 	mtx_assert(&vm_page_queue_mtx, MA_OWNED);
 	PMAP_LOCK_ASSERT(pmap, MA_OWNED);
 	oldpte = pte_load_clear(ptq);
 	if (oldpte & PG_W)
 		pmap->pm_stats.wired_count -= 1;
 	/*
 	 * Machines that don't support invlpg, also don't support
 	 * PG_G.
 	 */
 	if (oldpte & PG_G)
 		pmap_invalidate_page(kernel_pmap, va);
 	pmap->pm_stats.resident_count -= 1;
 	if (oldpte & PG_MANAGED) {
 		m = PHYS_TO_VM_PAGE(oldpte & PG_FRAME);
 		if ((oldpte & (PG_M | PG_RW)) == (PG_M | PG_RW))
 			vm_page_dirty(m);
 		if (oldpte & PG_A)
 			vm_page_flag_set(m, PG_REFERENCED);
 		pmap_remove_entry(pmap, m, va);
 	}
 	return (pmap_unuse_pt(pmap, va, free));
 }
 
 /*
  * Remove a single page from a process address space
  */
 static void
 pmap_remove_page(pmap_t pmap, vm_offset_t va, vm_page_t *free)
 {
 	pt_entry_t *pte;
 
 	mtx_assert(&vm_page_queue_mtx, MA_OWNED);
 	KASSERT(curthread->td_pinned > 0, ("curthread not pinned"));
 	PMAP_LOCK_ASSERT(pmap, MA_OWNED);
 	if ((pte = pmap_pte_quick(pmap, va)) == NULL || *pte == 0)
 		return;
 	pmap_remove_pte(pmap, pte, va, free);
 	pmap_invalidate_page(pmap, va);
 }
 
 /*
  *	Remove the given range of addresses from the specified map.
  *
  *	It is assumed that the start and end are properly
  *	rounded to the page size.
  */
 void
 pmap_remove(pmap_t pmap, vm_offset_t sva, vm_offset_t eva)
 {
 	vm_offset_t pdnxt;
 	pd_entry_t ptpaddr;
 	pt_entry_t *pte;
 	vm_page_t free = NULL;
 	int anyvalid;
 
 	/*
 	 * Perform an unsynchronized read.  This is, however, safe.
 	 */
 	if (pmap->pm_stats.resident_count == 0)
 		return;
 
 	anyvalid = 0;
 
 	vm_page_lock_queues();
 	sched_pin();
 	PMAP_LOCK(pmap);
 
 	/*
 	 * special handling of removing one page.  a very
 	 * common operation and easy to short circuit some
 	 * code.
 	 */
 	if ((sva + PAGE_SIZE == eva) && 
 	    ((pmap->pm_pdir[(sva >> PDRSHIFT)] & PG_PS) == 0)) {
 		pmap_remove_page(pmap, sva, &free);
 		goto out;
 	}
 
 	for (; sva < eva; sva = pdnxt) {
 		unsigned pdirindex;
 
 		/*
 		 * Calculate index for next page table.
 		 */
 		pdnxt = (sva + NBPDR) & ~PDRMASK;
 		if (pdnxt < sva)
 			pdnxt = eva;
 		if (pmap->pm_stats.resident_count == 0)
 			break;
 
 		pdirindex = sva >> PDRSHIFT;
 		ptpaddr = pmap->pm_pdir[pdirindex];
 
 		/*
 		 * Weed out invalid mappings. Note: we assume that the page
 		 * directory table is always allocated, and in kernel virtual.
 		 */
 		if (ptpaddr == 0)
 			continue;
 
 		/*
 		 * Check for large page.
 		 */
 		if ((ptpaddr & PG_PS) != 0) {
 			/*
 			 * Are we removing the entire large page?  If not,
 			 * demote the mapping and fall through.
 			 */
 			if (sva + NBPDR == pdnxt && eva >= pdnxt) {
 				/*
 				 * The TLB entry for a PG_G mapping is
 				 * invalidated by pmap_remove_pde().
 				 */
 				if ((ptpaddr & PG_G) == 0)
 					anyvalid = 1;
 				pmap_remove_pde(pmap,
 				    &pmap->pm_pdir[pdirindex], sva, &free);
 				continue;
 			} else if (!pmap_demote_pde(pmap,
 			    &pmap->pm_pdir[pdirindex], sva)) {
 				/* The large page mapping was destroyed. */
 				continue;
 			}
 		}
 
 		/*
 		 * Limit our scan to either the end of the va represented
 		 * by the current page table page, or to the end of the
 		 * range being removed.
 		 */
 		if (pdnxt > eva)
 			pdnxt = eva;
 
 		for (pte = pmap_pte_quick(pmap, sva); sva != pdnxt; pte++,
 		    sva += PAGE_SIZE) {
 			if (*pte == 0)
 				continue;
 
 			/*
 			 * The TLB entry for a PG_G mapping is invalidated
 			 * by pmap_remove_pte().
 			 */
 			if ((*pte & PG_G) == 0)
 				anyvalid = 1;
 			if (pmap_remove_pte(pmap, pte, sva, &free))
 				break;
 		}
 	}
 out:
 	sched_unpin();
 	if (anyvalid)
 		pmap_invalidate_all(pmap);
 	vm_page_unlock_queues();
 	PMAP_UNLOCK(pmap);
 	pmap_free_zero_pages(free);
 }
 
 /*
  *	Routine:	pmap_remove_all
  *	Function:
  *		Removes this physical page from
  *		all physical maps in which it resides.
  *		Reflects back modify bits to the pager.
  *
  *	Notes:
  *		Original versions of this routine were very
  *		inefficient because they iteratively called
  *		pmap_remove (slow...)
  */
 
 void
 pmap_remove_all(vm_page_t m)
 {
 	struct md_page *pvh;
 	pv_entry_t pv;
 	pmap_t pmap;
 	pt_entry_t *pte, tpte;
 	pd_entry_t *pde;
 	vm_offset_t va;
 	vm_page_t free;
 
 	KASSERT((m->flags & PG_FICTITIOUS) == 0,
 	    ("pmap_remove_all: page %p is fictitious", m));
 	free = NULL;
 	vm_page_lock_queues();
 	sched_pin();
 	pvh = pa_to_pvh(VM_PAGE_TO_PHYS(m));
 	while ((pv = TAILQ_FIRST(&pvh->pv_list)) != NULL) {
 		va = pv->pv_va;
 		pmap = PV_PMAP(pv);
 		PMAP_LOCK(pmap);
 		pde = pmap_pde(pmap, va);
 		(void)pmap_demote_pde(pmap, pde, va);
 		PMAP_UNLOCK(pmap);
 	}
 	while ((pv = TAILQ_FIRST(&m->md.pv_list)) != NULL) {
 		pmap = PV_PMAP(pv);
 		PMAP_LOCK(pmap);
 		pmap->pm_stats.resident_count--;
 		pde = pmap_pde(pmap, pv->pv_va);
 		KASSERT((*pde & PG_PS) == 0, ("pmap_remove_all: found"
 		    " a 4mpage in page %p's pv list", m));
 		pte = pmap_pte_quick(pmap, pv->pv_va);
 		tpte = pte_load_clear(pte);
 		if (tpte & PG_W)
 			pmap->pm_stats.wired_count--;
 		if (tpte & PG_A)
 			vm_page_flag_set(m, PG_REFERENCED);
 
 		/*
 		 * Update the vm_page_t clean and reference bits.
 		 */
 		if ((tpte & (PG_M | PG_RW)) == (PG_M | PG_RW))
 			vm_page_dirty(m);
 		pmap_unuse_pt(pmap, pv->pv_va, &free);
 		pmap_invalidate_page(pmap, pv->pv_va);
 		TAILQ_REMOVE(&m->md.pv_list, pv, pv_list);
 		free_pv_entry(pmap, pv);
 		PMAP_UNLOCK(pmap);
 	}
 	vm_page_flag_clear(m, PG_WRITEABLE);
 	sched_unpin();
 	vm_page_unlock_queues();
 	pmap_free_zero_pages(free);
 }
 
 /*
  * pmap_protect_pde: do the things to protect a 4mpage in a process
  */
 static boolean_t
 pmap_protect_pde(pmap_t pmap, pd_entry_t *pde, vm_offset_t sva, vm_prot_t prot)
 {
 	pd_entry_t newpde, oldpde;
 	vm_offset_t eva, va;
 	vm_page_t m;
 	boolean_t anychanged;
 
 	PMAP_LOCK_ASSERT(pmap, MA_OWNED);
 	KASSERT((sva & PDRMASK) == 0,
 	    ("pmap_protect_pde: sva is not 4mpage aligned"));
 	anychanged = FALSE;
 retry:
 	oldpde = newpde = *pde;
 	if (oldpde & PG_MANAGED) {
 		eva = sva + NBPDR;
 		for (va = sva, m = PHYS_TO_VM_PAGE(oldpde & PG_PS_FRAME);
 		    va < eva; va += PAGE_SIZE, m++)
 			if ((oldpde & (PG_M | PG_RW)) == (PG_M | PG_RW))
 				vm_page_dirty(m);
 	}
 	if ((prot & VM_PROT_WRITE) == 0)
 		newpde &= ~(PG_RW | PG_M);
 #ifdef PAE
 	if ((prot & VM_PROT_EXECUTE) == 0)
 		newpde |= pg_nx;
 #endif
 	if (newpde != oldpde) {
 		if (!pde_cmpset(pde, oldpde, newpde))
 			goto retry;
 		if (oldpde & PG_G)
 			pmap_invalidate_page(pmap, sva);
 		else
 			anychanged = TRUE;
 	}
 	return (anychanged);
 }
 
 /*
  *	Set the physical protection on the
  *	specified range of this map as requested.
  */
 void
 pmap_protect(pmap_t pmap, vm_offset_t sva, vm_offset_t eva, vm_prot_t prot)
 {
 	vm_offset_t pdnxt;
 	pd_entry_t ptpaddr;
 	pt_entry_t *pte;
 	int anychanged;
 
 	if ((prot & VM_PROT_READ) == VM_PROT_NONE) {
 		pmap_remove(pmap, sva, eva);
 		return;
 	}
 
 #ifdef PAE
 	if ((prot & (VM_PROT_WRITE|VM_PROT_EXECUTE)) ==
 	    (VM_PROT_WRITE|VM_PROT_EXECUTE))
 		return;
 #else
 	if (prot & VM_PROT_WRITE)
 		return;
 #endif
 
 	anychanged = 0;
 
 	vm_page_lock_queues();
 	sched_pin();
 	PMAP_LOCK(pmap);
 	for (; sva < eva; sva = pdnxt) {
 		pt_entry_t obits, pbits;
 		unsigned pdirindex;
 
 		pdnxt = (sva + NBPDR) & ~PDRMASK;
 		if (pdnxt < sva)
 			pdnxt = eva;
 
 		pdirindex = sva >> PDRSHIFT;
 		ptpaddr = pmap->pm_pdir[pdirindex];
 
 		/*
 		 * Weed out invalid mappings.  Note: we assume that the page
 		 * directory is always allocated and in kernel virtual memory.
 		 */
 		if (ptpaddr == 0)
 			continue;
 
 		/*
 		 * Check for large page.
 		 */
 		if ((ptpaddr & PG_PS) != 0) {
 			/*
 			 * Are we protecting the entire large page?  If not,
 			 * demote the mapping and fall through.
 			 */
 			if (sva + NBPDR == pdnxt && eva >= pdnxt) {
 				/*
 				 * The TLB entry for a PG_G mapping is
 				 * invalidated by pmap_protect_pde().
 				 */
 				if (pmap_protect_pde(pmap,
 				    &pmap->pm_pdir[pdirindex], sva, prot))
 					anychanged = 1;
 				continue;
 			} else if (!pmap_demote_pde(pmap,
 			    &pmap->pm_pdir[pdirindex], sva)) {
 				/* The large page mapping was destroyed. */
 				continue;
 			}
 		}
 
 		if (pdnxt > eva)
 			pdnxt = eva;
 
 		for (pte = pmap_pte_quick(pmap, sva); sva != pdnxt; pte++,
 		    sva += PAGE_SIZE) {
 			vm_page_t m;
 
 retry:
 			/*
 			 * Regardless of whether a pte is 32 or 64 bits in
 			 * size, PG_RW, PG_A, and PG_M are among the least
 			 * significant 32 bits.
 			 */
 			obits = pbits = *pte;
 			if ((pbits & PG_V) == 0)
 				continue;
 
 			if ((prot & VM_PROT_WRITE) == 0) {
 				if ((pbits & (PG_MANAGED | PG_M | PG_RW)) ==
 				    (PG_MANAGED | PG_M | PG_RW)) {
 					m = PHYS_TO_VM_PAGE(pbits & PG_FRAME);
 					vm_page_dirty(m);
 				}
 				pbits &= ~(PG_RW | PG_M);
 			}
 #ifdef PAE
 			if ((prot & VM_PROT_EXECUTE) == 0)
 				pbits |= pg_nx;
 #endif
 
 			if (pbits != obits) {
 #ifdef PAE
 				if (!atomic_cmpset_64(pte, obits, pbits))
 					goto retry;
 #else
 				if (!atomic_cmpset_int((u_int *)pte, obits,
 				    pbits))
 					goto retry;
 #endif
 				if (obits & PG_G)
 					pmap_invalidate_page(pmap, sva);
 				else
 					anychanged = 1;
 			}
 		}
 	}
 	sched_unpin();
 	if (anychanged)
 		pmap_invalidate_all(pmap);
 	vm_page_unlock_queues();
 	PMAP_UNLOCK(pmap);
 }
 
 /*
  * Tries to promote the 512 or 1024 contiguous 4KB page mappings that are
  * within a single page table page (PTP) to a single 2- or 4MB page mapping.
  * For promotion to occur, two conditions must be met: (1) the 4KB page
  * mappings must map aligned, contiguous physical memory and (2) the 4KB page
  * mappings must have identical characteristics.
  *
  * Managed (PG_MANAGED) mappings within the kernel address space are not
  * promoted.  The reason is that kernel PDEs are replicated in each pmap but
  * pmap_clear_ptes() and pmap_ts_referenced() only read the PDE from the kernel
  * pmap.
  */
 static void
 pmap_promote_pde(pmap_t pmap, pd_entry_t *pde, vm_offset_t va)
 {
 	pd_entry_t newpde;
 	pt_entry_t *firstpte, oldpte, pa, *pte;
 	vm_offset_t oldpteva;
 	vm_page_t mpte;
 
 	PMAP_LOCK_ASSERT(pmap, MA_OWNED);
 
 	/*
 	 * Examine the first PTE in the specified PTP.  Abort if this PTE is
 	 * either invalid, unused, or does not map the first 4KB physical page
 	 * within a 2- or 4MB page.
 	 */
 	firstpte = pmap_pte_quick(pmap, trunc_4mpage(va));
 setpde:
 	newpde = *firstpte;
 	if ((newpde & ((PG_FRAME & PDRMASK) | PG_A | PG_V)) != (PG_A | PG_V)) {
 		pmap_pde_p_failures++;
 		CTR2(KTR_PMAP, "pmap_promote_pde: failure for va %#x"
 		    " in pmap %p", va, pmap);
 		return;
 	}
 	if ((*firstpte & PG_MANAGED) != 0 && pmap == kernel_pmap) {
 		pmap_pde_p_failures++;
 		CTR2(KTR_PMAP, "pmap_promote_pde: failure for va %#x"
 		    " in pmap %p", va, pmap);
 		return;
 	}
 	if ((newpde & (PG_M | PG_RW)) == PG_RW) {
 		/*
 		 * When PG_M is already clear, PG_RW can be cleared without
 		 * a TLB invalidation.
 		 */
 		if (!atomic_cmpset_int((u_int *)firstpte, newpde, newpde &
 		    ~PG_RW))  
 			goto setpde;
 		newpde &= ~PG_RW;
 	}
 
 	/* 
 	 * Examine each of the other PTEs in the specified PTP.  Abort if this
 	 * PTE maps an unexpected 4KB physical page or does not have identical
 	 * characteristics to the first PTE.
 	 */
 	pa = (newpde & (PG_PS_FRAME | PG_A | PG_V)) + NBPDR - PAGE_SIZE;
 	for (pte = firstpte + NPTEPG - 1; pte > firstpte; pte--) {
 setpte:
 		oldpte = *pte;
 		if ((oldpte & (PG_FRAME | PG_A | PG_V)) != pa) {
 			pmap_pde_p_failures++;
 			CTR2(KTR_PMAP, "pmap_promote_pde: failure for va %#x"
 			    " in pmap %p", va, pmap);
 			return;
 		}
 		if ((oldpte & (PG_M | PG_RW)) == PG_RW) {
 			/*
 			 * When PG_M is already clear, PG_RW can be cleared
 			 * without a TLB invalidation.
 			 */
 			if (!atomic_cmpset_int((u_int *)pte, oldpte,
 			    oldpte & ~PG_RW))
 				goto setpte;
 			oldpte &= ~PG_RW;
 			oldpteva = (oldpte & PG_FRAME & PDRMASK) |
 			    (va & ~PDRMASK);
 			CTR2(KTR_PMAP, "pmap_promote_pde: protect for va %#x"
 			    " in pmap %p", oldpteva, pmap);
 		}
 		if ((oldpte & PG_PTE_PROMOTE) != (newpde & PG_PTE_PROMOTE)) {
 			pmap_pde_p_failures++;
 			CTR2(KTR_PMAP, "pmap_promote_pde: failure for va %#x"
 			    " in pmap %p", va, pmap);
 			return;
 		}
 		pa -= PAGE_SIZE;
 	}
 
 	/*
 	 * Save the page table page in its current state until the PDE
 	 * mapping the superpage is demoted by pmap_demote_pde() or
 	 * destroyed by pmap_remove_pde(). 
 	 */
 	mpte = PHYS_TO_VM_PAGE(*pde & PG_FRAME);
 	KASSERT(mpte >= vm_page_array &&
 	    mpte < &vm_page_array[vm_page_array_size],
 	    ("pmap_promote_pde: page table page is out of range"));
 	KASSERT(mpte->pindex == va >> PDRSHIFT,
 	    ("pmap_promote_pde: page table page's pindex is wrong"));
 	pmap_insert_pt_page(pmap, mpte);
 
 	/*
 	 * Promote the pv entries.
 	 */
 	if ((newpde & PG_MANAGED) != 0)
 		pmap_pv_promote_pde(pmap, va, newpde & PG_PS_FRAME);
 
 	/*
 	 * Propagate the PAT index to its proper position.
 	 */
 	if ((newpde & PG_PTE_PAT) != 0)
 		newpde ^= PG_PDE_PAT | PG_PTE_PAT;
 
 	/*
 	 * Map the superpage.
 	 */
 	if (workaround_erratum383)
 		pmap_update_pde(pmap, va, pde, PG_PS | newpde);
 	else if (pmap == kernel_pmap)
 		pmap_kenter_pde(va, PG_PS | newpde);
 	else
 		pde_store(pde, PG_PS | newpde);
 
 	pmap_pde_promotions++;
 	CTR2(KTR_PMAP, "pmap_promote_pde: success for va %#x"
 	    " in pmap %p", va, pmap);
 }
 
 /*
  *	Insert the given physical page (p) at
  *	the specified virtual address (v) in the
  *	target physical map with the protection requested.
  *
  *	If specified, the page will be wired down, meaning
  *	that the related pte can not be reclaimed.
  *
  *	NB:  This is the only routine which MAY NOT lazy-evaluate
  *	or lose information.  That is, this routine must actually
  *	insert this page into the given map NOW.
  */
 void
 pmap_enter(pmap_t pmap, vm_offset_t va, vm_prot_t access, vm_page_t m,
     vm_prot_t prot, boolean_t wired)
 {
 	pd_entry_t *pde;
 	pt_entry_t *pte;
 	pt_entry_t newpte, origpte;
 	pv_entry_t pv;
 	vm_paddr_t opa, pa;
 	vm_page_t mpte, om;
 	boolean_t invlva;
 
 	va = trunc_page(va);
 	KASSERT(va <= VM_MAX_KERNEL_ADDRESS, ("pmap_enter: toobig"));
 	KASSERT(va < UPT_MIN_ADDRESS || va >= UPT_MAX_ADDRESS,
 	    ("pmap_enter: invalid to pmap_enter page table pages (va: 0x%x)",
 	    va));
 	KASSERT((m->flags & (PG_FICTITIOUS | PG_UNMANAGED)) != 0 ||
 	    (m->oflags & VPO_BUSY) != 0,
 	    ("pmap_enter: page %p is not busy", m));
 
 	mpte = NULL;
 
 	vm_page_lock_queues();
 	PMAP_LOCK(pmap);
 	sched_pin();
 
 	/*
 	 * In the case that a page table page is not
 	 * resident, we are creating it here.
 	 */
 	if (va < VM_MAXUSER_ADDRESS) {
 		mpte = pmap_allocpte(pmap, va, M_WAITOK);
 	}
 
 	pde = pmap_pde(pmap, va);
 	if ((*pde & PG_PS) != 0)
 		panic("pmap_enter: attempted pmap_enter on 4MB page");
 	pte = pmap_pte_quick(pmap, va);
 
 	/*
 	 * The page directory entry is not valid, so no page table page exists.
 	 */
 	if (pte == NULL) {
 		panic("pmap_enter: invalid page directory pdir=%#jx, va=%#x",
 			(uintmax_t)pmap->pm_pdir[PTDPTDI], va);
 	}
 
 	pa = VM_PAGE_TO_PHYS(m);
 	om = NULL;
 	origpte = *pte;
 	opa = origpte & PG_FRAME;
 
 	/*
 	 * Mapping has not changed, must be protection or wiring change.
 	 */
 	if (origpte && (opa == pa)) {
 		/*
 		 * Wiring change, just update stats. We don't worry about
 		 * wiring PT pages as they remain resident as long as there
 		 * are valid mappings in them. Hence, if a user page is wired,
 		 * the PT page will be also.
 		 */
 		if (wired && ((origpte & PG_W) == 0))
 			pmap->pm_stats.wired_count++;
 		else if (!wired && (origpte & PG_W))
 			pmap->pm_stats.wired_count--;
 
 		/*
 		 * Remove extra pte reference
 		 */
 		if (mpte)
 			mpte->wire_count--;
 
 		if (origpte & PG_MANAGED) {
 			om = m;
 			pa |= PG_MANAGED;
 		}
 		goto validate;
 	} 
 
 	pv = NULL;
 
 	/*
 	 * Mapping has changed, invalidate old range and fall through to
 	 * handle validating new mapping.
 	 */
 	if (opa) {
 		if (origpte & PG_W)
 			pmap->pm_stats.wired_count--;
 		if (origpte & PG_MANAGED) {
 			om = PHYS_TO_VM_PAGE(opa);
 			pv = pmap_pvh_remove(&om->md, pmap, va);
 		}
 		if (mpte != NULL) {
 			mpte->wire_count--;
 			KASSERT(mpte->wire_count > 0,
 			    ("pmap_enter: missing reference to page table page,"
 			     " va: 0x%x", va));
 		}
 	} else
 		pmap->pm_stats.resident_count++;
 
 	/*
 	 * Enter on the PV list if part of our managed memory.
 	 */
 	if ((m->flags & (PG_FICTITIOUS | PG_UNMANAGED)) == 0) {
 		KASSERT(va < kmi.clean_sva || va >= kmi.clean_eva,
 		    ("pmap_enter: managed mapping within the clean submap"));
 		if (pv == NULL)
 			pv = get_pv_entry(pmap, FALSE);
 		pv->pv_va = va;
 		TAILQ_INSERT_TAIL(&m->md.pv_list, pv, pv_list);
 		pa |= PG_MANAGED;
 	} else if (pv != NULL)
 		free_pv_entry(pmap, pv);
 
 	/*
 	 * Increment counters
 	 */
 	if (wired)
 		pmap->pm_stats.wired_count++;
 
 validate:
 	/*
 	 * Now validate mapping with desired protection/wiring.
 	 */
 	newpte = (pt_entry_t)(pa | pmap_cache_bits(m->md.pat_mode, 0) | PG_V);
 	if ((prot & VM_PROT_WRITE) != 0) {
 		newpte |= PG_RW;
 		if ((newpte & PG_MANAGED) != 0)
 			vm_page_flag_set(m, PG_WRITEABLE);
 	}
 #ifdef PAE
 	if ((prot & VM_PROT_EXECUTE) == 0)
 		newpte |= pg_nx;
 #endif
 	if (wired)
 		newpte |= PG_W;
 	if (va < VM_MAXUSER_ADDRESS)
 		newpte |= PG_U;
 	if (pmap == kernel_pmap)
 		newpte |= pgeflag;
 
 	/*
 	 * If the mapping or permission bits are different, we need
 	 * to update the pte.
 	 */
 	if ((origpte & ~(PG_M|PG_A)) != newpte) {
 		newpte |= PG_A;
 		if ((access & VM_PROT_WRITE) != 0)
 			newpte |= PG_M;
 		if (origpte & PG_V) {
 			invlva = FALSE;
 			origpte = pte_load_store(pte, newpte);
 			if (origpte & PG_A) {
 				if (origpte & PG_MANAGED)
 					vm_page_flag_set(om, PG_REFERENCED);
 				if (opa != VM_PAGE_TO_PHYS(m))
 					invlva = TRUE;
 #ifdef PAE
 				if ((origpte & PG_NX) == 0 &&
 				    (newpte & PG_NX) != 0)
 					invlva = TRUE;
 #endif
 			}
 			if ((origpte & (PG_M | PG_RW)) == (PG_M | PG_RW)) {
 				if ((origpte & PG_MANAGED) != 0)
 					vm_page_dirty(om);
 				if ((prot & VM_PROT_WRITE) == 0)
 					invlva = TRUE;
 			}
 			if ((origpte & PG_MANAGED) != 0 &&
 			    TAILQ_EMPTY(&om->md.pv_list) &&
 			    TAILQ_EMPTY(&pa_to_pvh(opa)->pv_list))
 				vm_page_flag_clear(om, PG_WRITEABLE);
 			if (invlva)
 				pmap_invalidate_page(pmap, va);
 		} else
 			pte_store(pte, newpte);
 	}
 
 	/*
 	 * If both the page table page and the reservation are fully
 	 * populated, then attempt promotion.
 	 */
 	if ((mpte == NULL || mpte->wire_count == NPTEPG) &&
 	    pg_ps_enabled && vm_reserv_level_iffullpop(m) == 0)
 		pmap_promote_pde(pmap, pde, va);
 
 	sched_unpin();
 	vm_page_unlock_queues();
 	PMAP_UNLOCK(pmap);
 }
 
 /*
  * Tries to create a 2- or 4MB page mapping.  Returns TRUE if successful and
  * FALSE otherwise.  Fails if (1) a page table page cannot be allocated without
  * blocking, (2) a mapping already exists at the specified virtual address, or
  * (3) a pv entry cannot be allocated without reclaiming another pv entry. 
  */
 static boolean_t
 pmap_enter_pde(pmap_t pmap, vm_offset_t va, vm_page_t m, vm_prot_t prot)
 {
 	pd_entry_t *pde, newpde;
 
 	mtx_assert(&vm_page_queue_mtx, MA_OWNED);
 	PMAP_LOCK_ASSERT(pmap, MA_OWNED);
 	pde = pmap_pde(pmap, va);
 	if (*pde != 0) {
 		CTR2(KTR_PMAP, "pmap_enter_pde: failure for va %#lx"
 		    " in pmap %p", va, pmap);
 		return (FALSE);
 	}
 	newpde = VM_PAGE_TO_PHYS(m) | pmap_cache_bits(m->md.pat_mode, 1) |
 	    PG_PS | PG_V;
 	if ((m->flags & (PG_FICTITIOUS | PG_UNMANAGED)) == 0) {
 		newpde |= PG_MANAGED;
 
 		/*
 		 * Abort this mapping if its PV entry could not be created.
 		 */
 		if (!pmap_pv_insert_pde(pmap, va, VM_PAGE_TO_PHYS(m))) {
 			CTR2(KTR_PMAP, "pmap_enter_pde: failure for va %#lx"
 			    " in pmap %p", va, pmap);
 			return (FALSE);
 		}
 	}
 #ifdef PAE
 	if ((prot & VM_PROT_EXECUTE) == 0)
 		newpde |= pg_nx;
 #endif
 	if (va < VM_MAXUSER_ADDRESS)
 		newpde |= PG_U;
 
 	/*
 	 * Increment counters.
 	 */
 	pmap->pm_stats.resident_count += NBPDR / PAGE_SIZE;
 
 	/*
 	 * Map the superpage.
 	 */
 	pde_store(pde, newpde);
 
 	pmap_pde_mappings++;
 	CTR2(KTR_PMAP, "pmap_enter_pde: success for va %#lx"
 	    " in pmap %p", va, pmap);
 	return (TRUE);
 }
 
 /*
  * Maps a sequence of resident pages belonging to the same object.
  * The sequence begins with the given page m_start.  This page is
  * mapped at the given virtual address start.  Each subsequent page is
  * mapped at a virtual address that is offset from start by the same
  * amount as the page is offset from m_start within the object.  The
  * last page in the sequence is the page with the largest offset from
  * m_start that can be mapped at a virtual address less than the given
  * virtual address end.  Not every virtual page between start and end
  * is mapped; only those for which a resident page exists with the
  * corresponding offset from m_start are mapped.
  */
 void
 pmap_enter_object(pmap_t pmap, vm_offset_t start, vm_offset_t end,
     vm_page_t m_start, vm_prot_t prot)
 {
 	vm_offset_t va;
 	vm_page_t m, mpte;
 	vm_pindex_t diff, psize;
 
 	VM_OBJECT_LOCK_ASSERT(m_start->object, MA_OWNED);
 	psize = atop(end - start);
 	mpte = NULL;
 	m = m_start;
 	vm_page_lock_queues();
 	PMAP_LOCK(pmap);
 	while (m != NULL && (diff = m->pindex - m_start->pindex) < psize) {
 		va = start + ptoa(diff);
 		if ((va & PDRMASK) == 0 && va + NBPDR <= end &&
 		    (VM_PAGE_TO_PHYS(m) & PDRMASK) == 0 &&
 		    pg_ps_enabled && vm_reserv_level_iffullpop(m) == 0 &&
 		    pmap_enter_pde(pmap, va, m, prot))
 			m = &m[NBPDR / PAGE_SIZE - 1];
 		else
 			mpte = pmap_enter_quick_locked(pmap, va, m, prot,
 			    mpte);
 		m = TAILQ_NEXT(m, listq);
 	}
 	vm_page_unlock_queues();
  	PMAP_UNLOCK(pmap);
 }
 
 /*
  * This code makes some *MAJOR* assumptions:
  * 1. The pmap exists and is the current pmap.
  * 2. Not wired.
  * 3. Read access.
  * 4. No page table pages.
  * but is *MUCH* faster than pmap_enter...
  */
 
 void
 pmap_enter_quick(pmap_t pmap, vm_offset_t va, vm_page_t m, vm_prot_t prot)
 {
 
 	vm_page_lock_queues();
 	PMAP_LOCK(pmap);
 	(void)pmap_enter_quick_locked(pmap, va, m, prot, NULL);
 	vm_page_unlock_queues();
 	PMAP_UNLOCK(pmap);
 }
 
 static vm_page_t
 pmap_enter_quick_locked(pmap_t pmap, vm_offset_t va, vm_page_t m,
     vm_prot_t prot, vm_page_t mpte)
 {
 	pt_entry_t *pte;
 	vm_paddr_t pa;
 	vm_page_t free;
 
 	KASSERT(va < kmi.clean_sva || va >= kmi.clean_eva ||
 	    (m->flags & (PG_FICTITIOUS | PG_UNMANAGED)) != 0,
 	    ("pmap_enter_quick_locked: managed mapping within the clean submap"));
 	mtx_assert(&vm_page_queue_mtx, MA_OWNED);
 	PMAP_LOCK_ASSERT(pmap, MA_OWNED);
 
 	/*
 	 * In the case that a page table page is not
 	 * resident, we are creating it here.
 	 */
 	if (va < VM_MAXUSER_ADDRESS) {
 		unsigned ptepindex;
 		pd_entry_t ptepa;
 
 		/*
 		 * Calculate pagetable page index
 		 */
 		ptepindex = va >> PDRSHIFT;
 		if (mpte && (mpte->pindex == ptepindex)) {
 			mpte->wire_count++;
 		} else {
 			/*
 			 * Get the page directory entry
 			 */
 			ptepa = pmap->pm_pdir[ptepindex];
 
 			/*
 			 * If the page table page is mapped, we just increment
 			 * its wire count.
 			 */
 			if (ptepa) {
 				if (ptepa & PG_PS)
 					return (NULL);
 				mpte = PHYS_TO_VM_PAGE(ptepa & PG_FRAME);
 				mpte->wire_count++;
 			} else {
 				mpte = _pmap_allocpte(pmap, ptepindex,
 				    M_NOWAIT);
 				if (mpte == NULL)
 					return (mpte);
 			}
 		}
 	} else {
 		mpte = NULL;
 	}
 
 	/*
 	 * This call to vtopte makes the assumption that we are
 	 * entering the page into the current pmap.  In order to support
 	 * quick entry into any pmap, one would likely use pmap_pte_quick.
 	 * But that isn't as quick as vtopte.
 	 */
 	pte = vtopte(va);
 	if (*pte) {
 		if (mpte != NULL) {
 			mpte->wire_count--;
 			mpte = NULL;
 		}
 		return (mpte);
 	}
 
 	/*
 	 * Enter on the PV list if part of our managed memory.
 	 */
 	if ((m->flags & (PG_FICTITIOUS | PG_UNMANAGED)) == 0 &&
 	    !pmap_try_insert_pv_entry(pmap, va, m)) {
 		if (mpte != NULL) {
 			free = NULL;
 			if (pmap_unwire_pte_hold(pmap, mpte, &free)) {
 				pmap_invalidate_page(pmap, va);
 				pmap_free_zero_pages(free);
 			}
 			
 			mpte = NULL;
 		}
 		return (mpte);
 	}
 
 	/*
 	 * Increment counters
 	 */
 	pmap->pm_stats.resident_count++;
 
 	pa = VM_PAGE_TO_PHYS(m) | pmap_cache_bits(m->md.pat_mode, 0);
 #ifdef PAE
 	if ((prot & VM_PROT_EXECUTE) == 0)
 		pa |= pg_nx;
 #endif
 
 	/*
 	 * Now validate mapping with RO protection
 	 */
 	if (m->flags & (PG_FICTITIOUS|PG_UNMANAGED))
 		pte_store(pte, pa | PG_V | PG_U);
 	else
 		pte_store(pte, pa | PG_V | PG_U | PG_MANAGED);
 	return (mpte);
 }
 
 /*
  * Make a temporary mapping for a physical address.  This is only intended
  * to be used for panic dumps.
  */
 void *
 pmap_kenter_temporary(vm_paddr_t pa, int i)
 {
 	vm_offset_t va;
 
 	va = (vm_offset_t)crashdumpmap + (i * PAGE_SIZE);
 	pmap_kenter(va, pa);
 	invlpg(va);
 	return ((void *)crashdumpmap);
 }
 
 /*
  * This code maps large physical mmap regions into the
  * processor address space.  Note that some shortcuts
  * are taken, but the code works.
  */
 void
 pmap_object_init_pt(pmap_t pmap, vm_offset_t addr, vm_object_t object,
     vm_pindex_t pindex, vm_size_t size)
 {
 	pd_entry_t *pde;
 	vm_paddr_t pa, ptepa;
 	vm_page_t p;
 	int pat_mode;
 
 	VM_OBJECT_LOCK_ASSERT(object, MA_OWNED);
 	KASSERT(object->type == OBJT_DEVICE || object->type == OBJT_SG,
 	    ("pmap_object_init_pt: non-device object"));
 	if (pseflag && 
 	    (addr & (NBPDR - 1)) == 0 && (size & (NBPDR - 1)) == 0) {
 		if (!vm_object_populate(object, pindex, pindex + atop(size)))
 			return;
 		p = vm_page_lookup(object, pindex);
 		KASSERT(p->valid == VM_PAGE_BITS_ALL,
 		    ("pmap_object_init_pt: invalid page %p", p));
 		pat_mode = p->md.pat_mode;
 
 		/*
 		 * Abort the mapping if the first page is not physically
 		 * aligned to a 2/4MB page boundary.
 		 */
 		ptepa = VM_PAGE_TO_PHYS(p);
 		if (ptepa & (NBPDR - 1))
 			return;
 
 		/*
 		 * Skip the first page.  Abort the mapping if the rest of
 		 * the pages are not physically contiguous or have differing
 		 * memory attributes.
 		 */
 		p = TAILQ_NEXT(p, listq);
 		for (pa = ptepa + PAGE_SIZE; pa < ptepa + size;
 		    pa += PAGE_SIZE) {
 			KASSERT(p->valid == VM_PAGE_BITS_ALL,
 			    ("pmap_object_init_pt: invalid page %p", p));
 			if (pa != VM_PAGE_TO_PHYS(p) ||
 			    pat_mode != p->md.pat_mode)
 				return;
 			p = TAILQ_NEXT(p, listq);
 		}
 
 		/*
 		 * Map using 2/4MB pages.  Since "ptepa" is 2/4M aligned and
 		 * "size" is a multiple of 2/4M, adding the PAT setting to
 		 * "pa" will not affect the termination of this loop.
 		 */
 		PMAP_LOCK(pmap);
 		for (pa = ptepa | pmap_cache_bits(pat_mode, 1); pa < ptepa +
 		    size; pa += NBPDR) {
 			pde = pmap_pde(pmap, addr);
 			if (*pde == 0) {
 				pde_store(pde, pa | PG_PS | PG_M | PG_A |
 				    PG_U | PG_RW | PG_V);
 				pmap->pm_stats.resident_count += NBPDR /
 				    PAGE_SIZE;
 				pmap_pde_mappings++;
 			}
 			/* Else continue on if the PDE is already valid. */
 			addr += NBPDR;
 		}
 		PMAP_UNLOCK(pmap);
 	}
 }
 
 /*
  *	Routine:	pmap_change_wiring
  *	Function:	Change the wiring attribute for a map/virtual-address
  *			pair.
  *	In/out conditions:
  *			The mapping must already exist in the pmap.
  */
 void
 pmap_change_wiring(pmap_t pmap, vm_offset_t va, boolean_t wired)
 {
 	pd_entry_t *pde;
 	pt_entry_t *pte;
 	boolean_t are_queues_locked;
 
 	are_queues_locked = FALSE;
 retry:
 	PMAP_LOCK(pmap);
 	pde = pmap_pde(pmap, va);
 	if ((*pde & PG_PS) != 0) {
 		if (!wired != ((*pde & PG_W) == 0)) {
 			if (!are_queues_locked) {
 				are_queues_locked = TRUE;
 				if (!mtx_trylock(&vm_page_queue_mtx)) {
 					PMAP_UNLOCK(pmap);
 					vm_page_lock_queues();
 					goto retry;
 				}
 			}
 			if (!pmap_demote_pde(pmap, pde, va))
 				panic("pmap_change_wiring: demotion failed");
 		} else
 			goto out;
 	}
 	pte = pmap_pte(pmap, va);
 
 	if (wired && !pmap_pte_w(pte))
 		pmap->pm_stats.wired_count++;
 	else if (!wired && pmap_pte_w(pte))
 		pmap->pm_stats.wired_count--;
 
 	/*
 	 * Wiring is not a hardware characteristic so there is no need to
 	 * invalidate TLB.
 	 */
 	pmap_pte_set_w(pte, wired);
 	pmap_pte_release(pte);
 out:
 	if (are_queues_locked)
 		vm_page_unlock_queues();
 	PMAP_UNLOCK(pmap);
 }
 
 
 
 /*
  *	Copy the range specified by src_addr/len
  *	from the source map to the range dst_addr/len
  *	in the destination map.
  *
  *	This routine is only advisory and need not do anything.
  */
 
 void
 pmap_copy(pmap_t dst_pmap, pmap_t src_pmap, vm_offset_t dst_addr, vm_size_t len,
     vm_offset_t src_addr)
 {
 	vm_page_t   free;
 	vm_offset_t addr;
 	vm_offset_t end_addr = src_addr + len;
 	vm_offset_t pdnxt;
 
 	if (dst_addr != src_addr)
 		return;
 
 	if (!pmap_is_current(src_pmap))
 		return;
 
 	vm_page_lock_queues();
 	if (dst_pmap < src_pmap) {
 		PMAP_LOCK(dst_pmap);
 		PMAP_LOCK(src_pmap);
 	} else {
 		PMAP_LOCK(src_pmap);
 		PMAP_LOCK(dst_pmap);
 	}
 	sched_pin();
 	for (addr = src_addr; addr < end_addr; addr = pdnxt) {
 		pt_entry_t *src_pte, *dst_pte;
 		vm_page_t dstmpte, srcmpte;
 		pd_entry_t srcptepaddr;
 		unsigned ptepindex;
 
 		KASSERT(addr < UPT_MIN_ADDRESS,
 		    ("pmap_copy: invalid to pmap_copy page tables"));
 
 		pdnxt = (addr + NBPDR) & ~PDRMASK;
 		if (pdnxt < addr)
 			pdnxt = end_addr;
 		ptepindex = addr >> PDRSHIFT;
 
 		srcptepaddr = src_pmap->pm_pdir[ptepindex];
 		if (srcptepaddr == 0)
 			continue;
 			
 		if (srcptepaddr & PG_PS) {
 			if (dst_pmap->pm_pdir[ptepindex] == 0 &&
 			    ((srcptepaddr & PG_MANAGED) == 0 ||
 			    pmap_pv_insert_pde(dst_pmap, addr, srcptepaddr &
 			    PG_PS_FRAME))) {
 				dst_pmap->pm_pdir[ptepindex] = srcptepaddr &
 				    ~PG_W;
 				dst_pmap->pm_stats.resident_count +=
 				    NBPDR / PAGE_SIZE;
 			}
 			continue;
 		}
 
 		srcmpte = PHYS_TO_VM_PAGE(srcptepaddr & PG_FRAME);
 		KASSERT(srcmpte->wire_count > 0,
 		    ("pmap_copy: source page table page is unused"));
 
 		if (pdnxt > end_addr)
 			pdnxt = end_addr;
 
 		src_pte = vtopte(addr);
 		while (addr < pdnxt) {
 			pt_entry_t ptetemp;
 			ptetemp = *src_pte;
 			/*
 			 * We only virtual copy managed pages.
 			 */
 			if ((ptetemp & PG_MANAGED) != 0) {
 				dstmpte = pmap_allocpte(dst_pmap, addr,
 				    M_NOWAIT);
 				if (dstmpte == NULL)
 					goto out;
 				dst_pte = pmap_pte_quick(dst_pmap, addr);
 				if (*dst_pte == 0 &&
 				    pmap_try_insert_pv_entry(dst_pmap, addr,
 				    PHYS_TO_VM_PAGE(ptetemp & PG_FRAME))) {
 					/*
 					 * Clear the wired, modified, and
 					 * accessed (referenced) bits
 					 * during the copy.
 					 */
 					*dst_pte = ptetemp & ~(PG_W | PG_M |
 					    PG_A);
 					dst_pmap->pm_stats.resident_count++;
 	 			} else {
 					free = NULL;
 					if (pmap_unwire_pte_hold(dst_pmap,
 					    dstmpte, &free)) {
 						pmap_invalidate_page(dst_pmap,
 						    addr);
 						pmap_free_zero_pages(free);
 					}
 					goto out;
 				}
 				if (dstmpte->wire_count >= srcmpte->wire_count)
 					break;
 			}
 			addr += PAGE_SIZE;
 			src_pte++;
 		}
 	}
 out:
 	sched_unpin();
 	vm_page_unlock_queues();
 	PMAP_UNLOCK(src_pmap);
 	PMAP_UNLOCK(dst_pmap);
 }	
 
 static __inline void
 pagezero(void *page)
 {
 #if defined(I686_CPU)
 	if (cpu_class == CPUCLASS_686) {
 #if defined(CPU_ENABLE_SSE)
 		if (cpu_feature & CPUID_SSE2)
 			sse2_pagezero(page);
 		else
 #endif
 			i686_pagezero(page);
 	} else
 #endif
 		bzero(page, PAGE_SIZE);
 }
 
 /*
  *	pmap_zero_page zeros the specified hardware page by mapping 
  *	the page into KVM and using bzero to clear its contents.
  */
 void
 pmap_zero_page(vm_page_t m)
 {
 	struct sysmaps *sysmaps;
 
 	sysmaps = &sysmaps_pcpu[PCPU_GET(cpuid)];
 	mtx_lock(&sysmaps->lock);
 	if (*sysmaps->CMAP2)
 		panic("pmap_zero_page: CMAP2 busy");
 	sched_pin();
 	*sysmaps->CMAP2 = PG_V | PG_RW | VM_PAGE_TO_PHYS(m) | PG_A | PG_M |
 	    pmap_cache_bits(m->md.pat_mode, 0);
 	invlcaddr(sysmaps->CADDR2);
 	pagezero(sysmaps->CADDR2);
 	*sysmaps->CMAP2 = 0;
 	sched_unpin();
 	mtx_unlock(&sysmaps->lock);
 }
 
 /*
  *	pmap_zero_page_area zeros the specified area of a hardware page by
  *	mapping the page into KVM and using bzero to clear its contents.
  *
  *	off and size may not cover an area beyond a single hardware page.
  */
 void
 pmap_zero_page_area(vm_page_t m, int off, int size)
 {
 	struct sysmaps *sysmaps;
 
 	sysmaps = &sysmaps_pcpu[PCPU_GET(cpuid)];
 	mtx_lock(&sysmaps->lock);
 	if (*sysmaps->CMAP2)
 		panic("pmap_zero_page_area: CMAP2 busy");
 	sched_pin();
 	*sysmaps->CMAP2 = PG_V | PG_RW | VM_PAGE_TO_PHYS(m) | PG_A | PG_M |
 	    pmap_cache_bits(m->md.pat_mode, 0);
 	invlcaddr(sysmaps->CADDR2);
 	if (off == 0 && size == PAGE_SIZE) 
 		pagezero(sysmaps->CADDR2);
 	else
 		bzero((char *)sysmaps->CADDR2 + off, size);
 	*sysmaps->CMAP2 = 0;
 	sched_unpin();
 	mtx_unlock(&sysmaps->lock);
 }
 
 /*
  *	pmap_zero_page_idle zeros the specified hardware page by mapping 
  *	the page into KVM and using bzero to clear its contents.  This
  *	is intended to be called from the vm_pagezero process only and
  *	outside of Giant.
  */
 void
 pmap_zero_page_idle(vm_page_t m)
 {
 
 	if (*CMAP3)
 		panic("pmap_zero_page_idle: CMAP3 busy");
 	sched_pin();
 	*CMAP3 = PG_V | PG_RW | VM_PAGE_TO_PHYS(m) | PG_A | PG_M |
 	    pmap_cache_bits(m->md.pat_mode, 0);
 	invlcaddr(CADDR3);
 	pagezero(CADDR3);
 	*CMAP3 = 0;
 	sched_unpin();
 }
 
 /*
  *	pmap_copy_page copies the specified (machine independent)
  *	page by mapping the page into virtual memory and using
  *	bcopy to copy the page, one machine dependent page at a
  *	time.
  */
 void
 pmap_copy_page(vm_page_t src, vm_page_t dst)
 {
 	struct sysmaps *sysmaps;
 
 	sysmaps = &sysmaps_pcpu[PCPU_GET(cpuid)];
 	mtx_lock(&sysmaps->lock);
 	if (*sysmaps->CMAP1)
 		panic("pmap_copy_page: CMAP1 busy");
 	if (*sysmaps->CMAP2)
 		panic("pmap_copy_page: CMAP2 busy");
 	sched_pin();
 	invlpg((u_int)sysmaps->CADDR1);
 	invlpg((u_int)sysmaps->CADDR2);
 	*sysmaps->CMAP1 = PG_V | VM_PAGE_TO_PHYS(src) | PG_A |
 	    pmap_cache_bits(src->md.pat_mode, 0);
 	*sysmaps->CMAP2 = PG_V | PG_RW | VM_PAGE_TO_PHYS(dst) | PG_A | PG_M |
 	    pmap_cache_bits(dst->md.pat_mode, 0);
 	bcopy(sysmaps->CADDR1, sysmaps->CADDR2, PAGE_SIZE);
 	*sysmaps->CMAP1 = 0;
 	*sysmaps->CMAP2 = 0;
 	sched_unpin();
 	mtx_unlock(&sysmaps->lock);
 }
 
 /*
  * Returns true if the pmap's pv is one of the first
  * 16 pvs linked to from this page.  This count may
  * be changed upwards or downwards in the future; it
  * is only necessary that true be returned for a small
  * subset of pmaps for proper page aging.
  */
 boolean_t
 pmap_page_exists_quick(pmap_t pmap, vm_page_t m)
 {
 	struct md_page *pvh;
 	pv_entry_t pv;
 	int loops = 0;
 	boolean_t rv;
 
 	KASSERT((m->flags & (PG_FICTITIOUS | PG_UNMANAGED)) == 0,
 	    ("pmap_page_exists_quick: page %p is not managed", m));
 	rv = FALSE;
 	vm_page_lock_queues();
 	TAILQ_FOREACH(pv, &m->md.pv_list, pv_list) {
 		if (PV_PMAP(pv) == pmap) {
 			rv = TRUE;
 			break;
 		}
 		loops++;
 		if (loops >= 16)
 			break;
 	}
 	if (!rv && loops < 16) {
 		pvh = pa_to_pvh(VM_PAGE_TO_PHYS(m));
 		TAILQ_FOREACH(pv, &pvh->pv_list, pv_list) {
 			if (PV_PMAP(pv) == pmap) {
 				rv = TRUE;
 				break;
 			}
 			loops++;
 			if (loops >= 16)
 				break;
 		}
 	}
 	vm_page_unlock_queues();
 	return (rv);
 }
 
 /*
  *	pmap_page_wired_mappings:
  *
  *	Return the number of managed mappings to the given physical page
  *	that are wired.
  */
 int
 pmap_page_wired_mappings(vm_page_t m)
 {
 	int count;
 
 	count = 0;
 	if ((m->flags & PG_FICTITIOUS) != 0)
 		return (count);
 	vm_page_lock_queues();
 	count = pmap_pvh_wired_mappings(&m->md, count);
 	count = pmap_pvh_wired_mappings(pa_to_pvh(VM_PAGE_TO_PHYS(m)), count);
 	vm_page_unlock_queues();
 	return (count);
 }
 
 /*
  *	pmap_pvh_wired_mappings:
  *
  *	Return the updated number "count" of managed mappings that are wired.
  */
 static int
 pmap_pvh_wired_mappings(struct md_page *pvh, int count)
 {
 	pmap_t pmap;
 	pt_entry_t *pte;
 	pv_entry_t pv;
 
 	mtx_assert(&vm_page_queue_mtx, MA_OWNED);
 	sched_pin();
 	TAILQ_FOREACH(pv, &pvh->pv_list, pv_list) {
 		pmap = PV_PMAP(pv);
 		PMAP_LOCK(pmap);
 		pte = pmap_pte_quick(pmap, pv->pv_va);
 		if ((*pte & PG_W) != 0)
 			count++;
 		PMAP_UNLOCK(pmap);
 	}
 	sched_unpin();
 	return (count);
 }
 
 /*
  * Returns TRUE if the given page is mapped individually or as part of
  * a 4mpage.  Otherwise, returns FALSE.
  */
 boolean_t
 pmap_page_is_mapped(vm_page_t m)
 {
 	boolean_t rv;
 
 	if ((m->flags & (PG_FICTITIOUS | PG_UNMANAGED)) != 0)
 		return (FALSE);
 	vm_page_lock_queues();
 	rv = !TAILQ_EMPTY(&m->md.pv_list) ||
 	    !TAILQ_EMPTY(&pa_to_pvh(VM_PAGE_TO_PHYS(m))->pv_list);
 	vm_page_unlock_queues();
 	return (rv);
 }
 
 /*
  * Remove all pages from the specified address space;
  * this aids process exit speeds.  Also, this code
  * is special-cased for the current process only, but
  * can have the more generic (and slightly slower)
  * mode enabled.  This is much faster than pmap_remove
  * in the case of running down an entire address space.
  */
 void
 pmap_remove_pages(pmap_t pmap)
 {
 	pt_entry_t *pte, tpte;
 	vm_page_t free = NULL;
 	vm_page_t m, mpte, mt;
 	pv_entry_t pv;
 	struct md_page *pvh;
 	struct pv_chunk *pc, *npc;
 	int field, idx;
 	int32_t bit;
 	uint32_t inuse, bitmask;
 	int allfree;
 
 	if (pmap != PCPU_GET(curpmap)) {
 		printf("warning: pmap_remove_pages called with non-current pmap\n");
 		return;
 	}
 	vm_page_lock_queues();
 	PMAP_LOCK(pmap);
 	sched_pin();
 	TAILQ_FOREACH_SAFE(pc, &pmap->pm_pvchunk, pc_list, npc) {
 		allfree = 1;
 		for (field = 0; field < _NPCM; field++) {
 			inuse = (~(pc->pc_map[field])) & pc_freemask[field];
 			while (inuse != 0) {
 				bit = bsfl(inuse);
 				bitmask = 1UL << bit;
 				idx = field * 32 + bit;
 				pv = &pc->pc_pventry[idx];
 				inuse &= ~bitmask;
 
 				pte = pmap_pde(pmap, pv->pv_va);
 				tpte = *pte;
 				if ((tpte & PG_PS) == 0) {
 					pte = vtopte(pv->pv_va);
 					tpte = *pte & ~PG_PTE_PAT;
 				}
 
 				if (tpte == 0) {
 					printf(
 					    "TPTE at %p  IS ZERO @ VA %08x\n",
 					    pte, pv->pv_va);
 					panic("bad pte");
 				}
 
 /*
  * We cannot remove wired pages from a process' mapping at this time
  */
 				if (tpte & PG_W) {
 					allfree = 0;
 					continue;
 				}
 
 				m = PHYS_TO_VM_PAGE(tpte & PG_FRAME);
 				KASSERT(m->phys_addr == (tpte & PG_FRAME),
 				    ("vm_page_t %p phys_addr mismatch %016jx %016jx",
 				    m, (uintmax_t)m->phys_addr,
 				    (uintmax_t)tpte));
 
 				KASSERT(m < &vm_page_array[vm_page_array_size],
 					("pmap_remove_pages: bad tpte %#jx",
 					(uintmax_t)tpte));
 
 				pte_clear(pte);
 
 				/*
 				 * Update the vm_page_t clean/reference bits.
 				 */
 				if ((tpte & (PG_M | PG_RW)) == (PG_M | PG_RW)) {
 					if ((tpte & PG_PS) != 0) {
 						for (mt = m; mt < &m[NBPDR / PAGE_SIZE]; mt++)
 							vm_page_dirty(mt);
 					} else
 						vm_page_dirty(m);
 				}
 
 				/* Mark free */
 				PV_STAT(pv_entry_frees++);
 				PV_STAT(pv_entry_spare++);
 				pv_entry_count--;
 				pc->pc_map[field] |= bitmask;
 				if ((tpte & PG_PS) != 0) {
 					pmap->pm_stats.resident_count -= NBPDR / PAGE_SIZE;
 					pvh = pa_to_pvh(tpte & PG_PS_FRAME);
 					TAILQ_REMOVE(&pvh->pv_list, pv, pv_list);
 					if (TAILQ_EMPTY(&pvh->pv_list)) {
 						for (mt = m; mt < &m[NBPDR / PAGE_SIZE]; mt++)
 							if (TAILQ_EMPTY(&mt->md.pv_list))
 								vm_page_flag_clear(mt, PG_WRITEABLE);
 					}
 					mpte = pmap_lookup_pt_page(pmap, pv->pv_va);
 					if (mpte != NULL) {
 						pmap_remove_pt_page(pmap, mpte);
 						pmap->pm_stats.resident_count--;
 						KASSERT(mpte->wire_count == NPTEPG,
 						    ("pmap_remove_pages: pte page wire count error"));
 						mpte->wire_count = 0;
 						pmap_add_delayed_free_list(mpte, &free, FALSE);
 						atomic_subtract_int(&cnt.v_wire_count, 1);
 					}
 				} else {
 					pmap->pm_stats.resident_count--;
 					TAILQ_REMOVE(&m->md.pv_list, pv, pv_list);
 					if (TAILQ_EMPTY(&m->md.pv_list)) {
 						pvh = pa_to_pvh(VM_PAGE_TO_PHYS(m));
 						if (TAILQ_EMPTY(&pvh->pv_list))
 							vm_page_flag_clear(m, PG_WRITEABLE);
 					}
 					pmap_unuse_pt(pmap, pv->pv_va, &free);
 				}
 			}
 		}
 		if (allfree) {
 			PV_STAT(pv_entry_spare -= _NPCPV);
 			PV_STAT(pc_chunk_count--);
 			PV_STAT(pc_chunk_frees++);
 			TAILQ_REMOVE(&pmap->pm_pvchunk, pc, pc_list);
 			m = PHYS_TO_VM_PAGE(pmap_kextract((vm_offset_t)pc));
 			pmap_qremove((vm_offset_t)pc, 1);
 			vm_page_unwire(m, 0);
 			vm_page_free(m);
 			pmap_ptelist_free(&pv_vafree, (vm_offset_t)pc);
 		}
 	}
 	sched_unpin();
 	pmap_invalidate_all(pmap);
 	vm_page_unlock_queues();
 	PMAP_UNLOCK(pmap);
 	pmap_free_zero_pages(free);
 }
 
 /*
  *	pmap_is_modified:
  *
  *	Return whether or not the specified physical page was modified
  *	in any physical maps.
  */
 boolean_t
 pmap_is_modified(vm_page_t m)
 {
 	boolean_t rv;
 
 	KASSERT((m->flags & (PG_FICTITIOUS | PG_UNMANAGED)) == 0,
 	    ("pmap_is_modified: page %p is not managed", m));
 
 	/*
 	 * If the page is not VPO_BUSY, then PG_WRITEABLE cannot be
 	 * concurrently set while the object is locked.  Thus, if PG_WRITEABLE
 	 * is clear, no PTEs can have PG_M set.
 	 */
 	VM_OBJECT_LOCK_ASSERT(m->object, MA_OWNED);
 	if ((m->oflags & VPO_BUSY) == 0 &&
 	    (m->flags & PG_WRITEABLE) == 0)
 		return (FALSE);
 	vm_page_lock_queues();
 	rv = pmap_is_modified_pvh(&m->md) ||
 	    pmap_is_modified_pvh(pa_to_pvh(VM_PAGE_TO_PHYS(m)));
 	vm_page_unlock_queues();
 	return (rv);
 }
 
 /*
  * Returns TRUE if any of the given mappings were used to modify
  * physical memory.  Otherwise, returns FALSE.  Both page and 4mpage
  * mappings are supported.
  */
 static boolean_t
 pmap_is_modified_pvh(struct md_page *pvh)
 {
 	pv_entry_t pv;
 	pt_entry_t *pte;
 	pmap_t pmap;
 	boolean_t rv;
 
 	mtx_assert(&vm_page_queue_mtx, MA_OWNED);
 	rv = FALSE;
 	sched_pin();
 	TAILQ_FOREACH(pv, &pvh->pv_list, pv_list) {
 		pmap = PV_PMAP(pv);
 		PMAP_LOCK(pmap);
 		pte = pmap_pte_quick(pmap, pv->pv_va);
 		rv = (*pte & (PG_M | PG_RW)) == (PG_M | PG_RW);
 		PMAP_UNLOCK(pmap);
 		if (rv)
 			break;
 	}
 	sched_unpin();
 	return (rv);
 }
 
 /*
  *	pmap_is_prefaultable:
  *
  *	Return whether or not the specified virtual address is eligible
  *	for prefault.
  */
 boolean_t
 pmap_is_prefaultable(pmap_t pmap, vm_offset_t addr)
 {
 	pd_entry_t *pde;
 	pt_entry_t *pte;
 	boolean_t rv;
 
 	rv = FALSE;
 	PMAP_LOCK(pmap);
 	pde = pmap_pde(pmap, addr);
 	if (*pde != 0 && (*pde & PG_PS) == 0) {
 		pte = vtopte(addr);
 		rv = *pte == 0;
 	}
 	PMAP_UNLOCK(pmap);
 	return (rv);
 }
 
 /*
  *	pmap_is_referenced:
  *
  *	Return whether or not the specified physical page was referenced
  *	in any physical maps.
  */
 boolean_t
 pmap_is_referenced(vm_page_t m)
 {
 	boolean_t rv;
 
 	KASSERT((m->flags & (PG_FICTITIOUS | PG_UNMANAGED)) == 0,
 	    ("pmap_is_referenced: page %p is not managed", m));
 	vm_page_lock_queues();
 	rv = pmap_is_referenced_pvh(&m->md) ||
 	    pmap_is_referenced_pvh(pa_to_pvh(VM_PAGE_TO_PHYS(m)));
 	vm_page_unlock_queues();
 	return (rv);
 }
 
 /*
  * Returns TRUE if any of the given mappings were referenced and FALSE
  * otherwise.  Both page and 4mpage mappings are supported.
  */
 static boolean_t
 pmap_is_referenced_pvh(struct md_page *pvh)
 {
 	pv_entry_t pv;
 	pt_entry_t *pte;
 	pmap_t pmap;
 	boolean_t rv;
 
 	mtx_assert(&vm_page_queue_mtx, MA_OWNED);
 	rv = FALSE;
 	sched_pin();
 	TAILQ_FOREACH(pv, &pvh->pv_list, pv_list) {
 		pmap = PV_PMAP(pv);
 		PMAP_LOCK(pmap);
 		pte = pmap_pte_quick(pmap, pv->pv_va);
 		rv = (*pte & (PG_A | PG_V)) == (PG_A | PG_V);
 		PMAP_UNLOCK(pmap);
 		if (rv)
 			break;
 	}
 	sched_unpin();
 	return (rv);
 }
 
 /*
  * Clear the write and modified bits in each of the given page's mappings.
  */
 void
 pmap_remove_write(vm_page_t m)
 {
 	struct md_page *pvh;
 	pv_entry_t next_pv, pv;
 	pmap_t pmap;
 	pd_entry_t *pde;
 	pt_entry_t oldpte, *pte;
 	vm_offset_t va;
 
 	KASSERT((m->flags & (PG_FICTITIOUS | PG_UNMANAGED)) == 0,
 	    ("pmap_remove_write: page %p is not managed", m));
 
 	/*
 	 * If the page is not VPO_BUSY, then PG_WRITEABLE cannot be set by
 	 * another thread while the object is locked.  Thus, if PG_WRITEABLE
 	 * is clear, no page table entries need updating.
 	 */
 	VM_OBJECT_LOCK_ASSERT(m->object, MA_OWNED);
 	if ((m->oflags & VPO_BUSY) == 0 &&
 	    (m->flags & PG_WRITEABLE) == 0)
 		return;
 	vm_page_lock_queues();
 	sched_pin();
 	pvh = pa_to_pvh(VM_PAGE_TO_PHYS(m));
 	TAILQ_FOREACH_SAFE(pv, &pvh->pv_list, pv_list, next_pv) {
 		va = pv->pv_va;
 		pmap = PV_PMAP(pv);
 		PMAP_LOCK(pmap);
 		pde = pmap_pde(pmap, va);
 		if ((*pde & PG_RW) != 0)
 			(void)pmap_demote_pde(pmap, pde, va);
 		PMAP_UNLOCK(pmap);
 	}
 	TAILQ_FOREACH(pv, &m->md.pv_list, pv_list) {
 		pmap = PV_PMAP(pv);
 		PMAP_LOCK(pmap);
 		pde = pmap_pde(pmap, pv->pv_va);
 		KASSERT((*pde & PG_PS) == 0, ("pmap_clear_write: found"
 		    " a 4mpage in page %p's pv list", m));
 		pte = pmap_pte_quick(pmap, pv->pv_va);
 retry:
 		oldpte = *pte;
 		if ((oldpte & PG_RW) != 0) {
 			/*
 			 * Regardless of whether a pte is 32 or 64 bits
 			 * in size, PG_RW and PG_M are among the least
 			 * significant 32 bits.
 			 */
 			if (!atomic_cmpset_int((u_int *)pte, oldpte,
 			    oldpte & ~(PG_RW | PG_M)))
 				goto retry;
 			if ((oldpte & PG_M) != 0)
 				vm_page_dirty(m);
 			pmap_invalidate_page(pmap, pv->pv_va);
 		}
 		PMAP_UNLOCK(pmap);
 	}
 	vm_page_flag_clear(m, PG_WRITEABLE);
 	sched_unpin();
 	vm_page_unlock_queues();
 }
 
 /*
  *	pmap_ts_referenced:
  *
  *	Return a count of reference bits for a page, clearing those bits.
  *	It is not necessary for every reference bit to be cleared, but it
  *	is necessary that 0 only be returned when there are truly no
  *	reference bits set.
  *
  *	XXX: The exact number of bits to check and clear is a matter that
  *	should be tested and standardized at some point in the future for
  *	optimal aging of shared pages.
  */
 int
 pmap_ts_referenced(vm_page_t m)
 {
 	struct md_page *pvh;
 	pv_entry_t pv, pvf, pvn;
 	pmap_t pmap;
 	pd_entry_t oldpde, *pde;
 	pt_entry_t *pte;
 	vm_offset_t va;
 	int rtval = 0;
 
 	KASSERT((m->flags & (PG_FICTITIOUS | PG_UNMANAGED)) == 0,
 	    ("pmap_ts_referenced: page %p is not managed", m));
 	pvh = pa_to_pvh(VM_PAGE_TO_PHYS(m));
 	vm_page_lock_queues();
 	sched_pin();
 	TAILQ_FOREACH_SAFE(pv, &pvh->pv_list, pv_list, pvn) {
 		va = pv->pv_va;
 		pmap = PV_PMAP(pv);
 		PMAP_LOCK(pmap);
 		pde = pmap_pde(pmap, va);
 		oldpde = *pde;
 		if ((oldpde & PG_A) != 0) {
 			if (pmap_demote_pde(pmap, pde, va)) {
 				if ((oldpde & PG_W) == 0) {
 					/*
 					 * Remove the mapping to a single page
 					 * so that a subsequent access may
 					 * repromote.  Since the underlying
 					 * page table page is fully populated,
 					 * this removal never frees a page
 					 * table page.
 					 */
 					va += VM_PAGE_TO_PHYS(m) - (oldpde &
 					    PG_PS_FRAME);
 					pmap_remove_page(pmap, va, NULL);
 					rtval++;
 					if (rtval > 4) {
 						PMAP_UNLOCK(pmap);
 						goto out;
 					}
 				}
 			}
 		}
 		PMAP_UNLOCK(pmap);
 	}
 	if ((pv = TAILQ_FIRST(&m->md.pv_list)) != NULL) {
 		pvf = pv;
 		do {
 			pvn = TAILQ_NEXT(pv, pv_list);
 			TAILQ_REMOVE(&m->md.pv_list, pv, pv_list);
 			TAILQ_INSERT_TAIL(&m->md.pv_list, pv, pv_list);
 			pmap = PV_PMAP(pv);
 			PMAP_LOCK(pmap);
 			pde = pmap_pde(pmap, pv->pv_va);
 			KASSERT((*pde & PG_PS) == 0, ("pmap_ts_referenced:"
 			    " found a 4mpage in page %p's pv list", m));
 			pte = pmap_pte_quick(pmap, pv->pv_va);
 			if ((*pte & PG_A) != 0) {
 				atomic_clear_int((u_int *)pte, PG_A);
 				pmap_invalidate_page(pmap, pv->pv_va);
 				rtval++;
 				if (rtval > 4)
 					pvn = NULL;
 			}
 			PMAP_UNLOCK(pmap);
 		} while ((pv = pvn) != NULL && pv != pvf);
 	}
 out:
 	sched_unpin();
 	vm_page_unlock_queues();
 	return (rtval);
 }
 
 /*
  *	Clear the modify bits on the specified physical page.
  */
 void
 pmap_clear_modify(vm_page_t m)
 {
 	struct md_page *pvh;
 	pv_entry_t next_pv, pv;
 	pmap_t pmap;
 	pd_entry_t oldpde, *pde;
 	pt_entry_t oldpte, *pte;
 	vm_offset_t va;
 
 	KASSERT((m->flags & (PG_FICTITIOUS | PG_UNMANAGED)) == 0,
 	    ("pmap_clear_modify: page %p is not managed", m));
 	VM_OBJECT_LOCK_ASSERT(m->object, MA_OWNED);
 	KASSERT((m->oflags & VPO_BUSY) == 0,
 	    ("pmap_clear_modify: page %p is busy", m));
 
 	/*
 	 * If the page is not PG_WRITEABLE, then no PTEs can have PG_M set.
 	 * If the object containing the page is locked and the page is not
 	 * VPO_BUSY, then PG_WRITEABLE cannot be concurrently set.
 	 */
 	if ((m->flags & PG_WRITEABLE) == 0)
 		return;
 	vm_page_lock_queues();
 	sched_pin();
 	pvh = pa_to_pvh(VM_PAGE_TO_PHYS(m));
 	TAILQ_FOREACH_SAFE(pv, &pvh->pv_list, pv_list, next_pv) {
 		va = pv->pv_va;
 		pmap = PV_PMAP(pv);
 		PMAP_LOCK(pmap);
 		pde = pmap_pde(pmap, va);
 		oldpde = *pde;
 		if ((oldpde & PG_RW) != 0) {
 			if (pmap_demote_pde(pmap, pde, va)) {
 				if ((oldpde & PG_W) == 0) {
 					/*
 					 * Write protect the mapping to a
 					 * single page so that a subsequent
 					 * write access may repromote.
 					 */
 					va += VM_PAGE_TO_PHYS(m) - (oldpde &
 					    PG_PS_FRAME);
 					pte = pmap_pte_quick(pmap, va);
 					oldpte = *pte;
 					if ((oldpte & PG_V) != 0) {
 						/*
 						 * Regardless of whether a pte is 32 or 64 bits
 						 * in size, PG_RW and PG_M are among the least
 						 * significant 32 bits.
 						 */
 						while (!atomic_cmpset_int((u_int *)pte,
 						    oldpte,
 						    oldpte & ~(PG_M | PG_RW)))
 							oldpte = *pte;
 						vm_page_dirty(m);
 						pmap_invalidate_page(pmap, va);
 					}
 				}
 			}
 		}
 		PMAP_UNLOCK(pmap);
 	}
 	TAILQ_FOREACH(pv, &m->md.pv_list, pv_list) {
 		pmap = PV_PMAP(pv);
 		PMAP_LOCK(pmap);
 		pde = pmap_pde(pmap, pv->pv_va);
 		KASSERT((*pde & PG_PS) == 0, ("pmap_clear_modify: found"
 		    " a 4mpage in page %p's pv list", m));
 		pte = pmap_pte_quick(pmap, pv->pv_va);
 		if ((*pte & (PG_M | PG_RW)) == (PG_M | PG_RW)) {
 			/*
 			 * Regardless of whether a pte is 32 or 64 bits
 			 * in size, PG_M is among the least significant
 			 * 32 bits. 
 			 */
 			atomic_clear_int((u_int *)pte, PG_M);
 			pmap_invalidate_page(pmap, pv->pv_va);
 		}
 		PMAP_UNLOCK(pmap);
 	}
 	sched_unpin();
 	vm_page_unlock_queues();
 }
 
 /*
  *	pmap_clear_reference:
  *
  *	Clear the reference bit on the specified physical page.
  */
 void
 pmap_clear_reference(vm_page_t m)
 {
 	struct md_page *pvh;
 	pv_entry_t next_pv, pv;
 	pmap_t pmap;
 	pd_entry_t oldpde, *pde;
 	pt_entry_t *pte;
 	vm_offset_t va;
 
 	KASSERT((m->flags & (PG_FICTITIOUS | PG_UNMANAGED)) == 0,
 	    ("pmap_clear_reference: page %p is not managed", m));
 	vm_page_lock_queues();
 	sched_pin();
 	pvh = pa_to_pvh(VM_PAGE_TO_PHYS(m));
 	TAILQ_FOREACH_SAFE(pv, &pvh->pv_list, pv_list, next_pv) {
 		va = pv->pv_va;
 		pmap = PV_PMAP(pv);
 		PMAP_LOCK(pmap);
 		pde = pmap_pde(pmap, va);
 		oldpde = *pde;
 		if ((oldpde & PG_A) != 0) {
 			if (pmap_demote_pde(pmap, pde, va)) {
 				/*
 				 * Remove the mapping to a single page so
 				 * that a subsequent access may repromote.
 				 * Since the underlying page table page is
 				 * fully populated, this removal never frees
 				 * a page table page.
 				 */
 				va += VM_PAGE_TO_PHYS(m) - (oldpde &
 				    PG_PS_FRAME);
 				pmap_remove_page(pmap, va, NULL);
 			}
 		}
 		PMAP_UNLOCK(pmap);
 	}
 	TAILQ_FOREACH(pv, &m->md.pv_list, pv_list) {
 		pmap = PV_PMAP(pv);
 		PMAP_LOCK(pmap);
 		pde = pmap_pde(pmap, pv->pv_va);
 		KASSERT((*pde & PG_PS) == 0, ("pmap_clear_reference: found"
 		    " a 4mpage in page %p's pv list", m));
 		pte = pmap_pte_quick(pmap, pv->pv_va);
 		if ((*pte & PG_A) != 0) {
 			/*
 			 * Regardless of whether a pte is 32 or 64 bits
 			 * in size, PG_A is among the least significant
 			 * 32 bits. 
 			 */
 			atomic_clear_int((u_int *)pte, PG_A);
 			pmap_invalidate_page(pmap, pv->pv_va);
 		}
 		PMAP_UNLOCK(pmap);
 	}
 	sched_unpin();
 	vm_page_unlock_queues();
 }
 
 /*
  * Miscellaneous support routines follow
  */
 
 /* Adjust the cache mode for a 4KB page mapped via a PTE. */
 static __inline void
 pmap_pte_attr(pt_entry_t *pte, int cache_bits)
 {
 	u_int opte, npte;
 
 	/*
 	 * The cache mode bits are all in the low 32-bits of the
 	 * PTE, so we can just spin on updating the low 32-bits.
 	 */
 	do {
 		opte = *(u_int *)pte;
 		npte = opte & ~PG_PTE_CACHE;
 		npte |= cache_bits;
 	} while (npte != opte && !atomic_cmpset_int((u_int *)pte, opte, npte));
 }
 
 /* Adjust the cache mode for a 2/4MB page mapped via a PDE. */
 static __inline void
 pmap_pde_attr(pd_entry_t *pde, int cache_bits)
 {
 	u_int opde, npde;
 
 	/*
 	 * The cache mode bits are all in the low 32-bits of the
 	 * PDE, so we can just spin on updating the low 32-bits.
 	 */
 	do {
 		opde = *(u_int *)pde;
 		npde = opde & ~PG_PDE_CACHE;
 		npde |= cache_bits;
 	} while (npde != opde && !atomic_cmpset_int((u_int *)pde, opde, npde));
 }
 
 /*
  * Map a set of physical memory pages into the kernel virtual
  * address space. Return a pointer to where it is mapped. This
  * routine is intended to be used for mapping device memory,
  * NOT real memory.
  */
 void *
 pmap_mapdev_attr(vm_paddr_t pa, vm_size_t size, int mode)
 {
 	vm_offset_t va, offset;
 	vm_size_t tmpsize;
 
 	offset = pa & PAGE_MASK;
 	size = roundup(offset + size, PAGE_SIZE);
 	pa = pa & PG_FRAME;
 
 	if (pa < KERNLOAD && pa + size <= KERNLOAD)
 		va = KERNBASE + pa;
 	else
 		va = kmem_alloc_nofault(kernel_map, size);
 	if (!va)
 		panic("pmap_mapdev: Couldn't alloc kernel virtual memory");
 
 	for (tmpsize = 0; tmpsize < size; tmpsize += PAGE_SIZE)
 		pmap_kenter_attr(va + tmpsize, pa + tmpsize, mode);
 	pmap_invalidate_range(kernel_pmap, va, va + tmpsize);
 	pmap_invalidate_cache_range(va, va + size);
 	return ((void *)(va + offset));
 }
 
 void *
 pmap_mapdev(vm_paddr_t pa, vm_size_t size)
 {
 
 	return (pmap_mapdev_attr(pa, size, PAT_UNCACHEABLE));
 }
 
 void *
 pmap_mapbios(vm_paddr_t pa, vm_size_t size)
 {
 
 	return (pmap_mapdev_attr(pa, size, PAT_WRITE_BACK));
 }
 
 void
 pmap_unmapdev(vm_offset_t va, vm_size_t size)
 {
 	vm_offset_t base, offset, tmpva;
 
 	if (va >= KERNBASE && va + size <= KERNBASE + KERNLOAD)
 		return;
 	base = trunc_page(va);
 	offset = va & PAGE_MASK;
 	size = roundup(offset + size, PAGE_SIZE);
 	for (tmpva = base; tmpva < (base + size); tmpva += PAGE_SIZE)
 		pmap_kremove(tmpva);
 	pmap_invalidate_range(kernel_pmap, va, tmpva);
 	kmem_free(kernel_map, base, size);
 }
 
 /*
  * Sets the memory attribute for the specified page.
  */
 void
 pmap_page_set_memattr(vm_page_t m, vm_memattr_t ma)
 {
 
 	m->md.pat_mode = ma;
 	if ((m->flags & PG_FICTITIOUS) != 0)
 		return;
 
 	/*
 	 * If "m" is a normal page, flush it from the cache.
 	 * See pmap_invalidate_cache_range().
 	 *
 	 * First, try to find an existing mapping of the page by sf
 	 * buffer. sf_buf_invalidate_cache() modifies mapping and
 	 * flushes the cache.
 	 */    
 	if (sf_buf_invalidate_cache(m))
 		return;
 
 	/*
 	 * If page is not mapped by sf buffer, but CPU does not
 	 * support self snoop, map the page transient and do
 	 * invalidation. In the worst case, whole cache is flushed by
 	 * pmap_invalidate_cache_range().
 	 */
 	if ((cpu_feature & CPUID_SS) == 0)
 		pmap_flush_page(m);
 }
 
 static void
 pmap_flush_page(vm_page_t m)
 {
 	struct sysmaps *sysmaps;
 	vm_offset_t sva, eva;
 
 	if ((cpu_feature & CPUID_CLFSH) != 0) {
 		sysmaps = &sysmaps_pcpu[PCPU_GET(cpuid)];
 		mtx_lock(&sysmaps->lock);
 		if (*sysmaps->CMAP2)
 			panic("pmap_flush_page: CMAP2 busy");
 		sched_pin();
 		*sysmaps->CMAP2 = PG_V | PG_RW | VM_PAGE_TO_PHYS(m) |
 		    PG_A | PG_M | pmap_cache_bits(m->md.pat_mode, 0);
 		invlcaddr(sysmaps->CADDR2);
 		sva = (vm_offset_t)sysmaps->CADDR2;
 		eva = sva + PAGE_SIZE;
 
 		/*
 		 * Use mfence despite the ordering implied by
 		 * mtx_{un,}lock() because clflush is not guaranteed
 		 * to be ordered by any other instruction.
 		 */
 		mfence();
 		for (; sva < eva; sva += cpu_clflush_line_size)
 			clflush(sva);
 		mfence();
 		*sysmaps->CMAP2 = 0;
 		sched_unpin();
 		mtx_unlock(&sysmaps->lock);
 	} else
 		pmap_invalidate_cache();
 }
 
 /*
  * Changes the specified virtual address range's memory type to that given by
  * the parameter "mode".  The specified virtual address range must be
  * completely contained within the kernel map.
  *
  * Returns zero if the change completed successfully, and either EINVAL or
  * ENOMEM if the change failed.  Specifically, EINVAL is returned if some part
  * of the virtual address range was not mapped, and ENOMEM is returned if
  * there was insufficient memory available to complete the change.
  */
 int
 pmap_change_attr(vm_offset_t va, vm_size_t size, int mode)
 {
 	vm_offset_t base, offset, tmpva;
 	pd_entry_t *pde;
 	pt_entry_t *pte;
 	int cache_bits_pte, cache_bits_pde;
 	boolean_t changed;
 
 	base = trunc_page(va);
 	offset = va & PAGE_MASK;
 	size = roundup(offset + size, PAGE_SIZE);
 
 	/*
 	 * Only supported on kernel virtual addresses above the recursive map.
 	 */
 	if (base < VM_MIN_KERNEL_ADDRESS)
 		return (EINVAL);
 
 	cache_bits_pde = pmap_cache_bits(mode, 1);
 	cache_bits_pte = pmap_cache_bits(mode, 0);
 	changed = FALSE;
 
 	/*
 	 * Pages that aren't mapped aren't supported.  Also break down
 	 * 2/4MB pages into 4KB pages if required.
 	 */
 	PMAP_LOCK(kernel_pmap);
 	for (tmpva = base; tmpva < base + size; ) {
 		pde = pmap_pde(kernel_pmap, tmpva);
 		if (*pde == 0) {
 			PMAP_UNLOCK(kernel_pmap);
 			return (EINVAL);
 		}
 		if (*pde & PG_PS) {
 			/*
 			 * If the current 2/4MB page already has
 			 * the required memory type, then we need not
 			 * demote this page.  Just increment tmpva to
 			 * the next 2/4MB page frame.
 			 */
 			if ((*pde & PG_PDE_CACHE) == cache_bits_pde) {
 				tmpva = trunc_4mpage(tmpva) + NBPDR;
 				continue;
 			}
 
 			/*
 			 * If the current offset aligns with a 2/4MB
 			 * page frame and there is at least 2/4MB left
 			 * within the range, then we need not break
 			 * down this page into 4KB pages.
 			 */
 			if ((tmpva & PDRMASK) == 0 &&
 			    tmpva + PDRMASK < base + size) {
 				tmpva += NBPDR;
 				continue;
 			}
 			if (!pmap_demote_pde(kernel_pmap, pde, tmpva)) {
 				PMAP_UNLOCK(kernel_pmap);
 				return (ENOMEM);
 			}
 		}
 		pte = vtopte(tmpva);
 		if (*pte == 0) {
 			PMAP_UNLOCK(kernel_pmap);
 			return (EINVAL);
 		}
 		tmpva += PAGE_SIZE;
 	}
 	PMAP_UNLOCK(kernel_pmap);
 
 	/*
 	 * Ok, all the pages exist, so run through them updating their
 	 * cache mode if required.
 	 */
 	for (tmpva = base; tmpva < base + size; ) {
 		pde = pmap_pde(kernel_pmap, tmpva);
 		if (*pde & PG_PS) {
 			if ((*pde & PG_PDE_CACHE) != cache_bits_pde) {
 				pmap_pde_attr(pde, cache_bits_pde);
 				changed = TRUE;
 			}
 			tmpva = trunc_4mpage(tmpva) + NBPDR;
 		} else {
 			pte = vtopte(tmpva);
 			if ((*pte & PG_PTE_CACHE) != cache_bits_pte) {
 				pmap_pte_attr(pte, cache_bits_pte);
 				changed = TRUE;
 			}
 			tmpva += PAGE_SIZE;
 		}
 	}
 
 	/*
 	 * Flush CPU caches to make sure any data isn't cached that
 	 * shouldn't be, etc.
 	 */
 	if (changed) {
 		pmap_invalidate_range(kernel_pmap, base, tmpva);
 		pmap_invalidate_cache_range(base, tmpva);
 	}
 	return (0);
 }
 
 /*
  * perform the pmap work for mincore
  */
 int
 pmap_mincore(pmap_t pmap, vm_offset_t addr, vm_paddr_t *locked_pa)
 {
 	pd_entry_t *pdep;
 	pt_entry_t *ptep, pte;
 	vm_paddr_t pa;
 	int val;
 
 	PMAP_LOCK(pmap);
 retry:
 	pdep = pmap_pde(pmap, addr);
 	if (*pdep != 0) {
 		if (*pdep & PG_PS) {
 			pte = *pdep;
 			/* Compute the physical address of the 4KB page. */
 			pa = ((*pdep & PG_PS_FRAME) | (addr & PDRMASK)) &
 			    PG_FRAME;
 			val = MINCORE_SUPER;
 		} else {
 			ptep = pmap_pte(pmap, addr);
 			pte = *ptep;
 			pmap_pte_release(ptep);
 			pa = pte & PG_FRAME;
 			val = 0;
 		}
 	} else {
 		pte = 0;
 		pa = 0;
 		val = 0;
 	}
 	if ((pte & PG_V) != 0) {
 		val |= MINCORE_INCORE;
 		if ((pte & (PG_M | PG_RW)) == (PG_M | PG_RW))
 			val |= MINCORE_MODIFIED | MINCORE_MODIFIED_OTHER;
 		if ((pte & PG_A) != 0)
 			val |= MINCORE_REFERENCED | MINCORE_REFERENCED_OTHER;
 	}
 	if ((val & (MINCORE_MODIFIED_OTHER | MINCORE_REFERENCED_OTHER)) !=
 	    (MINCORE_MODIFIED_OTHER | MINCORE_REFERENCED_OTHER) &&
 	    (pte & (PG_MANAGED | PG_V)) == (PG_MANAGED | PG_V)) {
 		/* Ensure that "PHYS_TO_VM_PAGE(pa)->object" doesn't change. */
 		if (vm_page_pa_tryrelock(pmap, pa, locked_pa))
 			goto retry;
 	} else
 		PA_UNLOCK_COND(*locked_pa);
 	PMAP_UNLOCK(pmap);
 	return (val);
 }
 
 void
 pmap_activate(struct thread *td)
 {
 	pmap_t	pmap, oldpmap;
 	u_int32_t  cr3;
 
 	critical_enter();
 	pmap = vmspace_pmap(td->td_proc->p_vmspace);
 	oldpmap = PCPU_GET(curpmap);
 #if defined(SMP)
-	atomic_clear_int(&oldpmap->pm_active, PCPU_GET(cpumask));
-	atomic_set_int(&pmap->pm_active, PCPU_GET(cpumask));
+	CPU_NAND_ATOMIC(&oldpmap->pm_active, PCPU_PTR(cpumask));
+	CPU_OR_ATOMIC(&pmap->pm_active, PCPU_PTR(cpumask));
 #else
-	oldpmap->pm_active &= ~1;
-	pmap->pm_active |= 1;
+	CPU_NAND(&oldpmap->pm_active, PCPU_PTR(cpumask));
+	CPU_OR(&pmap->pm_active, PCPU_PTR(cpumask));
 #endif
 #ifdef PAE
 	cr3 = vtophys(pmap->pm_pdpt);
 #else
 	cr3 = vtophys(pmap->pm_pdir);
 #endif
 	/*
 	 * pmap_activate is for the current thread on the current cpu
 	 */
 	td->td_pcb->pcb_cr3 = cr3;
 	load_cr3(cr3);
 	PCPU_SET(curpmap, pmap);
 	critical_exit();
 }
 
 void
 pmap_sync_icache(pmap_t pm, vm_offset_t va, vm_size_t sz)
 {
 }
 
 /*
  *	Increase the starting virtual address of the given mapping if a
  *	different alignment might result in more superpage mappings.
  */
 void
 pmap_align_superpage(vm_object_t object, vm_ooffset_t offset,
     vm_offset_t *addr, vm_size_t size)
 {
 	vm_offset_t superpage_offset;
 
 	if (size < NBPDR)
 		return;
 	if (object != NULL && (object->flags & OBJ_COLORED) != 0)
 		offset += ptoa(object->pg_color);
 	superpage_offset = offset & PDRMASK;
 	if (size - ((NBPDR - superpage_offset) & PDRMASK) < NBPDR ||
 	    (*addr & PDRMASK) == superpage_offset)
 		return;
 	if ((*addr & PDRMASK) < superpage_offset)
 		*addr = (*addr & ~PDRMASK) + superpage_offset;
 	else
 		*addr = ((*addr + PDRMASK) & ~PDRMASK) + superpage_offset;
 }
 
 
 #if defined(PMAP_DEBUG)
 pmap_pid_dump(int pid)
 {
 	pmap_t pmap;
 	struct proc *p;
 	int npte = 0;
 	int index;
 
 	sx_slock(&allproc_lock);
 	FOREACH_PROC_IN_SYSTEM(p) {
 		if (p->p_pid != pid)
 			continue;
 
 		if (p->p_vmspace) {
 			int i,j;
 			index = 0;
 			pmap = vmspace_pmap(p->p_vmspace);
 			for (i = 0; i < NPDEPTD; i++) {
 				pd_entry_t *pde;
 				pt_entry_t *pte;
 				vm_offset_t base = i << PDRSHIFT;
 				
 				pde = &pmap->pm_pdir[i];
 				if (pde && pmap_pde_v(pde)) {
 					for (j = 0; j < NPTEPG; j++) {
 						vm_offset_t va = base + (j << PAGE_SHIFT);
 						if (va >= (vm_offset_t) VM_MIN_KERNEL_ADDRESS) {
 							if (index) {
 								index = 0;
 								printf("\n");
 							}
 							sx_sunlock(&allproc_lock);
 							return (npte);
 						}
 						pte = pmap_pte(pmap, va);
 						if (pte && pmap_pte_v(pte)) {
 							pt_entry_t pa;
 							vm_page_t m;
 							pa = *pte;
 							m = PHYS_TO_VM_PAGE(pa & PG_FRAME);
 							printf("va: 0x%x, pt: 0x%x, h: %d, w: %d, f: 0x%x",
 								va, pa, m->hold_count, m->wire_count, m->flags);
 							npte++;
 							index++;
 							if (index >= 2) {
 								index = 0;
 								printf("\n");
 							} else {
 								printf(" ");
 							}
 						}
 					}
 				}
 			}
 		}
 	}
 	sx_sunlock(&allproc_lock);
 	return (npte);
 }
 #endif
 
 #if defined(DEBUG)
 
 static void	pads(pmap_t pm);
 void		pmap_pvdump(vm_offset_t pa);
 
 /* print address space of pmap*/
 static void
 pads(pmap_t pm)
 {
 	int i, j;
 	vm_paddr_t va;
 	pt_entry_t *ptep;
 
 	if (pm == kernel_pmap)
 		return;
 	for (i = 0; i < NPDEPTD; i++)
 		if (pm->pm_pdir[i])
 			for (j = 0; j < NPTEPG; j++) {
 				va = (i << PDRSHIFT) + (j << PAGE_SHIFT);
 				if (pm == kernel_pmap && va < KERNBASE)
 					continue;
 				if (pm != kernel_pmap && va > UPT_MAX_ADDRESS)
 					continue;
 				ptep = pmap_pte(pm, va);
 				if (pmap_pte_v(ptep))
 					printf("%x:%x ", va, *ptep);
 			};
 
 }
 
 void
 pmap_pvdump(vm_paddr_t pa)
 {
 	pv_entry_t pv;
 	pmap_t pmap;
 	vm_page_t m;
 
 	printf("pa %x", pa);
 	m = PHYS_TO_VM_PAGE(pa);
 	TAILQ_FOREACH(pv, &m->md.pv_list, pv_list) {
 		pmap = PV_PMAP(pv);
 		printf(" -> pmap %p, va %x", (void *)pmap, pv->pv_va);
 		pads(pmap);
 	}
 	printf(" ");
 }
 #endif
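Note on the pmap_activate() hunk above: the change replaces integer bit
operations on pm_active with set operations, so the outgoing pmap drops
this CPU from its active set and the incoming pmap gains it. The sketch
below is a user-space model of that update, not kernel code. The names
model_cpuset, model_cpu_or, model_cpu_nand and model_activate are
hypothetical, the 64-CPU limit is an assumption, and the atomicity of the
SMP variants is deliberately not modelled.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical user-space model of a CPU set; not the kernel's cpuset_t. */
#define MODEL_MAXCPU	64
#define MODEL_WORDBITS	(sizeof(unsigned long) * CHAR_BIT)
#define MODEL_NWORDS	((MODEL_MAXCPU + MODEL_WORDBITS - 1) / MODEL_WORDBITS)

struct model_cpuset {
	unsigned long bits[MODEL_NWORDS];
};

/* Empty the set. */
static void
model_cpu_zero(struct model_cpuset *s)
{
	memset(s, 0, sizeof(*s));
}

/* Add CPU n to the set. */
static void
model_cpu_set(size_t n, struct model_cpuset *s)
{
	assert(n < MODEL_MAXCPU);
	s->bits[n / MODEL_WORDBITS] |= 1UL << (n % MODEL_WORDBITS);
}

/* Return nonzero if CPU n is a member of the set. */
static int
model_cpu_isset(size_t n, const struct model_cpuset *s)
{
	assert(n < MODEL_MAXCPU);
	return ((s->bits[n / MODEL_WORDBITS] >> (n % MODEL_WORDBITS)) & 1UL);
}

/* d |= s: every CPU in s becomes a member of d. */
static void
model_cpu_or(struct model_cpuset *d, const struct model_cpuset *s)
{
	size_t i;

	for (i = 0; i < MODEL_NWORDS; i++)
		d->bits[i] |= s->bits[i];
}

/* d &= ~s: every CPU in s is removed from d. */
static void
model_cpu_nand(struct model_cpuset *d, const struct model_cpuset *s)
{
	size_t i;

	for (i = 0; i < MODEL_NWORDS; i++)
		d->bits[i] &= ~s->bits[i];
}

/*
 * Model of the pm_active update: "self" holds only the current CPU,
 * which leaves the old address space and joins the new one.
 */
static void
model_activate(struct model_cpuset *old_active,
    struct model_cpuset *new_active, const struct model_cpuset *self)
{
	model_cpu_nand(old_active, self);
	model_cpu_or(new_active, self);
}
```

Working word by word is what lets a set grow past the width of a single
integer, which a plain int mask cannot do. In the model, other CPUs that
are members of the old set are left untouched by model_activate().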
Index: head/sys/i386/i386/vm_machdep.c
===================================================================
--- head/sys/i386/i386/vm_machdep.c	(revision 222812)
+++ head/sys/i386/i386/vm_machdep.c	(revision 222813)
@@ -1,957 +1,965 @@
 /*-
  * Copyright (c) 1982, 1986 The Regents of the University of California.
  * Copyright (c) 1989, 1990 William Jolitz
  * Copyright (c) 1994 John Dyson
  * All rights reserved.
  *
  * This code is derived from software contributed to Berkeley by
  * the Systems Programming Group of the University of Utah Computer
  * Science Department, and William Jolitz.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
  * 1. Redistributions of source code must retain the above copyright
  *    notice, this list of conditions and the following disclaimer.
  * 2. Redistributions in binary form must reproduce the above copyright
  *    notice, this list of conditions and the following disclaimer in the
  *    documentation and/or other materials provided with the distribution.
  * 3. All advertising materials mentioning features or use of this software
  *    must display the following acknowledgement:
  *	This product includes software developed by the University of
  *	California, Berkeley and its contributors.
  * 4. Neither the name of the University nor the names of its contributors
  *    may be used to endorse or promote products derived from this software
  *    without specific prior written permission.
  *
  * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
  * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
  * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
  * ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
  * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
  * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
  * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
  * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
  * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
  * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
  * SUCH DAMAGE.
  *
  *	from: @(#)vm_machdep.c	7.3 (Berkeley) 5/13/91
  *	Utah $Hdr: vm_machdep.c 1.16.1.1 89/06/23$
  */
 
 #include <sys/cdefs.h>
 __FBSDID("$FreeBSD$");
 
 #include "opt_isa.h"
 #include "opt_npx.h"
 #include "opt_reset.h"
 #include "opt_cpu.h"
 #include "opt_xbox.h"
 
 #include <sys/param.h>
 #include <sys/systm.h>
 #include <sys/bio.h>
 #include <sys/buf.h>
 #include <sys/kernel.h>
 #include <sys/ktr.h>
 #include <sys/lock.h>
 #include <sys/malloc.h>
 #include <sys/mbuf.h>
 #include <sys/mutex.h>
 #include <sys/pioctl.h>
 #include <sys/proc.h>
 #include <sys/sysent.h>
 #include <sys/sf_buf.h>
 #include <sys/smp.h>
 #include <sys/sched.h>
 #include <sys/sysctl.h>
 #include <sys/unistd.h>
 #include <sys/vnode.h>
 #include <sys/vmmeter.h>
 
 #include <machine/cpu.h>
 #include <machine/cputypes.h>
 #include <machine/md_var.h>
 #include <machine/pcb.h>
 #include <machine/pcb_ext.h>
 #include <machine/smp.h>
 #include <machine/vm86.h>
 
 #ifdef CPU_ELAN
 #include <machine/elan_mmcr.h>
 #endif
 
 #include <vm/vm.h>
 #include <vm/vm_extern.h>
 #include <vm/vm_kern.h>
 #include <vm/vm_page.h>
 #include <vm/vm_map.h>
 #include <vm/vm_param.h>
 
 #ifdef XEN
 #include <xen/hypervisor.h>
 #endif
 #ifdef PC98
 #include <pc98/cbus/cbus.h>
 #else
 #include <x86/isa/isa.h>
 #endif
 
 #ifdef XBOX
 #include <machine/xbox.h>
 #endif
 
 #ifndef NSFBUFS
 #define	NSFBUFS		(512 + maxusers * 16)
 #endif
 
 static void	cpu_reset_real(void);
 #ifdef SMP
 static void	cpu_reset_proxy(void);
 static u_int	cpu_reset_proxyid;
 static volatile u_int	cpu_reset_proxy_active;
 #endif
 static void	sf_buf_init(void *arg);
 SYSINIT(sock_sf, SI_SUB_MBUF, SI_ORDER_ANY, sf_buf_init, NULL);
 
 LIST_HEAD(sf_head, sf_buf);
 
 /*
  * A hash table of active sendfile(2) buffers
  */
 static struct sf_head *sf_buf_active;
 static u_long sf_buf_hashmask;
 
 #define	SF_BUF_HASH(m)	(((m) - vm_page_array) & sf_buf_hashmask)
 
 static TAILQ_HEAD(, sf_buf) sf_buf_freelist;
 static u_int	sf_buf_alloc_want;
 
 /*
  * A lock used to synchronize access to the hash table and free list
  */
 static struct mtx sf_buf_lock;
 
 extern int	_ucodesel, _udatasel;
 
 /*
  * Finish a fork operation, with process p2 nearly set up.
  * Copy and update the pcb, set up the stack so that the child
  * is ready to run and return to user mode.
  */
 void
 cpu_fork(td1, p2, td2, flags)
 	register struct thread *td1;
 	register struct proc *p2;
 	struct thread *td2;
 	int flags;
 {
 	register struct proc *p1;
 	struct pcb *pcb2;
 	struct mdproc *mdp2;
 
 	p1 = td1->td_proc;
 	if ((flags & RFPROC) == 0) {
 		if ((flags & RFMEM) == 0) {
 			/* unshare user LDT */
 			struct mdproc *mdp1 = &p1->p_md;
 			struct proc_ldt *pldt, *pldt1;
 
 			mtx_lock_spin(&dt_lock);
 			if ((pldt1 = mdp1->md_ldt) != NULL &&
 			    pldt1->ldt_refcnt > 1) {
 				pldt = user_ldt_alloc(mdp1, pldt1->ldt_len);
 				if (pldt == NULL)
 					panic("could not copy LDT");
 				mdp1->md_ldt = pldt;
 				set_user_ldt(mdp1);
 				user_ldt_deref(pldt1);
 			} else
 				mtx_unlock_spin(&dt_lock);
 		}
 		return;
 	}
 
 	/* Ensure that td1's pcb is up to date. */
 	if (td1 == curthread)
 		td1->td_pcb->pcb_gs = rgs();
 #ifdef DEV_NPX
 	critical_enter();
 	if (PCPU_GET(fpcurthread) == td1)
 		npxsave(td1->td_pcb->pcb_save);
 	critical_exit();
 #endif
 
 	/* Point the pcb to the top of the stack */
 	pcb2 = (struct pcb *)(td2->td_kstack +
 	    td2->td_kstack_pages * PAGE_SIZE) - 1;
 	td2->td_pcb = pcb2;
 
 	/* Copy td1's pcb */
 	bcopy(td1->td_pcb, pcb2, sizeof(*pcb2));
 
 	/* Properly initialize pcb_save */
 	pcb2->pcb_save = &pcb2->pcb_user_save;
 
 	/* Point mdproc and then copy over td1's contents */
 	mdp2 = &p2->p_md;
 	bcopy(&p1->p_md, mdp2, sizeof(*mdp2));
 
 	/*
 	 * Create a new fresh stack for the new process.
 	 * Copy the trap frame for the return to user mode as if from a
 	 * syscall.  This copies most of the user mode register values.
 	 * The -16 is so we can expand the trapframe if we go to vm86.
 	 */
 	td2->td_frame = (struct trapframe *)((caddr_t)td2->td_pcb - 16) - 1;
 	bcopy(td1->td_frame, td2->td_frame, sizeof(struct trapframe));
 
 	td2->td_frame->tf_eax = 0;		/* Child returns zero */
 	td2->td_frame->tf_eflags &= ~PSL_C;	/* success */
 	td2->td_frame->tf_edx = 1;
 
 	/*
 	 * If the parent process has the trap bit set (i.e. a debugger had
 	 * single stepped the process to the system call), we need to clear
 	 * the trap flag from the new frame unless the debugger had set PF_FORK
 	 * on the parent.  Otherwise, the child will receive a (likely
 	 * unexpected) SIGTRAP when it executes the first instruction after
 	 * returning  to userland.
 	 */
 	if ((p1->p_pfsflags & PF_FORK) == 0)
 		td2->td_frame->tf_eflags &= ~PSL_T;
 
 	/*
 	 * Set registers for trampoline to user mode.  Leave space for the
 	 * return address on stack.  These are the kernel mode register values.
 	 */
 #ifdef PAE
 	pcb2->pcb_cr3 = vtophys(vmspace_pmap(p2->p_vmspace)->pm_pdpt);
 #else
 	pcb2->pcb_cr3 = vtophys(vmspace_pmap(p2->p_vmspace)->pm_pdir);
 #endif
 	pcb2->pcb_edi = 0;
 	pcb2->pcb_esi = (int)fork_return;	/* fork_trampoline argument */
 	pcb2->pcb_ebp = 0;
 	pcb2->pcb_esp = (int)td2->td_frame - sizeof(void *);
 	pcb2->pcb_ebx = (int)td2;		/* fork_trampoline argument */
 	pcb2->pcb_eip = (int)fork_trampoline;
 	pcb2->pcb_psl = PSL_KERNEL;		/* ints disabled */
 	/*-
 	 * pcb2->pcb_dr*:	cloned above.
 	 * pcb2->pcb_savefpu:	cloned above.
 	 * pcb2->pcb_flags:	cloned above.
 	 * pcb2->pcb_onfault:	cloned above (always NULL here?).
 	 * pcb2->pcb_gs:	cloned above.
 	 * pcb2->pcb_ext:	cleared below.
 	 */
 
 	/*
 	 * XXX don't copy the i/o pages.  this should probably be fixed.
 	 */
 	pcb2->pcb_ext = 0;
 
 	/* Copy the LDT, if necessary. */
 	mtx_lock_spin(&dt_lock);
 	if (mdp2->md_ldt != NULL) {
 		if (flags & RFMEM) {
 			mdp2->md_ldt->ldt_refcnt++;
 		} else {
 			mdp2->md_ldt = user_ldt_alloc(mdp2,
 			    mdp2->md_ldt->ldt_len);
 			if (mdp2->md_ldt == NULL)
 				panic("could not copy LDT");
 		}
 	}
 	mtx_unlock_spin(&dt_lock);
 
 	/* Setup to release spin count in fork_exit(). */
 	td2->td_md.md_spinlock_count = 1;
 	/*
 	 * XXX XEN need to check on PSL_USER is handled
 	 */
 	td2->td_md.md_saved_flags = PSL_KERNEL | PSL_I;
 	/*
 	 * Now, cpu_switch() can schedule the new process.
 	 * pcb_esp is loaded pointing to the cpu_switch() stack frame
 	 * containing the return address when exiting cpu_switch.
 	 * This will normally be to fork_trampoline(), which will have
 	 * %ebx loaded with the new proc's pointer.  fork_trampoline()
 	 * will set up a stack to call fork_return(p, frame); to complete
 	 * the return to user-mode.
 	 */
 }
 
 /*
  * Intercept the return address from a freshly forked process that has NOT
  * been scheduled yet.
  *
  * This is needed to make kernel threads stay in kernel mode.
  */
 void
 cpu_set_fork_handler(td, func, arg)
 	struct thread *td;
 	void (*func)(void *);
 	void *arg;
 {
 	/*
 	 * Note that the trap frame follows the args, so the function
 	 * is really called like this:  func(arg, frame);
 	 */
 	td->td_pcb->pcb_esi = (int) func;	/* function */
 	td->td_pcb->pcb_ebx = (int) arg;	/* first arg */
 }
 
 void
 cpu_exit(struct thread *td)
 {
 
 	/*
 	 * If this process has a custom LDT, release it.  Reset pc->pcb_gs
 	 * and %gs before we free it in case they refer to an LDT entry.
 	 */
 	mtx_lock_spin(&dt_lock);
 	if (td->td_proc->p_md.md_ldt) {
 		td->td_pcb->pcb_gs = _udatasel;
 		load_gs(_udatasel);
 		user_ldt_free(td);
 	} else
 		mtx_unlock_spin(&dt_lock);
 }
 
 void
 cpu_thread_exit(struct thread *td)
 {
 
 #ifdef DEV_NPX
 	critical_enter();
 	if (td == PCPU_GET(fpcurthread))
 		npxdrop();
 	critical_exit();
 #endif
 
 	/* Disable any hardware breakpoints. */
 	if (td->td_pcb->pcb_flags & PCB_DBREGS) {
 		reset_dbregs();
 		td->td_pcb->pcb_flags &= ~PCB_DBREGS;
 	}
 }
 
 void
 cpu_thread_clean(struct thread *td)
 {
 	struct pcb *pcb;
 
 	pcb = td->td_pcb; 
 	if (pcb->pcb_ext != NULL) {
 		/* if (pcb->pcb_ext->ext_refcount-- == 1) ?? */
 		/*
 		 * XXX do we need to move the TSS off the allocated pages
 		 * before freeing them?  (not done here)
 		 */
 		kmem_free(kernel_map, (vm_offset_t)pcb->pcb_ext,
 		    ctob(IOPAGES + 1));
 		pcb->pcb_ext = NULL;
 	}
 }
 
 void
 cpu_thread_swapin(struct thread *td)
 {
 }
 
 void
 cpu_thread_swapout(struct thread *td)
 {
 }
 
 void
 cpu_thread_alloc(struct thread *td)
 {
 
 	td->td_pcb = (struct pcb *)(td->td_kstack +
 	    td->td_kstack_pages * PAGE_SIZE) - 1;
 	td->td_frame = (struct trapframe *)((caddr_t)td->td_pcb - 16) - 1;
 	td->td_pcb->pcb_ext = NULL; 
 	td->td_pcb->pcb_save = &td->td_pcb->pcb_user_save;
 }
 
 void
 cpu_thread_free(struct thread *td)
 {
 
 	cpu_thread_clean(td);
 }
 
 void
 cpu_set_syscall_retval(struct thread *td, int error)
 {
 
 	switch (error) {
 	case 0:
 		td->td_frame->tf_eax = td->td_retval[0];
 		td->td_frame->tf_edx = td->td_retval[1];
 		td->td_frame->tf_eflags &= ~PSL_C;
 		break;
 
 	case ERESTART:
 		/*
 		 * Reconstruct pc, assuming lcall $X,y is 7 bytes, int
 		 * 0x80 is 2 bytes. We saved this in tf_err.
 		 */
 		td->td_frame->tf_eip -= td->td_frame->tf_err;
 		break;
 
 	case EJUSTRETURN:
 		break;
 
 	default:
 		if (td->td_proc->p_sysent->sv_errsize) {
 			if (error >= td->td_proc->p_sysent->sv_errsize)
 				error = -1;	/* XXX */
 			else
 				error = td->td_proc->p_sysent->sv_errtbl[error];
 		}
 		td->td_frame->tf_eax = error;
 		td->td_frame->tf_eflags |= PSL_C;
 		break;
 	}
 }
 
 /*
  * Initialize machine state (pcb and trap frame) for a new thread about to
  * upcall. Put enough state in the new thread's PCB to get it to go back to
  * userret(), where we can intercept it again to set the return (upcall)
  * address and stack, along with those from upcalls that are from other sources
  * such as those generated in thread_userret() itself.
  */
 void
 cpu_set_upcall(struct thread *td, struct thread *td0)
 {
 	struct pcb *pcb2;
 
 	/* Point the pcb to the top of the stack. */
 	pcb2 = td->td_pcb;
 
 	/*
 	 * Copy the upcall pcb.  This loads kernel regs.
 	 * Those not loaded individually below get their default
 	 * values here.
 	 */
 	bcopy(td0->td_pcb, pcb2, sizeof(*pcb2));
 	pcb2->pcb_flags &= ~(PCB_NPXINITDONE | PCB_NPXUSERINITDONE);
 	pcb2->pcb_save = &pcb2->pcb_user_save;
 
 	/*
 	 * Create a new fresh stack for the new thread.
 	 */
 	bcopy(td0->td_frame, td->td_frame, sizeof(struct trapframe));
 
 	/* If the current thread has the trap bit set (i.e. a debugger had
 	 * single stepped the process to the system call), we need to clear
 	 * the trap flag from the new frame. Otherwise, the new thread will
 	 * receive a (likely unexpected) SIGTRAP when it executes the first
 	 * instruction after returning to userland.
 	 */
 	td->td_frame->tf_eflags &= ~PSL_T;
 
 	/*
 	 * Set registers for trampoline to user mode.  Leave space for the
 	 * return address on stack.  These are the kernel mode register values.
 	 */
 	pcb2->pcb_edi = 0;
 	pcb2->pcb_esi = (int)fork_return;		    /* trampoline arg */
 	pcb2->pcb_ebp = 0;
 	pcb2->pcb_esp = (int)td->td_frame - sizeof(void *); /* trampoline arg */
 	pcb2->pcb_ebx = (int)td;			    /* trampoline arg */
 	pcb2->pcb_eip = (int)fork_trampoline;
 	pcb2->pcb_psl &= ~(PSL_I);	/* interrupts must be disabled */
 	pcb2->pcb_gs = rgs();
 	/*
 	 * If we didn't copy the pcb, we'd need to do the following registers:
 	 * pcb2->pcb_cr3:	cloned above.
 	 * pcb2->pcb_dr*:	cloned above.
 	 * pcb2->pcb_savefpu:	cloned above.
 	 * pcb2->pcb_flags:	cloned above.
 	 * pcb2->pcb_onfault:	cloned above (always NULL here?).
 	 * pcb2->pcb_gs:	cloned above.
 	 * pcb2->pcb_ext:	cleared below.
 	 */
 	pcb2->pcb_ext = NULL;
 
 	/* Setup to release spin count in fork_exit(). */
 	td->td_md.md_spinlock_count = 1;
 	td->td_md.md_saved_flags = PSL_KERNEL | PSL_I;
 }
 
 /*
  * Set the machine state for performing an upcall that has to
  * be done in thread_userret() so that those upcalls generated
  * in thread_userret() itself can be done as well.
  */
 void
 cpu_set_upcall_kse(struct thread *td, void (*entry)(void *), void *arg,
 	stack_t *stack)
 {
 
 	/* 
 	 * Do any extra cleaning that needs to be done.
 	 * The thread may have optional components
 	 * that are not present in a fresh thread.
 	 * This may be a recycled thread so make it look
 	 * as though it's newly allocated.
 	 */
 	cpu_thread_clean(td);
 
 	/*
 	 * Set the trap frame to point at the beginning of the uts
 	 * function.
 	 */
 	td->td_frame->tf_ebp = 0; 
 	td->td_frame->tf_esp =
 	    (((int)stack->ss_sp + stack->ss_size - 4) & ~0x0f) - 4;
 	td->td_frame->tf_eip = (int)entry;
 
 	/*
 	 * Pass the address of the mailbox for this kse to the uts
 	 * function as a parameter on the stack.
 	 */
 	suword((void *)(td->td_frame->tf_esp + sizeof(void *)),
 	    (int)arg);
 }
 
 int
 cpu_set_user_tls(struct thread *td, void *tls_base)
 {
 	struct segment_descriptor sd;
 	uint32_t base;
 
 	/*
 	 * Construct a descriptor and store it in the pcb for
 	 * the next context switch.  Also store it in the gdt
 	 * so that the load of tf_fs into %fs will activate it
 	 * at return to userland.
 	 */
 	base = (uint32_t)tls_base;
 	sd.sd_lobase = base & 0xffffff;
 	sd.sd_hibase = (base >> 24) & 0xff;
 	sd.sd_lolimit = 0xffff;	/* 4GB limit, wraps around */
 	sd.sd_hilimit = 0xf;
 	sd.sd_type  = SDT_MEMRWA;
 	sd.sd_dpl   = SEL_UPL;
 	sd.sd_p     = 1;
 	sd.sd_xx    = 0;
 	sd.sd_def32 = 1;
 	sd.sd_gran  = 1;
 	critical_enter();
 	/* set %gs */
 	td->td_pcb->pcb_gsd = sd;
 	if (td == curthread) {
 		PCPU_GET(fsgs_gdt)[1] = sd;
 		load_gs(GSEL(GUGS_SEL, SEL_UPL));
 	}
 	critical_exit();
 	return (0);
 }
 
 /*
  * Convert kernel VA to physical address
  */
 vm_paddr_t
 kvtop(void *addr)
 {
 	vm_paddr_t pa;
 
 	pa = pmap_kextract((vm_offset_t)addr);
 	if (pa == 0)
 		panic("kvtop: zero page frame");
 	return (pa);
 }
 
 #ifdef SMP
 static void
 cpu_reset_proxy()
 {
+	cpuset_t tcrp;
 
 	cpu_reset_proxy_active = 1;
 	while (cpu_reset_proxy_active == 1)
 		;	/* Wait for other cpu to see that we've started */
-	stop_cpus((1<<cpu_reset_proxyid));
+	CPU_SETOF(cpu_reset_proxyid, &tcrp);
+	stop_cpus(tcrp);
 	printf("cpu_reset_proxy: Stopped CPU %d\n", cpu_reset_proxyid);
 	DELAY(1000000);
 	cpu_reset_real();
 }
 #endif
 
 void
 cpu_reset()
 {
 #ifdef XBOX
 	if (arch_i386_is_xbox) {
 		/* Kick the PIC16L, it can reboot the box */
 		pic16l_reboot();
 		for (;;);
 	}
 #endif
 
 #ifdef SMP
-	cpumask_t map;
+	cpuset_t map;
 	u_int cnt;
 
 	if (smp_active) {
-		map = PCPU_GET(other_cpus) & ~stopped_cpus;
-		if (map != 0) {
+		sched_pin();
+		map = PCPU_GET(other_cpus);
+		CPU_NAND(&map, &stopped_cpus);
+		if (!CPU_EMPTY(&map)) {
 			printf("cpu_reset: Stopping other CPUs\n");
 			stop_cpus(map);
 		}
 
 		if (PCPU_GET(cpuid) != 0) {
 			cpu_reset_proxyid = PCPU_GET(cpuid);
+			sched_unpin();
 			cpustop_restartfunc = cpu_reset_proxy;
 			cpu_reset_proxy_active = 0;
 			printf("cpu_reset: Restarting BSP\n");
 
 			/* Restart CPU #0. */
 			/* XXX: restart_cpus(1 << 0); */
-			atomic_store_rel_int(&started_cpus, (1 << 0));
+			CPU_SETOF(0, &started_cpus);
+			wmb();
 
 			cnt = 0;
 			while (cpu_reset_proxy_active == 0 && cnt < 10000000)
 				cnt++;	/* Wait for BSP to announce restart */
 			if (cpu_reset_proxy_active == 0)
 				printf("cpu_reset: Failed to restart BSP\n");
 			enable_intr();
 			cpu_reset_proxy_active = 2;
 
 			while (1);
 			/* NOTREACHED */
-		}
+		} else
+			sched_unpin();
 
 		DELAY(1000000);
 	}
 #endif
 	cpu_reset_real();
 	/* NOTREACHED */
 }
 
 static void
 cpu_reset_real()
 {
 	struct region_descriptor null_idt;
 #ifndef PC98
 	int b;
 #endif
 
 	disable_intr();
 #ifdef XEN
 	if (smp_processor_id() == 0)
 		HYPERVISOR_shutdown(SHUTDOWN_reboot);
 	else
 		HYPERVISOR_shutdown(SHUTDOWN_poweroff);
 #endif 
 #ifdef CPU_ELAN
 	if (elan_mmcr != NULL)
 		elan_mmcr->RESCFG = 1;
 #endif
 
 	if (cpu == CPU_GEODE1100) {
 		/* Attempt Geode's own reset */
 		outl(0xcf8, 0x80009044ul);
 		outl(0xcfc, 0xf);
 	}
 
 #ifdef PC98
 	/*
 	 * Attempt to do a CPU reset via CPU reset port.
 	 */
 	if ((inb(0x35) & 0xa0) != 0xa0) {
 		outb(0x37, 0x0f);		/* SHUT0 = 0. */
 		outb(0x37, 0x0b);		/* SHUT1 = 0. */
 	}
 	outb(0xf0, 0x00);		/* Reset. */
 #else
 #if !defined(BROKEN_KEYBOARD_RESET)
 	/*
 	 * Attempt to do a CPU reset via the keyboard controller,
 	 * do not turn off GateA20, as any machine that fails
 	 * to do the reset here would then end up in no man's land.
 	 */
 	outb(IO_KBD + 4, 0xFE);
 	DELAY(500000);	/* wait 0.5 sec to see if that did it */
 #endif
 
 	/*
 	 * Attempt to force a reset via the Reset Control register at
 	 * I/O port 0xcf9.  Bit 2 forces a system reset when it
 	 * transitions from 0 to 1.  Bit 1 selects the type of reset
 	 * to attempt: 0 selects a "soft" reset, and 1 selects a
 	 * "hard" reset.  We try a "hard" reset.  The first write sets
 	 * bit 1 to select a "hard" reset and clears bit 2.  The
 	 * second write forces a 0 -> 1 transition in bit 2 to trigger
 	 * a reset.
 	 */
 	outb(0xcf9, 0x2);
 	outb(0xcf9, 0x6);
 	DELAY(500000);  /* wait 0.5 sec to see if that did it */
 
 	/*
 	 * Attempt to force a reset via the Fast A20 and Init register
 	 * at I/O port 0x92.  Bit 1 serves as an alternate A20 gate.
 	 * Bit 0 asserts INIT# when set to 1.  We are careful to only
 	 * preserve bit 1 while setting bit 0.  We also must clear bit
 	 * 0 before setting it if it isn't already clear.
 	 */
 	b = inb(0x92);
 	if (b != 0xff) {
 		if ((b & 0x1) != 0)
 			outb(0x92, b & 0xfe);
 		outb(0x92, b | 0x1);
 		DELAY(500000);  /* wait 0.5 sec to see if that did it */
 	}
 #endif /* PC98 */
 
 	printf("No known reset method worked, attempting CPU shutdown\n");
 	DELAY(1000000); /* wait 1 sec for printf to complete */
 
 	/* Wipe the IDT. */
 	null_idt.rd_limit = 0;
 	null_idt.rd_base = 0;
 	lidt(&null_idt);
 
 	/* "good night, sweet prince .... <THUNK!>" */
 	breakpoint();
 
 	/* NOTREACHED */
 	while(1);
 }
 
 /*
  * Allocate a pool of sf_bufs (sendfile(2) or "super-fast" if you prefer. :-))
  */
 static void
 sf_buf_init(void *arg)
 {
 	struct sf_buf *sf_bufs;
 	vm_offset_t sf_base;
 	int i;
 
 	nsfbufs = NSFBUFS;
 	TUNABLE_INT_FETCH("kern.ipc.nsfbufs", &nsfbufs);
 
 	sf_buf_active = hashinit(nsfbufs, M_TEMP, &sf_buf_hashmask);
 	TAILQ_INIT(&sf_buf_freelist);
 	sf_base = kmem_alloc_nofault(kernel_map, nsfbufs * PAGE_SIZE);
 	sf_bufs = malloc(nsfbufs * sizeof(struct sf_buf), M_TEMP,
 	    M_NOWAIT | M_ZERO);
 	for (i = 0; i < nsfbufs; i++) {
 		sf_bufs[i].kva = sf_base + i * PAGE_SIZE;
 		TAILQ_INSERT_TAIL(&sf_buf_freelist, &sf_bufs[i], free_entry);
 	}
 	sf_buf_alloc_want = 0;
 	mtx_init(&sf_buf_lock, "sf_buf", NULL, MTX_DEF);
 }
 
 /*
  * Invalidate the cache lines that may belong to the page, if a
  * (possibly old) mapping of the page by an sf_buf exists.  Returns
  * TRUE when a mapping was found and the cache was invalidated.
  */
 boolean_t
 sf_buf_invalidate_cache(vm_page_t m)
 {
 	struct sf_head *hash_list;
 	struct sf_buf *sf;
 	boolean_t ret;
 
 	hash_list = &sf_buf_active[SF_BUF_HASH(m)];
 	ret = FALSE;
 	mtx_lock(&sf_buf_lock);
 	LIST_FOREACH(sf, hash_list, list_entry) {
 		if (sf->m == m) {
 			/*
 			 * Use pmap_qenter to update the pte for
 			 * existing mapping, in particular, the PAT
 			 * settings are recalculated.
 			 */
 			pmap_qenter(sf->kva, &m, 1);
 			pmap_invalidate_cache_range(sf->kva, sf->kva +
 			    PAGE_SIZE);
 			ret = TRUE;
 			break;
 		}
 	}
 	mtx_unlock(&sf_buf_lock);
 	return (ret);
 }
 
 /*
  * Get an sf_buf from the freelist.  May block if none are available.
  */
 struct sf_buf *
 sf_buf_alloc(struct vm_page *m, int flags)
 {
 	pt_entry_t opte, *ptep;
 	struct sf_head *hash_list;
 	struct sf_buf *sf;
 #ifdef SMP
-	cpumask_t cpumask, other_cpus;
+	cpuset_t cpumask, other_cpus;
 #endif
 	int error;
 
 	KASSERT(curthread->td_pinned > 0 || (flags & SFB_CPUPRIVATE) == 0,
 	    ("sf_buf_alloc(SFB_CPUPRIVATE): curthread not pinned"));
 	hash_list = &sf_buf_active[SF_BUF_HASH(m)];
 	mtx_lock(&sf_buf_lock);
 	LIST_FOREACH(sf, hash_list, list_entry) {
 		if (sf->m == m) {
 			sf->ref_count++;
 			if (sf->ref_count == 1) {
 				TAILQ_REMOVE(&sf_buf_freelist, sf, free_entry);
 				nsfbufsused++;
 				nsfbufspeak = imax(nsfbufspeak, nsfbufsused);
 			}
 #ifdef SMP
 			goto shootdown;	
 #else
 			goto done;
 #endif
 		}
 	}
 	while ((sf = TAILQ_FIRST(&sf_buf_freelist)) == NULL) {
 		if (flags & SFB_NOWAIT)
 			goto done;
 		sf_buf_alloc_want++;
 		mbstat.sf_allocwait++;
 		error = msleep(&sf_buf_freelist, &sf_buf_lock,
 		    (flags & SFB_CATCH) ? PCATCH | PVM : PVM, "sfbufa", 0);
 		sf_buf_alloc_want--;
 
 		/*
 		 * If we got a signal, don't risk going back to sleep. 
 		 */
 		if (error)
 			goto done;
 	}
 	TAILQ_REMOVE(&sf_buf_freelist, sf, free_entry);
 	if (sf->m != NULL)
 		LIST_REMOVE(sf, list_entry);
 	LIST_INSERT_HEAD(hash_list, sf, list_entry);
 	sf->ref_count = 1;
 	sf->m = m;
 	nsfbufsused++;
 	nsfbufspeak = imax(nsfbufspeak, nsfbufsused);
 
 	/*
 	 * Update the sf_buf's virtual-to-physical mapping, flushing the
 	 * virtual address from the TLB.  Since the reference count for 
 	 * the sf_buf's old mapping was zero, that mapping is not 
 	 * currently in use.  Consequently, there is no need to exchange 
 	 * the old and new PTEs atomically, even under PAE.
 	 */
 	ptep = vtopte(sf->kva);
 	opte = *ptep;
 #ifdef XEN
        PT_SET_MA(sf->kva, xpmap_ptom(VM_PAGE_TO_PHYS(m)) | pgeflag
 	   | PG_RW | PG_V | pmap_cache_bits(m->md.pat_mode, 0));
 #else
 	*ptep = VM_PAGE_TO_PHYS(m) | pgeflag | PG_RW | PG_V |
 	    pmap_cache_bits(m->md.pat_mode, 0);
 #endif
 
 	/*
 	 * Avoid unnecessary TLB invalidations: If the sf_buf's old
 	 * virtual-to-physical mapping was not used, then any processor
 	 * that has invalidated the sf_buf's virtual address from its TLB
 	 * since the last used mapping need not invalidate again.
 	 */
 #ifdef SMP
 	if ((opte & (PG_V | PG_A)) ==  (PG_V | PG_A))
-		sf->cpumask = 0;
+		CPU_ZERO(&sf->cpumask);
 shootdown:
 	sched_pin();
 	cpumask = PCPU_GET(cpumask);
-	if ((sf->cpumask & cpumask) == 0) {
-		sf->cpumask |= cpumask;
+	if (!CPU_OVERLAP(&cpumask, &sf->cpumask)) {
+		CPU_OR(&sf->cpumask, &cpumask);
 		invlpg(sf->kva);
 	}
 	if ((flags & SFB_CPUPRIVATE) == 0) {
-		other_cpus = PCPU_GET(other_cpus) & ~sf->cpumask;
-		if (other_cpus != 0) {
-			sf->cpumask |= other_cpus;
+		other_cpus = PCPU_GET(other_cpus);
+		CPU_NAND(&other_cpus, &sf->cpumask);
+		if (!CPU_EMPTY(&other_cpus)) {
+			CPU_OR(&sf->cpumask, &other_cpus);
 			smp_masked_invlpg(other_cpus, sf->kva);
 		}
 	}
-	sched_unpin();	
+	sched_unpin();
 #else
 	if ((opte & (PG_V | PG_A)) ==  (PG_V | PG_A))
 		pmap_invalidate_page(kernel_pmap, sf->kva);
 #endif
 done:
 	mtx_unlock(&sf_buf_lock);
 	return (sf);
 }
 
 /*
  * Remove a reference from the given sf_buf, adding it to the free
  * list when its reference count reaches zero.  A freed sf_buf still,
  * however, retains its virtual-to-physical mapping until it is
  * recycled or reactivated by sf_buf_alloc(9).
  */
 void
 sf_buf_free(struct sf_buf *sf)
 {
 
 	mtx_lock(&sf_buf_lock);
 	sf->ref_count--;
 	if (sf->ref_count == 0) {
 		TAILQ_INSERT_TAIL(&sf_buf_freelist, sf, free_entry);
 		nsfbufsused--;
 #ifdef XEN
 /*
  * Xen doesn't like having dangling R/W mappings
  */
 		pmap_qremove(sf->kva, 1);
 		sf->m = NULL;
 		LIST_REMOVE(sf, list_entry);
 #endif
 		if (sf_buf_alloc_want > 0)
 			wakeup(&sf_buf_freelist);
 	}
 	mtx_unlock(&sf_buf_lock);
 }
 
 /*
  * Software interrupt handler for queued VM system processing.
  */   
 void  
 swi_vm(void *dummy) 
 {     
 	if (busdma_swi_pending != 0)
 		busdma_swi();
 }
 
 /*
  * Tell whether this address is in some physical memory region.
  * Currently used by the kernel coredump code in order to avoid
  * dumping the ``ISA memory hole'' which could cause indefinite hangs,
  * or other unpredictable behaviour.
  */
 
 int
 is_physical_memory(vm_paddr_t addr)
 {
 
 #ifdef DEV_ISA
 	/* The ISA ``memory hole''. */
 	if (addr >= 0xa0000 && addr < 0x100000)
 		return 0;
 #endif
 
 	/*
 	 * stuff other tests for known memory-mapped devices (PCI?)
 	 * here
 	 */
 
 	return 1;
 }
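Note on the conversion pattern in the vm_machdep.c hunk above: every
integer expression on a cpumask_t (for example
"map = PCPU_GET(other_cpus) & ~stopped_cpus; if (map != 0)") becomes a
copy followed by CPU_NAND() and a CPU_EMPTY() test, because a cpuset_t
is a multi-word bit set and can no longer be used with plain C operators.
The sketch below is a userland illustration only. The demo_* names and
DEMO_MAXCPU are hypothetical stand-ins written for this note, not the
kernel's cpuset_t or CPU_*() macros, and the semantics are assumed from
how the diff uses them (two-argument NAND meaning "dst &= ~src", SETOF
meaning "clear the set, then set one bit").

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for cpuset_t and CPU_*(); not the kernel's code. */
#define DEMO_MAXCPU	128
#define DEMO_WORD_BITS	(sizeof(unsigned long) * 8)
#define DEMO_WORDS	((DEMO_MAXCPU + DEMO_WORD_BITS - 1) / DEMO_WORD_BITS)

typedef struct {
	unsigned long bits[DEMO_WORDS];
} demo_cpuset_t;

/* Like CPU_ZERO(): clear every bit. */
static inline void
demo_zero(demo_cpuset_t *s)
{
	memset(s, 0, sizeof(*s));
}

/* Like CPU_SETOF(): make the set contain exactly one CPU. */
static inline void
demo_setof(unsigned cpu, demo_cpuset_t *s)
{
	assert(cpu < DEMO_MAXCPU);
	demo_zero(s);
	s->bits[cpu / DEMO_WORD_BITS] |= 1UL << (cpu % DEMO_WORD_BITS);
}

/* Like CPU_OR(): d |= s, word by word. */
static inline void
demo_or(demo_cpuset_t *d, const demo_cpuset_t *s)
{
	for (size_t i = 0; i < DEMO_WORDS; i++)
		d->bits[i] |= s->bits[i];
}

/* Like the two-argument CPU_NAND() used in the diff: d &= ~s. */
static inline void
demo_nand(demo_cpuset_t *d, const demo_cpuset_t *s)
{
	for (size_t i = 0; i < DEMO_WORDS; i++)
		d->bits[i] &= ~s->bits[i];
}

/* Like CPU_EMPTY(): true when no bit is set. */
static inline bool
demo_empty(const demo_cpuset_t *s)
{
	for (size_t i = 0; i < DEMO_WORDS; i++)
		if (s->bits[i] != 0)
			return (false);
	return (true);
}

/* Like CPU_OVERLAP(): true when the two sets share at least one CPU. */
static inline bool
demo_overlap(const demo_cpuset_t *a, const demo_cpuset_t *b)
{
	for (size_t i = 0; i < DEMO_WORDS; i++)
		if ((a->bits[i] & b->bits[i]) != 0)
			return (true);
	return (false);
}

/* Like CPU_ISSET(): test one CPU. */
static inline bool
demo_isset(unsigned cpu, const demo_cpuset_t *s)
{
	assert(cpu < DEMO_MAXCPU);
	return ((s->bits[cpu / DEMO_WORD_BITS] &
	    (1UL << (cpu % DEMO_WORD_BITS))) != 0);
}

/*
 * Mirrors the cpu_reset() rewrite: the old code computed
 * "other & ~stopped" and compared it with 0; here the result is left
 * in *map and the return value says whether any CPU remains to stop.
 */
static inline bool
demo_cpus_to_stop(const demo_cpuset_t *other, const demo_cpuset_t *stopped,
    demo_cpuset_t *map)
{
	*map = *other;
	demo_nand(map, stopped);
	return (!demo_empty(map));
}
```

With other = {1, 70} and stopped = {1}, demo_cpus_to_stop() leaves
map = {70}.  CPU 70 is chosen because it cannot be represented in the
32-bit unsigned int that i386 used for cpumask_t.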
Index: head/sys/i386/include/_types.h
===================================================================
--- head/sys/i386/include/_types.h	(revision 222812)
+++ head/sys/i386/include/_types.h	(revision 222813)
@@ -1,129 +1,128 @@
 /*-
  * Copyright (c) 2002 Mike Barcroft <mike@FreeBSD.org>
  * Copyright (c) 1990, 1993
  *	The Regents of the University of California.  All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
  * 1. Redistributions of source code must retain the above copyright
  *    notice, this list of conditions and the following disclaimer.
  * 2. Redistributions in binary form must reproduce the above copyright
  *    notice, this list of conditions and the following disclaimer in the
  *    documentation and/or other materials provided with the distribution.
  * 3. All advertising materials mentioning features or use of this software
  *    must display the following acknowledgement:
  *	This product includes software developed by the University of
  *	California, Berkeley and its contributors.
  * 4. Neither the name of the University nor the names of its contributors
  *    may be used to endorse or promote products derived from this software
  *    without specific prior written permission.
  *
  * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
  * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
  * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
  * ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
  * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
  * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
  * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
  * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
  * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
  * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
  * SUCH DAMAGE.
  *
  *	From: @(#)ansi.h	8.2 (Berkeley) 1/4/94
  *	From: @(#)types.h	8.3 (Berkeley) 1/5/94
  * $FreeBSD$
  */
 
 #ifndef _MACHINE__TYPES_H_
 #define	_MACHINE__TYPES_H_
 
 #ifndef _SYS_CDEFS_H_
 #error this file needs sys/cdefs.h as a prerequisite
 #endif
 
 #define __NO_STRICT_ALIGNMENT
 
 /*
  * Basic types upon which most other types are built.
  */
 typedef	__signed char		__int8_t;
 typedef	unsigned char		__uint8_t;
 typedef	short			__int16_t;
 typedef	unsigned short		__uint16_t;
 typedef	int			__int32_t;
 typedef	unsigned int		__uint32_t;
 #ifndef lint
 __extension__
 #endif
 /* LONGLONG */
 typedef	long long		__int64_t;
 #ifndef lint
 __extension__
 #endif
 /* LONGLONG */
 typedef	unsigned long long	__uint64_t;
 
 /*
  * Standard type definitions.
  */
 typedef	unsigned long	__clock_t;		/* clock()... */
-typedef	unsigned int	__cpumask_t;
 typedef	__int32_t	__critical_t;
 typedef	long double	__double_t;
 typedef	long double	__float_t;
 typedef	__int32_t	__intfptr_t;
 typedef	__int64_t	__intmax_t;
 typedef	__int32_t	__intptr_t;
 typedef	__int32_t	__int_fast8_t;
 typedef	__int32_t	__int_fast16_t;
 typedef	__int32_t	__int_fast32_t;
 typedef	__int64_t	__int_fast64_t;
 typedef	__int8_t	__int_least8_t;
 typedef	__int16_t	__int_least16_t;
 typedef	__int32_t	__int_least32_t;
 typedef	__int64_t	__int_least64_t;
 typedef	__int32_t	__ptrdiff_t;		/* ptr1 - ptr2 */
 typedef	__int32_t	__register_t;
 typedef	__int32_t	__segsz_t;		/* segment size (in pages) */
 typedef	__uint32_t	__size_t;		/* sizeof() */
 typedef	__int32_t	__ssize_t;		/* byte count or error */
 typedef	__int32_t	__time_t;		/* time()... */
 typedef	__uint32_t	__uintfptr_t;
 typedef	__uint64_t	__uintmax_t;
 typedef	__uint32_t	__uintptr_t;
 typedef	__uint32_t	__uint_fast8_t;
 typedef	__uint32_t	__uint_fast16_t;
 typedef	__uint32_t	__uint_fast32_t;
 typedef	__uint64_t	__uint_fast64_t;
 typedef	__uint8_t	__uint_least8_t;
 typedef	__uint16_t	__uint_least16_t;
 typedef	__uint32_t	__uint_least32_t;
 typedef	__uint64_t	__uint_least64_t;
 typedef	__uint32_t	__u_register_t;
 typedef	__uint32_t	__vm_offset_t;
 typedef	__int64_t	__vm_ooffset_t;
 #ifdef PAE
 typedef	__uint64_t	__vm_paddr_t;
 #else
 typedef	__uint32_t	__vm_paddr_t;
 #endif
 typedef	__uint64_t	__vm_pindex_t;
 typedef	__uint32_t	__vm_size_t;
 
 /*
  * Unusual type definitions.
  */
 #ifdef __GNUCLIKE_BUILTIN_VARARGS
 typedef __builtin_va_list	__va_list;	/* internally known to gcc */
 #else
 typedef	char *			__va_list;
 #endif /* __GNUCLIKE_BUILTIN_VARARGS */
 #if defined(__GNUC_VA_LIST_COMPATIBILITY) && !defined(__GNUC_VA_LIST) \
     && !defined(__NO_GNUC_VA_LIST)
 #define __GNUC_VA_LIST
 typedef __va_list		__gnuc_va_list;	/* compatibility w/GNU headers*/
 #endif
 
 #endif /* !_MACHINE__TYPES_H_ */
Index: head/sys/i386/include/pmap.h
===================================================================
--- head/sys/i386/include/pmap.h	(revision 222812)
+++ head/sys/i386/include/pmap.h	(revision 222813)
@@ -1,532 +1,533 @@
 /*-
  * Copyright (c) 1991 Regents of the University of California.
  * All rights reserved.
  *
  * This code is derived from software contributed to Berkeley by
  * the Systems Programming Group of the University of Utah Computer
  * Science Department and William Jolitz of UUNET Technologies Inc.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
  * 1. Redistributions of source code must retain the above copyright
  *    notice, this list of conditions and the following disclaimer.
  * 2. Redistributions in binary form must reproduce the above copyright
  *    notice, this list of conditions and the following disclaimer in the
  *    documentation and/or other materials provided with the distribution.
  * 4. Neither the name of the University nor the names of its contributors
  *    may be used to endorse or promote products derived from this software
  *    without specific prior written permission.
  *
  * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
  * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
  * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
  * ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
  * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
  * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
  * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
  * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
  * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
  * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
  * SUCH DAMAGE.
  *
  * Derived from hp300 version by Mike Hibler, this version by William
  * Jolitz uses a recursive map [a pde points to the page directory] to
  * map the page tables using the pagetables themselves. This is done to
  * reduce the impact on kernel virtual memory for lots of sparse address
  * space, and to reduce the cost of memory to each process.
  *
  *	from: hp300: @(#)pmap.h	7.2 (Berkeley) 12/16/90
  *	from: @(#)pmap.h	7.4 (Berkeley) 5/12/91
  * $FreeBSD$
  */
 
 #ifndef _MACHINE_PMAP_H_
 #define	_MACHINE_PMAP_H_
 
 /*
  * Page-directory and page-table entries follow this format, with a few
  * of the fields not present here and there, depending on a lot of things.
  */
 				/* ---- Intel Nomenclature ---- */
 #define	PG_V		0x001	/* P	Valid			*/
 #define PG_RW		0x002	/* R/W	Read/Write		*/
 #define PG_U		0x004	/* U/S  User/Supervisor		*/
 #define	PG_NC_PWT	0x008	/* PWT	Write through		*/
 #define	PG_NC_PCD	0x010	/* PCD	Cache disable		*/
 #define PG_A		0x020	/* A	Accessed		*/
 #define	PG_M		0x040	/* D	Dirty			*/
 #define	PG_PS		0x080	/* PS	Page size (0=4k,1=4M)	*/
 #define	PG_PTE_PAT	0x080	/* PAT	PAT index		*/
 #define	PG_G		0x100	/* G	Global			*/
 #define	PG_AVAIL1	0x200	/*    /	Available for system	*/
 #define	PG_AVAIL2	0x400	/*   <	programmers use		*/
 #define	PG_AVAIL3	0x800	/*    \				*/
 #define	PG_PDE_PAT	0x1000	/* PAT	PAT index		*/
 #ifdef PAE
 #define	PG_NX		(1ull<<63) /* No-execute */
 #endif
 
 
 /* Our various interpretations of the above */
 #define PG_W		PG_AVAIL1	/* "Wired" pseudoflag */
 #define	PG_MANAGED	PG_AVAIL2
 #ifdef PAE
 #define	PG_FRAME	(0x000ffffffffff000ull)
 #define	PG_PS_FRAME	(0x000fffffffe00000ull)
 #else
 #define	PG_FRAME	(~PAGE_MASK)
 #define	PG_PS_FRAME	(0xffc00000)
 #endif
 #define	PG_PROT		(PG_RW|PG_U)	/* all protection bits . */
 #define PG_N		(PG_NC_PWT|PG_NC_PCD)	/* Non-cacheable */
 
 /* Page level cache control fields used to determine the PAT type */
 #define PG_PDE_CACHE	(PG_PDE_PAT | PG_NC_PWT | PG_NC_PCD)
 #define PG_PTE_CACHE	(PG_PTE_PAT | PG_NC_PWT | PG_NC_PCD)
 
 /*
  * Promotion to a 2 or 4MB (PDE) page mapping requires that the corresponding
  * 4KB (PTE) page mappings have identical settings for the following fields:
  */
 #define PG_PTE_PROMOTE	(PG_MANAGED | PG_W | PG_G | PG_PTE_PAT | \
 	    PG_M | PG_A | PG_NC_PCD | PG_NC_PWT | PG_U | PG_RW | PG_V)
 
 /*
  * Page Protection Exception bits
  */
 
 #define PGEX_P		0x01	/* Protection violation vs. not present */
 #define PGEX_W		0x02	/* during a Write cycle */
 #define PGEX_U		0x04	/* access from User mode (UPL) */
 #define PGEX_RSV	0x08	/* reserved PTE field is non-zero */
 #define PGEX_I		0x10	/* during an instruction fetch */
 
 /*
  * Size of Kernel address space.  This is the number of page table pages
  * (4MB each) to use for the kernel.  256 pages == 1 Gigabyte.
  * This **MUST** be a multiple of 4 (eg: 252, 256, 260, etc).
  * For PAE, the page table page unit size is 2MB.  This means that 512 pages
  * is 1 Gigabyte.  Double everything.  It must be a multiple of 8 for PAE.
  */
 #ifndef KVA_PAGES
 #ifdef PAE
 #define KVA_PAGES	512
 #else
 #define KVA_PAGES	256
 #endif
 #endif
 
 /*
  * Pte related macros
  */
 #define VADDR(pdi, pti) ((vm_offset_t)(((pdi)<<PDRSHIFT)|((pti)<<PAGE_SHIFT)))
 
 /* Initial number of kernel page tables. */
 #ifndef NKPT
 #ifdef PAE
 /* 152 page tables needed to map 16G (76B "struct vm_page", 2M page tables). */
 #define	NKPT		240
 #else
 /* 18 page tables needed to map 4G (72B "struct vm_page", 4M page tables). */
 #define	NKPT		30
 #endif
 #endif
 
 #ifndef NKPDE
 #define NKPDE	(KVA_PAGES)	/* number of page tables/pde's */
 #endif
 
 /*
  * The *PTDI values control the layout of virtual memory
  *
  * XXX This works for now, but I am not real happy with it, I'll fix it
  * right after I fix locore.s and the magic 28K hole
  */
 #define	KPTDI		(NPDEPTD-NKPDE)	/* start of kernel virtual pde's */
 #define	PTDPTDI		(KPTDI-NPGPTD)	/* ptd entry that points to ptd! */
 
 /*
  * XXX doesn't really belong here I guess...
  */
 #define ISA_HOLE_START    0xa0000
 #define ISA_HOLE_LENGTH (0x100000-ISA_HOLE_START)
 
 #ifndef LOCORE
 
 #include <sys/queue.h>
+#include <sys/_cpuset.h>
 #include <sys/_lock.h>
 #include <sys/_mutex.h>
 
 #ifdef PAE
 
 typedef uint64_t pdpt_entry_t;
 typedef uint64_t pd_entry_t;
 typedef uint64_t pt_entry_t;
 
 #define	PTESHIFT	(3)
 #define	PDESHIFT	(3)
 
 #else
 
 typedef uint32_t pd_entry_t;
 typedef uint32_t pt_entry_t;
 
 #define	PTESHIFT	(2)
 #define	PDESHIFT	(2)
 
 #endif
 
 /*
  * Address of current address space page table maps and directories.
  */
 #ifdef _KERNEL
 extern pt_entry_t PTmap[];
 extern pd_entry_t PTD[];
 extern pd_entry_t PTDpde[];
 
 #ifdef PAE
 extern pdpt_entry_t *IdlePDPT;
 #endif
 extern pd_entry_t *IdlePTD;	/* physical address of "Idle" state directory */
 
 /*
  * Translate a virtual address to the kernel virtual address of its page table
  * entry (PTE).  This can be used recursively.  If the address of a PTE as
  * previously returned by this macro is itself given as the argument, then the
  * address of the page directory entry (PDE) that maps the PTE will be
  * returned.
  *
  * This macro may be used before pmap_bootstrap() is called.
  */
 #define	vtopte(va)	(PTmap + i386_btop(va))
 
 /*
  * Translate a virtual address to its physical address.
  *
  * This macro may be used before pmap_bootstrap() is called.
  */
 #define	vtophys(va)	pmap_kextract((vm_offset_t)(va))
 
 #if defined(XEN)
 #include <sys/param.h>
 #include <machine/xen/xen-os.h>
 #include <machine/xen/xenvar.h>
 #include <machine/xen/xenpmap.h>
 
 extern pt_entry_t pg_nx;
 
 #define PG_KERNEL  (PG_V | PG_A | PG_RW | PG_M)
 
 #define MACH_TO_VM_PAGE(ma) PHYS_TO_VM_PAGE(xpmap_mtop((ma)))
 #define VM_PAGE_TO_MACH(m) xpmap_ptom(VM_PAGE_TO_PHYS((m)))
 
 #define VTOM(va) xpmap_ptom(VTOP(va))
 
 static __inline vm_paddr_t
 pmap_kextract_ma(vm_offset_t va)
 {
         vm_paddr_t ma;
         if ((ma = PTD[va >> PDRSHIFT]) & PG_PS) {
                 ma = (ma & ~(NBPDR - 1)) | (va & (NBPDR - 1));
         } else {
                 ma = (*vtopte(va) & PG_FRAME) | (va & PAGE_MASK);
         }
         return ma;
 }
 
 static __inline vm_paddr_t
 pmap_kextract(vm_offset_t va)
 {
         return xpmap_mtop(pmap_kextract_ma(va));
 }
 #define vtomach(va)     pmap_kextract_ma(((vm_offset_t) (va)))
 
 vm_paddr_t pmap_extract_ma(struct pmap *pmap, vm_offset_t va);
 
 void    pmap_kenter_ma(vm_offset_t va, vm_paddr_t pa);
 void    pmap_map_readonly(struct pmap *pmap, vm_offset_t va, int len);
 void    pmap_map_readwrite(struct pmap *pmap, vm_offset_t va, int len);
 
 static __inline pt_entry_t
 pte_load_store(pt_entry_t *ptep, pt_entry_t v)
 {
 	pt_entry_t r;
 
 	r = *ptep;
 	PT_SET_VA(ptep, v, TRUE);
 	return (r);
 }
 
 static __inline pt_entry_t
 pte_load_store_ma(pt_entry_t *ptep, pt_entry_t v)
 {
 	pt_entry_t r;
 
 	r = *ptep;
 	PT_SET_VA_MA(ptep, v, TRUE);
 	return (r);
 }
 
 #define	pte_load_clear(ptep)	pte_load_store((ptep), (pt_entry_t)0ULL)
 
 #define	pte_store(ptep, pte)	pte_load_store((ptep), (pt_entry_t)pte)
 #define	pte_store_ma(ptep, pte)	pte_load_store_ma((ptep), (pt_entry_t)pte)
 #define	pde_store_ma(ptep, pte)	pte_load_store_ma((ptep), (pt_entry_t)pte)
 
 #elif !defined(XEN)
 
 /*
  * KPTmap is a linear mapping of the kernel page table.  It differs from the
  * recursive mapping in two ways: (1) it only provides access to kernel page
  * table pages, and not user page table pages, and (2) it provides access to
  * a kernel page table page after the corresponding virtual addresses have
  * been promoted to a 2/4MB page mapping.
  *
  * KPTmap is first initialized by locore to support just NKPT page table
  * pages.  Later, it is reinitialized by pmap_bootstrap() to allow for
  * expansion of the kernel page table.
  */
 extern pt_entry_t *KPTmap;
 
 /*
  * Extract from the kernel page table the physical address that is mapped by
  * the given virtual address "va".
  *
  * This function may be used before pmap_bootstrap() is called.
  */
 static __inline vm_paddr_t
 pmap_kextract(vm_offset_t va)
 {
 	vm_paddr_t pa;
 
 	if ((pa = PTD[va >> PDRSHIFT]) & PG_PS) {
 		pa = (pa & PG_PS_FRAME) | (va & PDRMASK);
 	} else {
 		/*
 		 * Beware of a concurrent promotion that changes the PDE at
 		 * this point!  For example, vtopte() must not be used to
 		 * access the PTE because it would use the new PDE.  It is,
 		 * however, safe to use the old PDE because the page table
 		 * page is preserved by the promotion.
 		 */
 		pa = KPTmap[i386_btop(va)];
 		pa = (pa & PG_FRAME) | (va & PAGE_MASK);
 	}
 	return (pa);
 }
 #endif
 
 #if !defined(XEN)
 #define PT_UPDATES_FLUSH()
 #endif
 
 #if defined(PAE) && !defined(XEN)
 
 #define	pde_cmpset(pdep, old, new) \
 				atomic_cmpset_64((pdep), (old), (new))
 
 static __inline pt_entry_t
 pte_load(pt_entry_t *ptep)
 {
 	pt_entry_t r;
 
 	__asm __volatile(
 	    "lock; cmpxchg8b %1"
 	    : "=A" (r)
 	    : "m" (*ptep), "a" (0), "d" (0), "b" (0), "c" (0));
 	return (r);
 }
 
 static __inline pt_entry_t
 pte_load_store(pt_entry_t *ptep, pt_entry_t v)
 {
 	pt_entry_t r;
 
 	r = *ptep;
 	__asm __volatile(
 	    "1:\n"
 	    "\tlock; cmpxchg8b %1\n"
 	    "\tjnz 1b"
 	    : "+A" (r)
 	    : "m" (*ptep), "b" ((uint32_t)v), "c" ((uint32_t)(v >> 32)));
 	return (r);
 }
 
 /* XXXRU move to atomic.h? */
 static __inline int
 atomic_cmpset_64(volatile uint64_t *dst, uint64_t exp, uint64_t src)
 {
 	int64_t res = exp;
 
 	__asm __volatile (
 	"	lock ;			"
 	"	cmpxchg8b %2 ;		"
 	"	setz	%%al ;		"
 	"	movzbl	%%al,%0 ;	"
 	"# atomic_cmpset_64"
 	: "+A" (res),			/* 0 (result) */
 	  "=m" (*dst)			/* 1 */
 	: "m" (*dst),			/* 2 */
 	  "b" ((uint32_t)src),
 	  "c" ((uint32_t)(src >> 32)));
 
 	return (res);
 }
 
 #define	pte_load_clear(ptep)	pte_load_store((ptep), (pt_entry_t)0ULL)
 
 #define	pte_store(ptep, pte)	pte_load_store((ptep), (pt_entry_t)pte)
 
 extern pt_entry_t pg_nx;
 
 #elif !defined(PAE) && !defined (XEN)
 
 #define	pde_cmpset(pdep, old, new) \
 				atomic_cmpset_int((pdep), (old), (new))
 
 static __inline pt_entry_t
 pte_load(pt_entry_t *ptep)
 {
 	pt_entry_t r;
 
 	r = *ptep;
 	return (r);
 }
 
 static __inline pt_entry_t
 pte_load_store(pt_entry_t *ptep, pt_entry_t pte)
 {
 	__asm volatile("xchgl %0, %1" : "+m" (*ptep), "+r" (pte));
 	return (pte);
 }
 
 #define	pte_load_clear(pte)	atomic_readandclear_int(pte)
 
 static __inline void
 pte_store(pt_entry_t *ptep, pt_entry_t pte)
 {
 
 	*ptep = pte;
 }
 
 #endif /* PAE */
 
 #define	pte_clear(ptep)		pte_store((ptep), (pt_entry_t)0ULL)
 
 #define	pde_store(pdep, pde)	pte_store((pdep), (pde))
 
 #endif /* _KERNEL */
 
 /*
  * Pmap stuff
  */
 struct	pv_entry;
 struct	pv_chunk;
 
 struct md_page {
 	TAILQ_HEAD(,pv_entry)	pv_list;
 	int			pat_mode;
 };
 
 struct pmap {
 	struct mtx		pm_mtx;
 	pd_entry_t		*pm_pdir;	/* KVA of page directory */
 	TAILQ_HEAD(,pv_chunk)	pm_pvchunk;	/* list of mappings in pmap */
-	cpumask_t		pm_active;	/* active on cpus */
+	cpuset_t		pm_active;	/* active on cpus */
 	struct pmap_statistics	pm_stats;	/* pmap statistics */
 	LIST_ENTRY(pmap) 	pm_list;	/* List of all pmaps */
 #ifdef PAE
 	pdpt_entry_t		*pm_pdpt;	/* KVA of page director pointer
 						   table */
 #endif
 	vm_page_t		pm_root;	/* spare page table pages */
 };
 
 typedef struct pmap	*pmap_t;
 
 #ifdef _KERNEL
 extern struct pmap	kernel_pmap_store;
 #define kernel_pmap	(&kernel_pmap_store)
 
 #define	PMAP_LOCK(pmap)		mtx_lock(&(pmap)->pm_mtx)
 #define	PMAP_LOCK_ASSERT(pmap, type) \
 				mtx_assert(&(pmap)->pm_mtx, (type))
 #define	PMAP_LOCK_DESTROY(pmap)	mtx_destroy(&(pmap)->pm_mtx)
 #define	PMAP_LOCK_INIT(pmap)	mtx_init(&(pmap)->pm_mtx, "pmap", \
 				    NULL, MTX_DEF | MTX_DUPOK)
 #define	PMAP_LOCKED(pmap)	mtx_owned(&(pmap)->pm_mtx)
 #define	PMAP_MTX(pmap)		(&(pmap)->pm_mtx)
 #define	PMAP_TRYLOCK(pmap)	mtx_trylock(&(pmap)->pm_mtx)
 #define	PMAP_UNLOCK(pmap)	mtx_unlock(&(pmap)->pm_mtx)
 #endif
 
 /*
  * For each vm_page_t, there is a list of all currently valid virtual
  * mappings of that page.  An entry is a pv_entry_t, the list is pv_list.
  */
 typedef struct pv_entry {
 	vm_offset_t	pv_va;		/* virtual address for mapping */
 	TAILQ_ENTRY(pv_entry)	pv_list;
 } *pv_entry_t;
 
 /*
  * pv_entries are allocated in chunks per-process.  This avoids the
  * need to track per-pmap assignments.
  */
 #define	_NPCM	11
 #define	_NPCPV	336
 struct pv_chunk {
 	pmap_t			pc_pmap;
 	TAILQ_ENTRY(pv_chunk)	pc_list;
 	uint32_t		pc_map[_NPCM];	/* bitmap; 1 = free */
 	uint32_t		pc_spare[2];
 	struct pv_entry		pc_pventry[_NPCPV];
 };
 
 #ifdef	_KERNEL
 
 extern caddr_t	CADDR1;
 extern pt_entry_t *CMAP1;
 extern vm_paddr_t phys_avail[];
 extern vm_paddr_t dump_avail[];
 extern int pseflag;
 extern int pgeflag;
 extern char *ptvmmap;		/* poor name! */
 extern vm_offset_t virtual_avail;
 extern vm_offset_t virtual_end;
 
 #define	pmap_page_get_memattr(m)	((vm_memattr_t)(m)->md.pat_mode)
 #define	pmap_unmapbios(va, sz)	pmap_unmapdev((va), (sz))
 
 /*
  * Only the following functions or macros may be used before pmap_bootstrap()
  * is called: pmap_kenter(), pmap_kextract(), pmap_kremove(), vtophys(), and
  * vtopte().
  */
 void	pmap_bootstrap(vm_paddr_t);
 int	pmap_cache_bits(int mode, boolean_t is_pde);
 int	pmap_change_attr(vm_offset_t, vm_size_t, int);
 void	pmap_init_pat(void);
 void	pmap_kenter(vm_offset_t va, vm_paddr_t pa);
 void	*pmap_kenter_temporary(vm_paddr_t pa, int i);
 void	pmap_kremove(vm_offset_t);
 void	*pmap_mapbios(vm_paddr_t, vm_size_t);
 void	*pmap_mapdev(vm_paddr_t, vm_size_t);
 void	*pmap_mapdev_attr(vm_paddr_t, vm_size_t, int);
 boolean_t pmap_page_is_mapped(vm_page_t m);
 void	pmap_page_set_memattr(vm_page_t m, vm_memattr_t ma);
 void	pmap_unmapdev(vm_offset_t, vm_size_t);
 pt_entry_t *pmap_pte(pmap_t, vm_offset_t) __pure2;
 void	pmap_invalidate_page(pmap_t, vm_offset_t);
 void	pmap_invalidate_range(pmap_t, vm_offset_t, vm_offset_t);
 void	pmap_invalidate_all(pmap_t);
 void	pmap_invalidate_cache(void);
 void	pmap_invalidate_cache_pages(vm_page_t *pages, int count);
 void	pmap_invalidate_cache_range(vm_offset_t sva, vm_offset_t eva);
 
 #endif /* _KERNEL */
 
 #endif /* !LOCORE */
 
 #endif /* !_MACHINE_PMAP_H_ */
Index: head/sys/i386/include/sf_buf.h
===================================================================
--- head/sys/i386/include/sf_buf.h	(revision 222812)
+++ head/sys/i386/include/sf_buf.h	(revision 222813)
@@ -1,63 +1,64 @@
 /*-
  * Copyright (c) 2003, 2005 Alan L. Cox <alc@cs.rice.edu>
  * All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
  * 1. Redistributions of source code must retain the above copyright
  *    notice, this list of conditions and the following disclaimer.
  * 2. Redistributions in binary form must reproduce the above copyright
  *    notice, this list of conditions and the following disclaimer in the
  *    documentation and/or other materials provided with the distribution.
  *
  * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
  * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
  * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
  * ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
  * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
  * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
  * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
  * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
  * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
  * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
  * SUCH DAMAGE.
  *
  * $FreeBSD$
  */
 
 #ifndef _MACHINE_SF_BUF_H_
 #define _MACHINE_SF_BUF_H_
 
+#include <sys/_cpuset.h>
 #include <sys/queue.h>
 
 struct vm_page;
 
 struct sf_buf {
 	LIST_ENTRY(sf_buf) list_entry;	/* list of buffers */
 	TAILQ_ENTRY(sf_buf) free_entry;	/* list of buffers */
 	struct		vm_page *m;	/* currently mapped page */
 	vm_offset_t	kva;		/* va of mapping */
 	int		ref_count;	/* usage of this mapping */
 #ifdef SMP
-	cpumask_t	cpumask;	/* cpus on which mapping is valid */
+	cpuset_t	cpumask;	/* cpus on which mapping is valid */
 #endif
 };
 
 static __inline vm_offset_t
 sf_buf_kva(struct sf_buf *sf)
 {
 
 	return (sf->kva);
 }
 
 static __inline struct vm_page *
 sf_buf_page(struct sf_buf *sf)
 {
 
 	return (sf->m);
 }
 
 boolean_t sf_buf_invalidate_cache(vm_page_t m);
 
 #endif /* !_MACHINE_SF_BUF_H_ */
Index: head/sys/i386/include/smp.h
===================================================================
--- head/sys/i386/include/smp.h	(revision 222812)
+++ head/sys/i386/include/smp.h	(revision 222813)
@@ -1,93 +1,93 @@
 /*-
  * ----------------------------------------------------------------------------
  * "THE BEER-WARE LICENSE" (Revision 42):
  * <phk@FreeBSD.org> wrote this file.  As long as you retain this notice you
  * can do whatever you want with this stuff. If we meet some day, and you think
  * this stuff is worth it, you can buy me a beer in return.   Poul-Henning Kamp
  * ----------------------------------------------------------------------------
  *
  * $FreeBSD$
  *
  */
 
 #ifndef _MACHINE_SMP_H_
 #define _MACHINE_SMP_H_
 
 #ifdef _KERNEL
 
 #ifdef SMP
 
 #ifndef LOCORE
 
 #include <sys/bus.h>
 #include <machine/frame.h>
 #include <machine/intr_machdep.h>
 #include <machine/apicvar.h>
 #include <machine/pcb.h>
 
 /* global data in mpboot.s */
 extern int			bootMP_size;
 
 /* functions in mpboot.s */
 void	bootMP(void);
 
 /* global data in mp_machdep.c */
 extern int			mp_naps;
 extern int			boot_cpu_id;
 extern struct pcb		stoppcbs[];
 extern int			cpu_apic_ids[];
 #ifdef COUNT_IPIS
 extern u_long *ipi_invltlb_counts[MAXCPU];
 extern u_long *ipi_invlrng_counts[MAXCPU];
 extern u_long *ipi_invlpg_counts[MAXCPU];
 extern u_long *ipi_invlcache_counts[MAXCPU];
 extern u_long *ipi_rendezvous_counts[MAXCPU];
 extern u_long *ipi_lazypmap_counts[MAXCPU];
 #endif
 
 /* IPI handlers */
 inthand_t
 	IDTVEC(invltlb),	/* TLB shootdowns - global */
 	IDTVEC(invlpg),		/* TLB shootdowns - 1 page */
 	IDTVEC(invlrng),	/* TLB shootdowns - page range */
 	IDTVEC(invlcache),	/* Write back and invalidate cache */
 	IDTVEC(ipi_intr_bitmap_handler), /* Bitmap based IPIs */ 
 	IDTVEC(cpustop),	/* CPU stops & waits to be restarted */
 	IDTVEC(rendezvous),	/* handle CPU rendezvous */
 	IDTVEC(lazypmap);	/* handle lazy pmap release */
 
 /* functions in mp_machdep.c */
 void	cpu_add(u_int apic_id, char boot_cpu);
 void	cpustop_handler(void);
 void	init_secondary(void);
 void	ipi_all_but_self(u_int ipi);
 #ifndef XEN
 void 	ipi_bitmap_handler(struct trapframe frame);
 #endif
 void	ipi_cpu(int cpu, u_int ipi);
 int	ipi_nmi_handler(void);
-void	ipi_selected(cpumask_t cpus, u_int ipi);
+void	ipi_selected(cpuset_t cpus, u_int ipi);
 u_int	mp_bootaddress(u_int);
 int	mp_grab_cpu_hlt(void);
 void	smp_cache_flush(void);
 void	smp_invlpg(vm_offset_t addr);
-void	smp_masked_invlpg(cpumask_t mask, vm_offset_t addr);
+void	smp_masked_invlpg(cpuset_t mask, vm_offset_t addr);
 void	smp_invlpg_range(vm_offset_t startva, vm_offset_t endva);
-void	smp_masked_invlpg_range(cpumask_t mask, vm_offset_t startva,
+void	smp_masked_invlpg_range(cpuset_t mask, vm_offset_t startva,
 	    vm_offset_t endva);
 void	smp_invltlb(void);
-void	smp_masked_invltlb(cpumask_t mask);
+void	smp_masked_invltlb(cpuset_t mask);
 
 #ifdef XEN
 void ipi_to_irq_init(void);
 
 #define RESCHEDULE_VECTOR	0
 #define CALL_FUNCTION_VECTOR	1
 #define NR_IPIS			2
 
 #endif
 #endif /* !LOCORE */
 #endif /* SMP */
 
 #endif /* _KERNEL */
 #endif /* _MACHINE_SMP_H_ */
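
Note on the conversion pattern used in the mp_machdep.c changes: integer mask
arithmetic on cpumask_t (assignment of 1, "|=", "& ~") is replaced by set
operations on cpuset_t (CPU_SETOF, CPU_ZERO, CPU_OR, and a copy followed by
CPU_NAND). The sketch below is a minimal userland model of those semantics,
inferred from the before/after lines in this diff. It does not use the kernel
headers: the type demo_cpuset_t, the demo_* helpers, and the DEMO_* constants
are hypothetical stand-ins chosen for illustration, not the definitions in
sys/cpuset.h.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for cpuset_t: a bitmap wider than one machine word. */
#define DEMO_MAXCPU	128
#define DEMO_WORD_BITS	32
#define DEMO_NWORDS	(DEMO_MAXCPU / DEMO_WORD_BITS)

typedef struct {
	uint32_t bits[DEMO_NWORDS];
} demo_cpuset_t;

/* Models CPU_ZERO(): clear every CPU in the set. */
static inline void
demo_zero(demo_cpuset_t *s)
{
	memset(s, 0, sizeof(*s));
}

/* Models CPU_SET(): add one CPU to the set. */
static inline void
demo_set(int cpu, demo_cpuset_t *s)
{
	s->bits[cpu / DEMO_WORD_BITS] |= (uint32_t)1 << (cpu % DEMO_WORD_BITS);
}

/* Models CPU_SETOF(): the set becomes exactly { cpu } (old: mask = 1 << cpu). */
static inline void
demo_setof(int cpu, demo_cpuset_t *s)
{
	demo_zero(s);
	demo_set(cpu, s);
}

/* Models CPU_ISSET(): non-zero if cpu is a member of the set. */
static inline int
demo_isset(int cpu, const demo_cpuset_t *s)
{
	return ((s->bits[cpu / DEMO_WORD_BITS] >> (cpu % DEMO_WORD_BITS)) & 1);
}

/* Models CPU_OR(d, s): d |= s, word by word. */
static inline void
demo_or(demo_cpuset_t *d, const demo_cpuset_t *s)
{
	int i;

	for (i = 0; i < DEMO_NWORDS; i++)
		d->bits[i] |= s->bits[i];
}

/* Models CPU_NAND(d, s) as used in this diff: d &= ~s, word by word. */
static inline void
demo_nand(demo_cpuset_t *d, const demo_cpuset_t *s)
{
	int i;

	for (i = 0; i < DEMO_NWORDS; i++)
		d->bits[i] &= ~s->bits[i];
}

/*
 * Old form: other = all & ~self.
 * New form: copy "all" into a temporary, then remove "self" from the copy,
 * so that the shared "all" set is left unmodified.
 */
static inline demo_cpuset_t
demo_other_cpus(const demo_cpuset_t *all, const demo_cpuset_t *self)
{
	demo_cpuset_t tmp = *all;

	demo_nand(&tmp, self);
	return (tmp);
}
```

The temporary copy in demo_other_cpus() mirrors the tallcpus variable added to
init_secondary(): the set operations modify their first argument in place, so
an expression such as "all_cpus & ~mask" needs a scratch set once the type is
no longer a plain integer.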
Index: head/sys/i386/xen/mp_machdep.c
===================================================================
--- head/sys/i386/xen/mp_machdep.c	(revision 222812)
+++ head/sys/i386/xen/mp_machdep.c	(revision 222813)
@@ -1,1247 +1,1268 @@
 /*-
  * Copyright (c) 1996, by Steve Passe
  * Copyright (c) 2008, by Kip Macy
  * All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
  * 1. Redistributions of source code must retain the above copyright
  *    notice, this list of conditions and the following disclaimer.
  * 2. The name of the developer may NOT be used to endorse or promote products
  *    derived from this software without specific prior written permission.
  *
  * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
  * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
  * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
  * ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
  * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
  * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
  * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
  * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
  * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
  * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
  * SUCH DAMAGE.
  */
 
 #include <sys/cdefs.h>
 __FBSDID("$FreeBSD$");
 
 #include "opt_apic.h"
 #include "opt_cpu.h"
 #include "opt_kstack_pages.h"
 #include "opt_mp_watchdog.h"
 #include "opt_pmap.h"
 #include "opt_sched.h"
 #include "opt_smp.h"
 
 #if !defined(lint)
 #if !defined(SMP)
 #error How did you get here?
 #endif
 
 #ifndef DEV_APIC
 #error The apic device is required for SMP, add "device apic" to your config file.
 #endif
 #if defined(CPU_DISABLE_CMPXCHG) && !defined(COMPILING_LINT)
 #error SMP not supported with CPU_DISABLE_CMPXCHG
 #endif
 #endif /* not lint */
 
 #include <sys/param.h>
 #include <sys/systm.h>
 #include <sys/bus.h>
 #include <sys/cons.h>	/* cngetc() */
+#include <sys/cpuset.h>
 #ifdef GPROF 
 #include <sys/gmon.h>
 #endif
 #include <sys/kernel.h>
 #include <sys/ktr.h>
 #include <sys/lock.h>
 #include <sys/malloc.h>
 #include <sys/memrange.h>
 #include <sys/mutex.h>
 #include <sys/pcpu.h>
 #include <sys/proc.h>
 #include <sys/sched.h>
 #include <sys/smp.h>
 #include <sys/sysctl.h>
 
 #include <vm/vm.h>
 #include <vm/vm_param.h>
 #include <vm/pmap.h>
 #include <vm/vm_kern.h>
 #include <vm/vm_extern.h>
 #include <vm/vm_page.h>
 
 #include <x86/apicreg.h>
 #include <machine/md_var.h>
 #include <machine/mp_watchdog.h>
 #include <machine/pcb.h>
 #include <machine/psl.h>
 #include <machine/smp.h>
 #include <machine/specialreg.h>
 #include <machine/pcpu.h>
 
 
 
 #include <machine/xen/xen-os.h>
 #include <xen/evtchn.h>
 #include <xen/xen_intr.h>
 #include <xen/hypervisor.h>
 #include <xen/interface/vcpu.h>
 
 
 int	mp_naps;		/* # of application processors */
 int	boot_cpu_id = -1;	/* designated BSP */
 
 extern	struct pcpu __pcpu[];
 
 static int bootAP;
 static union descriptor *bootAPgdt;
 
 static char resched_name[NR_CPUS][15];
 static char callfunc_name[NR_CPUS][15];
 
 /* Free these after use */
 void *bootstacks[MAXCPU];
 
 struct pcb stoppcbs[MAXCPU];
 
 /* Variables needed for SMP tlb shootdown. */
 vm_offset_t smp_tlb_addr1;
 vm_offset_t smp_tlb_addr2;
 volatile int smp_tlb_wait;
 
 typedef void call_data_func_t(uintptr_t , uintptr_t);
 
 static u_int logical_cpus;
-static volatile cpumask_t ipi_nmi_pending;
+static volatile cpuset_t ipi_nmi_pending;
 
 /* used to hold the AP's until we are ready to release them */
 static struct mtx ap_boot_mtx;
 
 /* Set to 1 once we're ready to let the APs out of the pen. */
 static volatile int aps_ready = 0;
 
 /*
  * Store data from cpu_add() until later in the boot when we actually setup
  * the APs.
  */
 struct cpu_info {
 	int	cpu_present:1;
 	int	cpu_bsp:1;
 	int	cpu_disabled:1;
 } static cpu_info[MAX_APIC_ID + 1];
 int cpu_apic_ids[MAXCPU];
 int apic_cpuids[MAX_APIC_ID + 1];
 
 /* Holds pending bitmap based IPIs per CPU */
 static volatile u_int cpu_ipi_pending[MAXCPU];
 
 static int cpu_logical;
 static int cpu_cores;
 
 static void	assign_cpu_ids(void);
 static void	set_interrupt_apic_ids(void);
 int	start_all_aps(void);
 static int	start_ap(int apic_id);
 static void	release_aps(void *dummy);
 
 static u_int	hyperthreading_cpus;
-static cpumask_t	hyperthreading_cpus_mask;
+static cpuset_t	hyperthreading_cpus_mask;
 
 extern void Xhypervisor_callback(void);
 extern void failsafe_callback(void);
 extern void pmap_lazyfix_action(void);
 
 struct cpu_group *
 cpu_topo(void)
 {
 	if (cpu_cores == 0)
 		cpu_cores = 1;
 	if (cpu_logical == 0)
 		cpu_logical = 1;
 	if (mp_ncpus % (cpu_cores * cpu_logical) != 0) {
 		printf("WARNING: Non-uniform processors.\n");
 		printf("WARNING: Using suboptimal topology.\n");
 		return (smp_topo_none());
 	}
 	/*
 	 * No multi-core or hyper-threaded.
 	 */
 	if (cpu_logical * cpu_cores == 1)
 		return (smp_topo_none());
 	/*
 	 * Only HTT no multi-core.
 	 */
 	if (cpu_logical > 1 && cpu_cores == 1)
 		return (smp_topo_1level(CG_SHARE_L1, cpu_logical, CG_FLAG_HTT));
 	/*
 	 * Only multi-core no HTT.
 	 */
 	if (cpu_cores > 1 && cpu_logical == 1)
 		return (smp_topo_1level(CG_SHARE_NONE, cpu_cores, 0));
 	/*
 	 * Both HTT and multi-core.
 	 */
 	return (smp_topo_2level(CG_SHARE_NONE, cpu_cores,
 	    CG_SHARE_L1, cpu_logical, CG_FLAG_HTT));
 }
 
 /*
  * Calculate usable address in base memory for AP trampoline code.
  */
 u_int
 mp_bootaddress(u_int basemem)
 {
 
 	return (basemem);
 }
 
 void
 cpu_add(u_int apic_id, char boot_cpu)
 {
 
 	if (apic_id > MAX_APIC_ID) {
 		panic("SMP: APIC ID %d too high", apic_id);
 		return;
 	}
 	KASSERT(cpu_info[apic_id].cpu_present == 0, ("CPU %d added twice",
 	    apic_id));
 	cpu_info[apic_id].cpu_present = 1;
 	if (boot_cpu) {
 		KASSERT(boot_cpu_id == -1,
 		    ("CPU %d claims to be BSP, but CPU %d already is", apic_id,
 		    boot_cpu_id));
 		boot_cpu_id = apic_id;
 		cpu_info[apic_id].cpu_bsp = 1;
 	}
 	if (mp_ncpus < MAXCPU)
 		mp_ncpus++;
 	if (bootverbose)
 		printf("SMP: Added CPU %d (%s)\n", apic_id, boot_cpu ? "BSP" :
 		    "AP");
 }
 
 void
 cpu_mp_setmaxid(void)
 {
 
 	mp_maxid = MAXCPU - 1;
 }
 
 int
 cpu_mp_probe(void)
 {
 
 	/*
 	 * Always record BSP in CPU map so that the mbuf init code works
 	 * correctly.
 	 */
-	all_cpus = 1;
+	CPU_SETOF(0, &all_cpus);
 	if (mp_ncpus == 0) {
 		/*
 		 * No CPUs were found, so this must be a UP system.  Setup
 		 * the variables to represent a system with a single CPU
 		 * with an id of 0.
 		 */
 		mp_ncpus = 1;
 		return (0);
 	}
 
 	/* At least one CPU was found. */
 	if (mp_ncpus == 1) {
 		/*
 		 * One CPU was found, so this must be a UP system with
 		 * an I/O APIC.
 		 */
 		return (0);
 	}
 
 	/* At least two CPUs were found. */
 	return (1);
 }
 
 /*
  * Initialize the IPI handlers and start up the AP's.
  */
 void
 cpu_mp_start(void)
 {
 	int i;
 
 	/* Initialize the logical ID to APIC ID table. */
 	for (i = 0; i < MAXCPU; i++) {
 		cpu_apic_ids[i] = -1;
 		cpu_ipi_pending[i] = 0;
 	}
 
 	/* Set boot_cpu_id if needed. */
 	if (boot_cpu_id == -1) {
 		boot_cpu_id = PCPU_GET(apic_id);
 		cpu_info[boot_cpu_id].cpu_bsp = 1;
 	} else
 		KASSERT(boot_cpu_id == PCPU_GET(apic_id),
 		    ("BSP's APIC ID doesn't match boot_cpu_id"));
 	cpu_apic_ids[0] = boot_cpu_id;
 	apic_cpuids[boot_cpu_id] = 0;
 
 	assign_cpu_ids();
 
 	/* Start each Application Processor */
 	start_all_aps();
 
 	/* Setup the initial logical CPUs info. */
-	logical_cpus = logical_cpus_mask = 0;
+	logical_cpus = 0;
+	CPU_ZERO(&logical_cpus_mask);
 	if (cpu_feature & CPUID_HTT)
 		logical_cpus = (cpu_procinfo & CPUID_HTT_CORES) >> 16;
 
 	set_interrupt_apic_ids();
 }
 
 
 static void
 iv_rendezvous(uintptr_t a, uintptr_t b)
 {
 	smp_rendezvous_action();
 }
 
 static void
 iv_invltlb(uintptr_t a, uintptr_t b)
 {
 	xen_tlb_flush();
 }
 
 static void
 iv_invlpg(uintptr_t a, uintptr_t b)
 {
 	xen_invlpg(a);
 }
 
 static void
 iv_invlrng(uintptr_t a, uintptr_t b)
 {
 	vm_offset_t start = (vm_offset_t)a;
 	vm_offset_t end = (vm_offset_t)b;
 
 	while (start < end) {
 		xen_invlpg(start);
 		start += PAGE_SIZE;
 	}
 }
 
 
 static void
 iv_invlcache(uintptr_t a, uintptr_t b)
 {
 
 	wbinvd();
 	atomic_add_int(&smp_tlb_wait, 1);
 }
 
 static void
 iv_lazypmap(uintptr_t a, uintptr_t b)
 {
 	pmap_lazyfix_action();
 	atomic_add_int(&smp_tlb_wait, 1);
 }
 
 /*
  * These start from "IPI offset" APIC_IPI_INTS
  */
 static call_data_func_t *ipi_vectors[6] = 
 {
   iv_rendezvous,
   iv_invltlb,
   iv_invlpg,
   iv_invlrng,
   iv_invlcache,
   iv_lazypmap,
 };
 
 /*
  * Reschedule call back. Nothing to do,
  * all the work is done automatically when
  * we return from the interrupt.
  */
 static int
 smp_reschedule_interrupt(void *unused)
 {
 	int cpu = PCPU_GET(cpuid);
 	u_int ipi_bitmap;
 
 	ipi_bitmap = atomic_readandclear_int(&cpu_ipi_pending[cpu]);
 
 	if (ipi_bitmap & (1 << IPI_PREEMPT)) {
 #ifdef COUNT_IPIS
 		(*ipi_preempt_counts[cpu])++;
 #endif
 		sched_preempt(curthread);
 	}
 
 	if (ipi_bitmap & (1 << IPI_AST)) {
 #ifdef COUNT_IPIS
 		(*ipi_ast_counts[cpu])++;
 #endif
 		/* Nothing to do for AST */
 	}	
 	return (FILTER_HANDLED);
 }
 
 struct _call_data {
 	uint16_t func_id;
 	uint16_t wait;
 	uintptr_t arg1;
 	uintptr_t arg2;
 	atomic_t started;
 	atomic_t finished;
 };
 
 static struct _call_data *call_data;
 
 static int
 smp_call_function_interrupt(void *unused)
 {	
 	call_data_func_t *func;
 	uintptr_t arg1 = call_data->arg1;
 	uintptr_t arg2 = call_data->arg2;
 	int wait = call_data->wait;
 	atomic_t *started = &call_data->started;
 	atomic_t *finished = &call_data->finished;
 
 	/* We only handle function IPIs, not bitmap IPIs */
 	if (call_data->func_id < APIC_IPI_INTS || call_data->func_id > IPI_BITMAP_VECTOR)
 		panic("invalid function id %u", call_data->func_id);
 	
 	func = ipi_vectors[call_data->func_id - APIC_IPI_INTS];
 	/*
 	 * Notify initiating CPU that I've grabbed the data and am
 	 * about to execute the function
 	 */
 	mb();
 	atomic_inc(started);
 	/*
 	 * At this point the info structure may be out of scope unless wait==1
 	 */
 	(*func)(arg1, arg2);
 
 	if (wait) {
 		mb();
 		atomic_inc(finished);
 	}
 	atomic_add_int(&smp_tlb_wait, 1);
 	return (FILTER_HANDLED);
 }
 
 /*
  * Print various information about the SMP system hardware and setup.
  */
 void
 cpu_mp_announce(void)
 {
 	int i, x;
 
 	/* List CPUs */
 	printf(" cpu0 (BSP): APIC ID: %2d\n", boot_cpu_id);
 	for (i = 1, x = 0; x <= MAX_APIC_ID; x++) {
 		if (!cpu_info[x].cpu_present || cpu_info[x].cpu_bsp)
 			continue;
 		if (cpu_info[x].cpu_disabled)
 			printf("  cpu (AP): APIC ID: %2d (disabled)\n", x);
 		else {
 			KASSERT(i < mp_ncpus,
 			    ("mp_ncpus and actual cpus are out of whack"));
 			printf(" cpu%d (AP): APIC ID: %2d\n", i++, x);
 		}
 	}
 }
 
 static int
 xen_smp_intr_init(unsigned int cpu)
 {
 	int rc;
 	unsigned int irq;
 	
 	per_cpu(resched_irq, cpu) = per_cpu(callfunc_irq, cpu) = -1;
 
 	sprintf(resched_name[cpu], "resched%u", cpu);
 	rc = bind_ipi_to_irqhandler(RESCHEDULE_VECTOR,
 				    cpu,
 				    resched_name[cpu],
 				    smp_reschedule_interrupt,
 	    INTR_TYPE_TTY, &irq);
 
 	printf("[XEN] IPI cpu=%d irq=%d vector=RESCHEDULE_VECTOR (%d)\n",
 	    cpu, irq, RESCHEDULE_VECTOR);
 	
 	per_cpu(resched_irq, cpu) = irq;
 
 	sprintf(callfunc_name[cpu], "callfunc%u", cpu);
 	rc = bind_ipi_to_irqhandler(CALL_FUNCTION_VECTOR,
 				    cpu,
 				    callfunc_name[cpu],
 				    smp_call_function_interrupt,
 	    INTR_TYPE_TTY, &irq);
 	if (rc < 0)
 		goto fail;
 	per_cpu(callfunc_irq, cpu) = irq;
 
 	printf("[XEN] IPI cpu=%d irq=%d vector=CALL_FUNCTION_VECTOR (%d)\n",
 	    cpu, irq, CALL_FUNCTION_VECTOR);
 
 	
 	if ((cpu != 0) && ((rc = ap_cpu_initclocks(cpu)) != 0))
 		goto fail;
 
 	return 0;
 
  fail:
 	if (per_cpu(resched_irq, cpu) >= 0)
 		unbind_from_irqhandler(per_cpu(resched_irq, cpu));
 	if (per_cpu(callfunc_irq, cpu) >= 0)
 		unbind_from_irqhandler(per_cpu(callfunc_irq, cpu));
 	return rc;
 }
 
 static void
 xen_smp_intr_init_cpus(void *unused)
 {
 	int i;
 	    
 	for (i = 0; i < mp_ncpus; i++)
 		xen_smp_intr_init(i);
 }
 
 #define MTOPSIZE (1<<(14 + PAGE_SHIFT))
 
 /*
  * AP CPU's call this to initialize themselves.
  */
 void
 init_secondary(void)
 {
+	cpuset_t tcpuset, tallcpus;
 	vm_offset_t addr;
 	int	gsel_tss;
 	
 	
 	/* bootAP is set in start_ap() to our ID. */
 	PCPU_SET(currentldt, _default_ldt);
 	gsel_tss = GSEL(GPROC0_SEL, SEL_KPL);
 #if 0
 	gdt[bootAP * NGDT + GPROC0_SEL].sd.sd_type = SDT_SYS386TSS;
 #endif
 	PCPU_SET(common_tss.tss_esp0, 0); /* not used until after switch */
 	PCPU_SET(common_tss.tss_ss0, GSEL(GDATA_SEL, SEL_KPL));
 	PCPU_SET(common_tss.tss_ioopt, (sizeof (struct i386tss)) << 16);
 #if 0
 	PCPU_SET(tss_gdt, &gdt[bootAP * NGDT + GPROC0_SEL].sd);
 
 	PCPU_SET(common_tssd, *PCPU_GET(tss_gdt));
 #endif
 	PCPU_SET(fsgs_gdt, &gdt[GUFS_SEL].sd);
 
 	/*
 	 * Set to a known state:
 	 * Set by mpboot.s: CR0_PG, CR0_PE
 	 * Set by cpu_setregs: CR0_NE, CR0_MP, CR0_TS, CR0_WP, CR0_AM
 	 */
 	/*
 	 * signal our startup to the BSP.
 	 */
 	mp_naps++;
 
 	/* Spin until the BSP releases the AP's. */
 	while (!aps_ready)
 		ia32_pause();
 
 	/* BSP may have changed PTD while we were waiting */
 	invltlb();
 	for (addr = 0; addr < NKPT * NBPDR - 1; addr += PAGE_SIZE)
 		invlpg(addr);
 
 	/* set up FPU state on the AP */
 	npxinit();
 #if 0
 	
 	/* set up SSE registers */
 	enable_sse();
 #endif
 #if 0 && defined(PAE)
 	/* Enable the PTE no-execute bit. */
 	if ((amd_feature & AMDID_NX) != 0) {
 		uint64_t msr;
 
 		msr = rdmsr(MSR_EFER) | EFER_NXE;
 		wrmsr(MSR_EFER, msr);
 	}
 #endif
 #if 0
 	/* A quick check from sanity claus */
 	if (PCPU_GET(apic_id) != lapic_id()) {
 		printf("SMP: cpuid = %d\n", PCPU_GET(cpuid));
 		printf("SMP: actual apic_id = %d\n", lapic_id());
 		printf("SMP: correct apic_id = %d\n", PCPU_GET(apic_id));
 		panic("cpuid mismatch! boom!!");
 	}
 #endif
 	
 	/* Initialize curthread. */
 	KASSERT(PCPU_GET(idlethread) != NULL, ("no idle thread"));
 	PCPU_SET(curthread, PCPU_GET(idlethread));
 
 	mtx_lock_spin(&ap_boot_mtx);
 #if 0
 	
 	/* Init local apic for irq's */
 	lapic_setup(1);
 #endif
 	smp_cpus++;
 
 	CTR1(KTR_SMP, "SMP: AP CPU #%d Launched", PCPU_GET(cpuid));
 	printf("SMP: AP CPU #%d Launched!\n", PCPU_GET(cpuid));
+	tcpuset = PCPU_GET(cpumask);
 
 	/* Determine if we are a logical CPU. */
 	if (logical_cpus > 1 && PCPU_GET(apic_id) % logical_cpus != 0)
-		logical_cpus_mask |= PCPU_GET(cpumask);
+		CPU_OR(&logical_cpus_mask, &tcpuset);
 	
 	/* Determine if we are a hyperthread. */
 	if (hyperthreading_cpus > 1 &&
 	    PCPU_GET(apic_id) % hyperthreading_cpus != 0)
-		hyperthreading_cpus_mask |= PCPU_GET(cpumask);
+		CPU_OR(&hyperthreading_cpus_mask, &tcpuset);
 
 	/* Build our map of 'other' CPUs. */
-	PCPU_SET(other_cpus, all_cpus & ~PCPU_GET(cpumask));
+	tallcpus = all_cpus;
+	CPU_NAND(&tallcpus, &tcpuset);
+	PCPU_SET(other_cpus, tallcpus);
 #if 0
 	if (bootverbose)
 		lapic_dump("AP");
 #endif
 	if (smp_cpus == mp_ncpus) {
 		/* enable IPI's, tlb shootdown, freezes etc */
 		atomic_store_rel_int(&smp_started, 1);
 		smp_active = 1;	 /* historic */
 	}
 
 	mtx_unlock_spin(&ap_boot_mtx);
 
 	/* wait until all the AP's are up */
 	while (smp_started == 0)
 		ia32_pause();
 
 	PCPU_SET(curthread, PCPU_GET(idlethread));
 
 	/* Start per-CPU event timers. */
 	cpu_initclocks_ap();
 
 	/* enter the scheduler */
 	sched_throw(NULL);
 
 	panic("scheduler returned us to %s", __func__);
 	/* NOTREACHED */
 }
 
 /*******************************************************************
  * local functions and data
  */
 
 /*
  * We tell the I/O APIC code about all the CPUs we want to receive
  * interrupts.  If we don't want certain CPUs to receive IRQs we
  * can simply not tell the I/O APIC code about them in this function.
  * We also do not tell it about the BSP since it tells itself about
  * the BSP internally to work with UP kernels and on UP machines.
  */
 static void
 set_interrupt_apic_ids(void)
 {
 	u_int i, apic_id;
 
 	for (i = 0; i < MAXCPU; i++) {
 		apic_id = cpu_apic_ids[i];
 		if (apic_id == -1)
 			continue;
 		if (cpu_info[apic_id].cpu_bsp)
 			continue;
 		if (cpu_info[apic_id].cpu_disabled)
 			continue;
 
 		/* Don't let hyperthreads service interrupts. */
 		if (hyperthreading_cpus > 1 &&
 		    apic_id % hyperthreading_cpus != 0)
 			continue;
 
 		intr_add_cpu(i);
 	}
 }
 
 /*
  * Assign logical CPU IDs to local APICs.
  */
 static void
 assign_cpu_ids(void)
 {
 	u_int i;
 
 	/* Check for explicitly disabled CPUs. */
 	for (i = 0; i <= MAX_APIC_ID; i++) {
 		if (!cpu_info[i].cpu_present || cpu_info[i].cpu_bsp)
 			continue;
 
 		/* Don't use this CPU if it has been disabled by a tunable. */
 		if (resource_disabled("lapic", i)) {
 			cpu_info[i].cpu_disabled = 1;
 			continue;
 		}
 	}
 
 	/*
 	 * Assign CPU IDs to local APIC IDs and disable any CPUs
 	 * beyond MAXCPU.  CPU 0 has already been assigned to the BSP,
 	 * so we only have to assign IDs for APs.
 	 */
 	mp_ncpus = 1;
 	for (i = 0; i <= MAX_APIC_ID; i++) {
 		if (!cpu_info[i].cpu_present || cpu_info[i].cpu_bsp ||
 		    cpu_info[i].cpu_disabled)
 			continue;
 
 		if (mp_ncpus < MAXCPU) {
 			cpu_apic_ids[mp_ncpus] = i;
 			apic_cpuids[i] = mp_ncpus;
 			mp_ncpus++;
 		} else
 			cpu_info[i].cpu_disabled = 1;
 	}
 	KASSERT(mp_maxid >= mp_ncpus - 1,
 	    ("%s: counters out of sync: max %d, count %d", __func__, mp_maxid,
 	    mp_ncpus));		
 }
 
 /*
  * start each AP in our list
  */
 /* Lowest 1MB is already mapped: don't touch*/
 #define TMPMAP_START 1
 int
 start_all_aps(void)
 {
+	cpuset_t tallcpus;
 	int x,apic_id, cpu;
 	struct pcpu *pc;
 	
 	mtx_init(&ap_boot_mtx, "ap boot", NULL, MTX_SPIN);
 
 	/* set up temporary P==V mapping for AP boot */
 	/* XXX this is a hack, we should boot the AP on its own stack/PTD */
 
 	/* start each AP */
 	for (cpu = 1; cpu < mp_ncpus; cpu++) {
 		apic_id = cpu_apic_ids[cpu];
 
 
 		bootAP = cpu;
 		bootAPgdt = gdt + (512*cpu);
 
 		/* Get per-cpu data */
 		pc = &__pcpu[bootAP];
 		pcpu_init(pc, bootAP, sizeof(struct pcpu));
 		dpcpu_init((void *)kmem_alloc(kernel_map, DPCPU_SIZE), bootAP);
 		pc->pc_apic_id = cpu_apic_ids[bootAP];
 		pc->pc_prvspace = pc;
 		pc->pc_curthread = 0;
 
 		gdt_segs[GPRIV_SEL].ssd_base = (int) pc;
 		gdt_segs[GPROC0_SEL].ssd_base = (int) &pc->pc_common_tss;
 		
 		PT_SET_MA(bootAPgdt, VTOM(bootAPgdt) | PG_V | PG_RW);
 		bzero(bootAPgdt, PAGE_SIZE);
 		for (x = 0; x < NGDT; x++)
 			ssdtosd(&gdt_segs[x], &bootAPgdt[x].sd);
 		PT_SET_MA(bootAPgdt, vtomach(bootAPgdt) | PG_V);
 #ifdef notyet
 		
                 if (HYPERVISOR_vcpu_op(VCPUOP_get_physid, cpu, &cpu_id) == 0) { 
                         apicid = xen_vcpu_physid_to_x86_apicid(cpu_id.phys_id); 
                         acpiid = xen_vcpu_physid_to_x86_acpiid(cpu_id.phys_id); 
 #ifdef CONFIG_ACPI 
                         if (acpiid != 0xff) 
                                 x86_acpiid_to_apicid[acpiid] = apicid; 
 #endif 
                 } 
 #endif
 		
 		/* attempt to start the Application Processor */
 		if (!start_ap(cpu)) {
 			printf("AP #%d (PHY# %d) failed!\n", cpu, apic_id);
 			/* better panic as the AP may be running loose */
 			printf("panic y/n? [y] ");
 			if (cngetc() != 'n')
 				panic("bye-bye");
 		}
 
-		all_cpus |= (1 << cpu);		/* record AP in CPU map */
+		CPU_SET(cpu, &all_cpus);	/* record AP in CPU map */
 	}
 	
 
 	/* build our map of 'other' CPUs */
-	PCPU_SET(other_cpus, all_cpus & ~PCPU_GET(cpumask));
+	tallcpus = all_cpus;
+	CPU_NAND(&tallcpus, PCPU_PTR(cpumask));
+	PCPU_SET(other_cpus, tallcpus);
 
 	pmap_invalidate_range(kernel_pmap, 0, NKPT * NBPDR - 1);
 	
 	/* number of APs actually started */
 	return mp_naps;
 }
 
 extern uint8_t *pcpu_boot_stack;
 extern trap_info_t trap_table[];
 
 static void
 smp_trap_init(trap_info_t *trap_ctxt)
 {
         const trap_info_t *t = trap_table;
 
         for (t = trap_table; t->address; t++) {
                 trap_ctxt[t->vector].flags = t->flags;
                 trap_ctxt[t->vector].cs = t->cs;
                 trap_ctxt[t->vector].address = t->address;
         }
 }
 
 extern int nkpt;
 static void
 cpu_initialize_context(unsigned int cpu)
 {
 	/* vcpu_guest_context_t is too large to allocate on the stack.
 	 * Hence we allocate statically and protect it with a lock */
 	vm_page_t m[4];
 	static vcpu_guest_context_t ctxt;
 	vm_offset_t boot_stack;
 	vm_offset_t newPTD;
 	vm_paddr_t ma[NPGPTD];
 	static int color;
 	int i;
 
 	/*
 	 * Page 0,[0-3]	PTD
 	 * Page 1, [4]	boot stack
 	 * Page [5]	PDPT
 	 *
 	 */
 	for (i = 0; i < NPGPTD + 2; i++) {
 		m[i] = vm_page_alloc(NULL, color++,
 		    VM_ALLOC_NORMAL | VM_ALLOC_NOOBJ | VM_ALLOC_WIRED |
 		    VM_ALLOC_ZERO);
 
 		pmap_zero_page(m[i]);
 
 	}
 	boot_stack = kmem_alloc_nofault(kernel_map, 1);
 	newPTD = kmem_alloc_nofault(kernel_map, NPGPTD);
 	ma[0] = VM_PAGE_TO_MACH(m[0])|PG_V;
 
 #ifdef PAE	
 	pmap_kenter(boot_stack, VM_PAGE_TO_PHYS(m[NPGPTD + 1]));
 	for (i = 0; i < NPGPTD; i++) {
 		((vm_paddr_t *)boot_stack)[i] =
 		ma[i] = VM_PAGE_TO_MACH(m[i])|PG_V;
 	}
 #endif	
 
 	/*
 	 * Copy cpu0 IdlePTD to new IdlePTD - copying only
 	 * kernel mappings
 	 */
 	pmap_qenter(newPTD, m, 4);
 	
 	memcpy((uint8_t *)newPTD + KPTDI*sizeof(vm_paddr_t),
 	    (uint8_t *)PTOV(IdlePTD) + KPTDI*sizeof(vm_paddr_t),
 	    nkpt*sizeof(vm_paddr_t));
 
 	pmap_qremove(newPTD, 4);
 	kmem_free(kernel_map, newPTD, 4);
 	/*
 	 * map actual idle stack to boot_stack
 	 */
 	pmap_kenter(boot_stack, VM_PAGE_TO_PHYS(m[NPGPTD]));
 
 
 	xen_pgdpt_pin(VM_PAGE_TO_MACH(m[NPGPTD + 1]));
 	vm_page_lock_queues();
 	for (i = 0; i < 4; i++) {
 		int pdir = (PTDPTDI + i) / NPDEPG;
 		int curoffset = (PTDPTDI + i) % NPDEPG;
 		
 		xen_queue_pt_update((vm_paddr_t)
 		    ((ma[pdir] & ~PG_V) + (curoffset*sizeof(vm_paddr_t))), 
 		    ma[i]);
 	}
 	PT_UPDATES_FLUSH();
 	vm_page_unlock_queues();
 	
 	memset(&ctxt, 0, sizeof(ctxt));
 	ctxt.flags = VGCF_IN_KERNEL;
 	ctxt.user_regs.ds = GSEL(GDATA_SEL, SEL_KPL);
 	ctxt.user_regs.es = GSEL(GDATA_SEL, SEL_KPL);
 	ctxt.user_regs.fs = GSEL(GPRIV_SEL, SEL_KPL);
 	ctxt.user_regs.gs = GSEL(GDATA_SEL, SEL_KPL);
 	ctxt.user_regs.cs = GSEL(GCODE_SEL, SEL_KPL);
 	ctxt.user_regs.ss = GSEL(GDATA_SEL, SEL_KPL);
 	ctxt.user_regs.eip = (unsigned long)init_secondary;
 	ctxt.user_regs.eflags = PSL_KERNEL | 0x1000; /* IOPL_RING1 */
 
 	memset(&ctxt.fpu_ctxt, 0, sizeof(ctxt.fpu_ctxt));
 
 	smp_trap_init(ctxt.trap_ctxt);
 
 	ctxt.ldt_ents = 0;
 	ctxt.gdt_frames[0] = (uint32_t)((uint64_t)vtomach(bootAPgdt) >> PAGE_SHIFT);
 	ctxt.gdt_ents      = 512;
 
 #ifdef __i386__
 	ctxt.user_regs.esp = boot_stack + PAGE_SIZE;
 
 	ctxt.kernel_ss = GSEL(GDATA_SEL, SEL_KPL);
 	ctxt.kernel_sp = boot_stack + PAGE_SIZE;
 
 	ctxt.event_callback_cs     = GSEL(GCODE_SEL, SEL_KPL);
 	ctxt.event_callback_eip    = (unsigned long)Xhypervisor_callback;
 	ctxt.failsafe_callback_cs  = GSEL(GCODE_SEL, SEL_KPL);
 	ctxt.failsafe_callback_eip = (unsigned long)failsafe_callback;
 
 	ctxt.ctrlreg[3] = VM_PAGE_TO_MACH(m[NPGPTD + 1]);
 #else /* __x86_64__ */
 	ctxt.user_regs.esp = idle->thread.rsp0 - sizeof(struct pt_regs);
 	ctxt.kernel_ss = GSEL(GDATA_SEL, SEL_KPL);
 	ctxt.kernel_sp = idle->thread.rsp0;
 
 	ctxt.event_callback_eip    = (unsigned long)hypervisor_callback;
 	ctxt.failsafe_callback_eip = (unsigned long)failsafe_callback;
 	ctxt.syscall_callback_eip  = (unsigned long)system_call;
 
 	ctxt.ctrlreg[3] = xen_pfn_to_cr3(virt_to_mfn(init_level4_pgt));
 
 	ctxt.gs_base_kernel = (unsigned long)(cpu_pda(cpu));
 #endif
 
 	printf("gdtpfn=%lx pdptpfn=%lx\n",
 	    ctxt.gdt_frames[0],
 	    ctxt.ctrlreg[3] >> PAGE_SHIFT);
 
 	PANIC_IF(HYPERVISOR_vcpu_op(VCPUOP_initialise, cpu, &ctxt));
 	DELAY(3000);
 	PANIC_IF(HYPERVISOR_vcpu_op(VCPUOP_up, cpu, NULL));
 }
 
 /*
  * This function starts the AP (application processor) identified
  * by the APIC ID 'physicalCpu'.  It does quite a "song and dance"
  * to accomplish this.  This is necessary because of the nuances
  * of the different hardware we might encounter.  It isn't pretty,
  * but it seems to work.
  */
 
 int cpus;
 static int
 start_ap(int apic_id)
 {
 	int ms;
 
 	/* used as a watchpoint to signal AP startup */
 	cpus = mp_naps;
 
 	cpu_initialize_context(apic_id);
 	
 	/* Wait up to 5 seconds for it to start. */
 	for (ms = 0; ms < 5000; ms++) {
 		if (mp_naps > cpus)
 			return 1;	/* return SUCCESS */
 		DELAY(1000);
 	}
 	return 0;		/* return FAILURE */
 }
 
 /*
  * send an IPI to a specific CPU.
  */
 static void
 ipi_send_cpu(int cpu, u_int ipi)
 {
 	u_int bitmap, old_pending, new_pending;
 
 	if (IPI_IS_BITMAPED(ipi)) { 
 		bitmap = 1 << ipi;
 		ipi = IPI_BITMAP_VECTOR;
 		do {
 			old_pending = cpu_ipi_pending[cpu];
 			new_pending = old_pending | bitmap;
 		} while  (!atomic_cmpset_int(&cpu_ipi_pending[cpu],
 		    old_pending, new_pending));	
 		if (!old_pending)
 			ipi_pcpu(cpu, RESCHEDULE_VECTOR);
 	} else {
 		KASSERT(call_data != NULL, ("call_data not set"));
 		ipi_pcpu(cpu, CALL_FUNCTION_VECTOR);
 	}
 }
 
 /*
  * Flush the TLB on all other CPU's
  */
 static void
 smp_tlb_shootdown(u_int vector, vm_offset_t addr1, vm_offset_t addr2)
 {
 	u_int ncpu;
 	struct _call_data data;
 
 	ncpu = mp_ncpus - 1;	/* does not shootdown self */
 	if (ncpu < 1)
 		return;		/* no other cpus */
 	if (!(read_eflags() & PSL_I))
 		panic("%s: interrupts disabled", __func__);
 	mtx_lock_spin(&smp_ipi_mtx);
 	KASSERT(call_data == NULL, ("call_data isn't null?!"));
 	call_data = &data;
 	call_data->func_id = vector;
 	call_data->arg1 = addr1;
 	call_data->arg2 = addr2;
 	atomic_store_rel_int(&smp_tlb_wait, 0);
 	ipi_all_but_self(vector);
 	while (smp_tlb_wait < ncpu)
 		ia32_pause();
 	call_data = NULL;
 	mtx_unlock_spin(&smp_ipi_mtx);
 }
 
 static void
-smp_targeted_tlb_shootdown(cpumask_t mask, u_int vector, vm_offset_t addr1, vm_offset_t addr2)
+smp_targeted_tlb_shootdown(cpuset_t mask, u_int vector, vm_offset_t addr1, vm_offset_t addr2)
 {
-	int ncpu, othercpus;
+	int cpu, ncpu, othercpus;
 	struct _call_data data;
 
 	othercpus = mp_ncpus - 1;
-	if (mask == (u_int)-1) {
-		ncpu = othercpus;
-		if (ncpu < 1)
+	if (CPU_ISFULLSET(&mask)) {
+		if (othercpus < 1)
 			return;
 	} else {
-		mask &= ~PCPU_GET(cpumask);
-		if (mask == 0)
+		critical_enter();
+		CPU_NAND(&mask, PCPU_PTR(cpumask));
+		critical_exit();
+		if (CPU_EMPTY(&mask))
 			return;
-		ncpu = bitcount32(mask);
-		if (ncpu > othercpus) {
-			/* XXX this should be a panic offence */
-			printf("SMP: tlb shootdown to %d other cpus (only have %d)\n",
-			    ncpu, othercpus);
-			ncpu = othercpus;
-		}
-		/* XXX should be a panic, implied by mask == 0 above */
-		if (ncpu < 1)
-			return;
 	}
 	if (!(read_eflags() & PSL_I))
 		panic("%s: interrupts disabled", __func__);
 	mtx_lock_spin(&smp_ipi_mtx);
 	KASSERT(call_data == NULL, ("call_data isn't null?!"));
 	call_data = &data;		
 	call_data->func_id = vector;
 	call_data->arg1 = addr1;
 	call_data->arg2 = addr2;
 	atomic_store_rel_int(&smp_tlb_wait, 0);
-	if (mask == (u_int)-1)
+	if (CPU_ISFULLSET(&mask)) {
+		ncpu = othercpus;
 		ipi_all_but_self(vector);
-	else
-		ipi_selected(mask, vector);
+	} else {
+		ncpu = 0;
+		while ((cpu = cpusetobj_ffs(&mask)) != 0) {
+			cpu--;
+			CPU_CLR(cpu, &mask);
+			CTR3(KTR_SMP, "%s: cpu: %d ipi: %x", __func__, cpu,
+			    vector);
+			ipi_send_cpu(cpu, vector);
+			ncpu++;
+		}
+	}
 	while (smp_tlb_wait < ncpu)
 		ia32_pause();
 	call_data = NULL;
 	mtx_unlock_spin(&smp_ipi_mtx);
 }
 
 void
 smp_cache_flush(void)
 {
 
 	if (smp_started)
 		smp_tlb_shootdown(IPI_INVLCACHE, 0, 0);
 }
 
 void
 smp_invltlb(void)
 {
 
 	if (smp_started) {
 		smp_tlb_shootdown(IPI_INVLTLB, 0, 0);
 	}
 }
 
 void
 smp_invlpg(vm_offset_t addr)
 {
 
 	if (smp_started) {
 		smp_tlb_shootdown(IPI_INVLPG, addr, 0);
 	}
 }
 
 void
 smp_invlpg_range(vm_offset_t addr1, vm_offset_t addr2)
 {
 
 	if (smp_started) {
 		smp_tlb_shootdown(IPI_INVLRNG, addr1, addr2);
 	}
 }
 
 void
-smp_masked_invltlb(cpumask_t mask)
+smp_masked_invltlb(cpuset_t mask)
 {
 
 	if (smp_started) {
 		smp_targeted_tlb_shootdown(mask, IPI_INVLTLB, 0, 0);
 	}
 }
 
 void
-smp_masked_invlpg(cpumask_t mask, vm_offset_t addr)
+smp_masked_invlpg(cpuset_t mask, vm_offset_t addr)
 {
 
 	if (smp_started) {
 		smp_targeted_tlb_shootdown(mask, IPI_INVLPG, addr, 0);
 	}
 }
 
 void
-smp_masked_invlpg_range(cpumask_t mask, vm_offset_t addr1, vm_offset_t addr2)
+smp_masked_invlpg_range(cpuset_t mask, vm_offset_t addr1, vm_offset_t addr2)
 {
 
 	if (smp_started) {
 		smp_targeted_tlb_shootdown(mask, IPI_INVLRNG, addr1, addr2);
 	}
 }
 
 /*
  * send an IPI to a set of cpus.
  */
 void
-ipi_selected(cpumask_t cpus, u_int ipi)
+ipi_selected(cpuset_t cpus, u_int ipi)
 {
 	int cpu;
 
 	/*
 	 * IPI_STOP_HARD maps to a NMI and the trap handler needs a bit
 	 * of help in order to understand what is the source.
 	 * Set the mask of receiving CPUs for this purpose.
 	 */
 	if (ipi == IPI_STOP_HARD)
-		atomic_set_int(&ipi_nmi_pending, cpus);
+		CPU_OR_ATOMIC(&ipi_nmi_pending, &cpus);
 
-	while ((cpu = ffs(cpus)) != 0) {
+	while ((cpu = cpusetobj_ffs(&cpus)) != 0) {
 		cpu--;
-		cpus &= ~(1 << cpu);
+		CPU_CLR(cpu, &cpus);
 		CTR3(KTR_SMP, "%s: cpu: %d ipi: %x", __func__, cpu, ipi);
 		ipi_send_cpu(cpu, ipi);
 	}
 }
 
 /*
  * send an IPI to a specific CPU.
  */
 void
 ipi_cpu(int cpu, u_int ipi)
 {
 
 	/*
 	 * IPI_STOP_HARD maps to a NMI and the trap handler needs a bit
 	 * of help in order to understand what is the source.
 	 * Set the mask of receiving CPUs for this purpose.
 	 */
 	if (ipi == IPI_STOP_HARD)
-		atomic_set_int(&ipi_nmi_pending, 1 << cpu);
+		CPU_SET_ATOMIC(cpu, &ipi_nmi_pending);
 
 	CTR3(KTR_SMP, "%s: cpu: %d ipi: %x", __func__, cpu, ipi);
 	ipi_send_cpu(cpu, ipi);
 }
 
 /*
  * send an IPI to all CPUs EXCEPT myself
  */
 void
 ipi_all_but_self(u_int ipi)
 {
+	cpuset_t other_cpus;
 
+	sched_pin();
+	other_cpus = PCPU_GET(other_cpus);
+	sched_unpin();
 	/*
 	 * IPI_STOP_HARD maps to a NMI and the trap handler needs a bit
 	 * of help in order to understand what is the source.
 	 * Set the mask of receiving CPUs for this purpose.
 	 */
 	if (ipi == IPI_STOP_HARD)
-		atomic_set_int(&ipi_nmi_pending, PCPU_GET(other_cpus));
+		CPU_OR_ATOMIC(&ipi_nmi_pending, &other_cpus);
 
 	CTR2(KTR_SMP, "%s: ipi: %x", __func__, ipi);
-	ipi_selected(PCPU_GET(other_cpus), ipi);
+	ipi_selected(other_cpus, ipi);
 }
 
 int
 ipi_nmi_handler()
 {
-	cpumask_t cpumask;
+	cpuset_t cpumask;
 
 	/*
 	 * As long as there is not a simple way to know about a NMI's
 	 * source, if the bitmask for the current CPU is present in
 	 * the global pending bitword an IPI_STOP_HARD has been issued
 	 * and should be handled.
 	 */
+	sched_pin();
 	cpumask = PCPU_GET(cpumask);
-	if ((ipi_nmi_pending & cpumask) == 0)
+	sched_unpin();
+	if (!CPU_OVERLAP(&ipi_nmi_pending, &cpumask))
 		return (1);
 
-	atomic_clear_int(&ipi_nmi_pending, cpumask);
+	CPU_NAND_ATOMIC(&ipi_nmi_pending, &cpumask);
 	cpustop_handler();
 	return (0);
 }
 
 /*
  * Handle an IPI_STOP by saving our current context and spinning until we
  * are resumed.
  */
 void
 cpustop_handler(void)
 {
-	int cpu = PCPU_GET(cpuid);
-	int cpumask = PCPU_GET(cpumask);
+	cpuset_t cpumask;
+	int cpu;
 
+	sched_pin();
+	cpumask = PCPU_GET(cpumask);
+	cpu = PCPU_GET(cpuid);
+	sched_unpin();
+
 	savectx(&stoppcbs[cpu]);
 
 	/* Indicate that we are stopped */
-	atomic_set_int(&stopped_cpus, cpumask);
+	CPU_OR_ATOMIC(&stopped_cpus, &cpumask);
 
 	/* Wait for restart */
-	while (!(started_cpus & cpumask))
+	while (!CPU_OVERLAP(&started_cpus, &cpumask))
 	    ia32_pause();
 
-	atomic_clear_int(&started_cpus, cpumask);
-	atomic_clear_int(&stopped_cpus, cpumask);
+	CPU_NAND_ATOMIC(&started_cpus, &cpumask);
+	CPU_NAND_ATOMIC(&stopped_cpus, &cpumask);
 
 	if (cpu == 0 && cpustop_restartfunc != NULL) {
 		cpustop_restartfunc();
 		cpustop_restartfunc = NULL;
 	}
 }
 
 /*
  * This is called once the rest of the system is up and running and we're
  * ready to let the AP's out of the pen.
  */
 static void
 release_aps(void *dummy __unused)
 {
 
 	if (mp_ncpus == 1) 
 		return;
 	atomic_store_rel_int(&aps_ready, 1);
 	while (smp_started == 0)
 		ia32_pause();
 }
 SYSINIT(start_aps, SI_SUB_SMP, SI_ORDER_FIRST, release_aps, NULL);
 SYSINIT(start_ipis, SI_SUB_INTR, SI_ORDER_ANY, xen_smp_intr_init_cpus, NULL);
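Note on the conversion above: the old code walked a cpumask_t with ffs() and
cleared bits with "mask &= ~(1 << cpu)"; the new code walks a cpuset_t with
cpusetobj_ffs() and CPU_CLR(). The userland sketch below illustrates that
pattern only. It is not the kernel implementation: struct demo_cpuset and
every demo_* function are hypothetical stand-ins invented for this note, and
the 128-CPU limit is an arbitrary assumption.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical stand-in for cpuset_t; every demo_* name is illustrative. */
#define	DEMO_MAXCPU	128
#define	DEMO_WBITS	(sizeof(unsigned long) * CHAR_BIT)
#define	DEMO_NWORDS	((DEMO_MAXCPU + DEMO_WBITS - 1) / DEMO_WBITS)

struct demo_cpuset {
	unsigned long bits[DEMO_NWORDS];
};

/* Mark 'cpu' as a member of the set (the role CPU_SET plays in the diff). */
void
demo_set(int cpu, struct demo_cpuset *s)
{

	assert(cpu >= 0 && cpu < DEMO_MAXCPU);
	s->bits[(size_t)cpu / DEMO_WBITS] |= 1UL << ((size_t)cpu % DEMO_WBITS);
}

/* Remove 'cpu' from the set (the role CPU_CLR plays in the diff). */
void
demo_clr(int cpu, struct demo_cpuset *s)
{

	assert(cpu >= 0 && cpu < DEMO_MAXCPU);
	s->bits[(size_t)cpu / DEMO_WBITS] &= ~(1UL << ((size_t)cpu % DEMO_WBITS));
}

/* Return the 1-based index of the lowest member, or 0 for an empty set. */
int
demo_ffs(const struct demo_cpuset *s)
{
	size_t b, w;

	for (w = 0; w < DEMO_NWORDS; w++) {
		if (s->bits[w] == 0)
			continue;
		for (b = 0; b < DEMO_WBITS; b++)
			if (s->bits[w] & (1UL << b))
				return ((int)(w * DEMO_WBITS + b) + 1);
	}
	return (0);
}

/*
 * Count the CPUs that would be signalled: drop the sender, then drain the
 * set one member at a time.  The set is passed by value, as in the diff,
 * so the caller's copy is left untouched.
 */
int
demo_count_targets(struct demo_cpuset mask, int self)
{
	int cpu, ncpu;

	demo_clr(self, &mask);		/* never target the sender */
	ncpu = 0;
	while ((cpu = demo_ffs(&mask)) != 0) {
		cpu--;			/* the ffs result is 1-based */
		demo_clr(cpu, &mask);
		ncpu++;			/* an IPI would be sent here */
	}
	return (ncpu);
}
```

With CPUs 0, 3 and 70 in the set and CPU 3 as the sender,
demo_count_targets() returns 2. CPU 70 lies beyond a single 32-bit or 64-bit
word, which is the case a plain integer mask could not represent.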
 
Index: head/sys/i386/xen/pmap.c
===================================================================
--- head/sys/i386/xen/pmap.c	(revision 222812)
+++ head/sys/i386/xen/pmap.c	(revision 222813)
@@ -1,4345 +1,4354 @@
 /*-
  * Copyright (c) 1991 Regents of the University of California.
  * All rights reserved.
  * Copyright (c) 1994 John S. Dyson
  * All rights reserved.
  * Copyright (c) 1994 David Greenman
  * All rights reserved.
  * Copyright (c) 2005 Alan L. Cox <alc@cs.rice.edu>
  * All rights reserved.
  *
  * This code is derived from software contributed to Berkeley by
  * the Systems Programming Group of the University of Utah Computer
  * Science Department and William Jolitz of UUNET Technologies Inc.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
  * 1. Redistributions of source code must retain the above copyright
  *    notice, this list of conditions and the following disclaimer.
  * 2. Redistributions in binary form must reproduce the above copyright
  *    notice, this list of conditions and the following disclaimer in the
  *    documentation and/or other materials provided with the distribution.
  * 3. All advertising materials mentioning features or use of this software
  *    must display the following acknowledgement:
  *	This product includes software developed by the University of
  *	California, Berkeley and its contributors.
  * 4. Neither the name of the University nor the names of its contributors
  *    may be used to endorse or promote products derived from this software
  *    without specific prior written permission.
  *
  * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
  * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
  * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
  * ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
  * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
  * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
  * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
  * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
  * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
  * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
  * SUCH DAMAGE.
  *
  *	from:	@(#)pmap.c	7.7 (Berkeley)	5/12/91
  */
 /*-
  * Copyright (c) 2003 Networks Associates Technology, Inc.
  * All rights reserved.
  *
  * This software was developed for the FreeBSD Project by Jake Burkholder,
  * Safeport Network Services, and Network Associates Laboratories, the
  * Security Research Division of Network Associates, Inc. under
  * DARPA/SPAWAR contract N66001-01-C-8035 ("CBOSS"), as part of the DARPA
  * CHATS research program.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
  * 1. Redistributions of source code must retain the above copyright
  *    notice, this list of conditions and the following disclaimer.
  * 2. Redistributions in binary form must reproduce the above copyright
  *    notice, this list of conditions and the following disclaimer in the
  *    documentation and/or other materials provided with the distribution.
  *
  * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
  * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
  * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
  * ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
  * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
  * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
  * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
  * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
  * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
  * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
  * SUCH DAMAGE.
  */
 
 #include <sys/cdefs.h>
 __FBSDID("$FreeBSD$");
 
 /*
  *	Manages physical address maps.
  *
  *	In addition to hardware address maps, this
  *	module is called upon to provide software-use-only
  *	maps which may or may not be stored in the same
  *	form as hardware maps.  These pseudo-maps are
  *	used to store intermediate results from copy
  *	operations to and from address spaces.
  *
  *	Since the information managed by this module is
  *	also stored by the logical address mapping module,
  *	this module may throw away valid virtual-to-physical
  *	mappings at almost any time.  However, invalidations
  *	of virtual-to-physical mappings must be done as
  *	requested.
  *
  *	In order to cope with hardware architectures which
  *	make virtual-to-physical map invalidates expensive,
  *	this module may delay invalidate or reduced protection
  *	operations until such time as they are actually
  *	necessary.  This module is given full information as
  *	to which processors are currently using which maps,
  *	and to when physical maps must be made correct.
  */
 
 #include "opt_cpu.h"
 #include "opt_pmap.h"
 #include "opt_smp.h"
 #include "opt_xbox.h"
 
 #include <sys/param.h>
 #include <sys/systm.h>
 #include <sys/kernel.h>
 #include <sys/ktr.h>
 #include <sys/lock.h>
 #include <sys/malloc.h>
 #include <sys/mman.h>
 #include <sys/msgbuf.h>
 #include <sys/mutex.h>
 #include <sys/proc.h>
 #include <sys/sf_buf.h>
 #include <sys/sx.h>
 #include <sys/vmmeter.h>
 #include <sys/sched.h>
 #include <sys/sysctl.h>
 #ifdef SMP
 #include <sys/smp.h>
 #endif
 
 #include <vm/vm.h>
 #include <vm/vm_param.h>
 #include <vm/vm_kern.h>
 #include <vm/vm_page.h>
 #include <vm/vm_map.h>
 #include <vm/vm_object.h>
 #include <vm/vm_extern.h>
 #include <vm/vm_pageout.h>
 #include <vm/vm_pager.h>
 #include <vm/uma.h>
 
 #include <machine/cpu.h>
 #include <machine/cputypes.h>
 #include <machine/md_var.h>
 #include <machine/pcb.h>
 #include <machine/specialreg.h>
 #ifdef SMP
 #include <machine/smp.h>
 #endif
 
 #ifdef XBOX
 #include <machine/xbox.h>
 #endif
 
 #include <xen/interface/xen.h>
 #include <xen/hypervisor.h>
 #include <machine/xen/hypercall.h>
 #include <machine/xen/xenvar.h>
 #include <machine/xen/xenfunc.h>
 
 #if !defined(CPU_DISABLE_SSE) && defined(I686_CPU)
 #define CPU_ENABLE_SSE
 #endif
 
 #ifndef PMAP_SHPGPERPROC
 #define PMAP_SHPGPERPROC 200
 #endif
 
 #define DIAGNOSTIC
 
 #if !defined(DIAGNOSTIC)
 #ifdef __GNUC_GNU_INLINE__
 #define PMAP_INLINE	__attribute__((__gnu_inline__)) inline
 #else
 #define PMAP_INLINE	extern inline
 #endif
 #else
 #define PMAP_INLINE
 #endif
 
 #define PV_STATS
 #ifdef PV_STATS
 #define PV_STAT(x)	do { x ; } while (0)
 #else
 #define PV_STAT(x)	do { } while (0)
 #endif
 
 #define	pa_index(pa)	((pa) >> PDRSHIFT)
 #define	pa_to_pvh(pa)	(&pv_table[pa_index(pa)])
 
 /*
  * Get PDEs and PTEs for user/kernel address space
  */
 #define	pmap_pde(m, v)	(&((m)->pm_pdir[(vm_offset_t)(v) >> PDRSHIFT]))
 #define pdir_pde(m, v) (m[(vm_offset_t)(v) >> PDRSHIFT])
 
 #define pmap_pde_v(pte)		((*(int *)pte & PG_V) != 0)
 #define pmap_pte_w(pte)		((*(int *)pte & PG_W) != 0)
 #define pmap_pte_m(pte)		((*(int *)pte & PG_M) != 0)
 #define pmap_pte_u(pte)		((*(int *)pte & PG_A) != 0)
 #define pmap_pte_v(pte)		((*(int *)pte & PG_V) != 0)
 
 #define pmap_pte_set_prot(pte, v) ((*(int *)pte &= ~PG_PROT), (*(int *)pte |= (v)))
 
 #define HAMFISTED_LOCKING
 #ifdef HAMFISTED_LOCKING
 static struct mtx createdelete_lock;
 #endif
 
 struct pmap kernel_pmap_store;
 LIST_HEAD(pmaplist, pmap);
 static struct pmaplist allpmaps;
 static struct mtx allpmaps_lock;
 
 vm_offset_t virtual_avail;	/* VA of first avail page (after kernel bss) */
 vm_offset_t virtual_end;	/* VA of last avail page (end of kernel AS) */
 int pgeflag = 0;		/* PG_G or-in */
 int pseflag = 0;		/* PG_PS or-in */
 
 int nkpt;
 vm_offset_t kernel_vm_end;
 extern u_int32_t KERNend;
 
 #ifdef PAE
 pt_entry_t pg_nx;
 #endif
 
 static int pat_works;			/* Is page attribute table sane? */
 
 /*
  * Data for the pv entry allocation mechanism
  */
 static int pv_entry_count = 0, pv_entry_max = 0, pv_entry_high_water = 0;
 static struct md_page *pv_table;
 static int shpgperproc = PMAP_SHPGPERPROC;
 
 struct pv_chunk *pv_chunkbase;		/* KVA block for pv_chunks */
 int pv_maxchunks;			/* How many chunks we have KVA for */
 vm_offset_t pv_vafree;			/* freelist stored in the PTE */
 
 /*
  * All those kernel PT submaps that BSD is so fond of
  */
 struct sysmaps {
 	struct	mtx lock;
 	pt_entry_t *CMAP1;
 	pt_entry_t *CMAP2;
 	caddr_t	CADDR1;
 	caddr_t	CADDR2;
 };
 static struct sysmaps sysmaps_pcpu[MAXCPU];
 static pt_entry_t *CMAP3;
 caddr_t ptvmmap = 0;
 static caddr_t CADDR3;
 struct msgbuf *msgbufp = 0;
 
 /*
  * Crashdump maps.
  */
 static caddr_t crashdumpmap;
 
 static pt_entry_t *PMAP1 = 0, *PMAP2;
 static pt_entry_t *PADDR1 = 0, *PADDR2;
 #ifdef SMP
 static int PMAP1cpu;
 static int PMAP1changedcpu;
 SYSCTL_INT(_debug, OID_AUTO, PMAP1changedcpu, CTLFLAG_RD, 
 	   &PMAP1changedcpu, 0,
 	   "Number of times pmap_pte_quick changed CPU with same PMAP1");
 #endif
 static int PMAP1changed;
 SYSCTL_INT(_debug, OID_AUTO, PMAP1changed, CTLFLAG_RD, 
 	   &PMAP1changed, 0,
 	   "Number of times pmap_pte_quick changed PMAP1");
 static int PMAP1unchanged;
 SYSCTL_INT(_debug, OID_AUTO, PMAP1unchanged, CTLFLAG_RD, 
 	   &PMAP1unchanged, 0,
 	   "Number of times pmap_pte_quick didn't change PMAP1");
 static struct mtx PMAP2mutex;
 
 SYSCTL_NODE(_vm, OID_AUTO, pmap, CTLFLAG_RD, 0, "VM/pmap parameters");
 static int pg_ps_enabled;
 SYSCTL_INT(_vm_pmap, OID_AUTO, pg_ps_enabled, CTLFLAG_RDTUN, &pg_ps_enabled, 0,
     "Are large page mappings enabled?");
 
 SYSCTL_INT(_vm_pmap, OID_AUTO, pv_entry_max, CTLFLAG_RD, &pv_entry_max, 0,
 	"Max number of PV entries");
 SYSCTL_INT(_vm_pmap, OID_AUTO, shpgperproc, CTLFLAG_RD, &shpgperproc, 0,
 	"Page share factor per proc");
 SYSCTL_NODE(_vm_pmap, OID_AUTO, pde, CTLFLAG_RD, 0,
     "2/4MB page mapping counters");
 
 static u_long pmap_pde_mappings;
 SYSCTL_ULONG(_vm_pmap_pde, OID_AUTO, mappings, CTLFLAG_RD,
     &pmap_pde_mappings, 0, "2/4MB page mappings");
 
 static void	free_pv_entry(pmap_t pmap, pv_entry_t pv);
 static pv_entry_t get_pv_entry(pmap_t locked_pmap, int try);
 static void	pmap_pvh_free(struct md_page *pvh, pmap_t pmap, vm_offset_t va);
 static pv_entry_t pmap_pvh_remove(struct md_page *pvh, pmap_t pmap,
 		    vm_offset_t va);
 
 static vm_page_t pmap_enter_quick_locked(multicall_entry_t **mcl, int *count, pmap_t pmap, vm_offset_t va,
     vm_page_t m, vm_prot_t prot, vm_page_t mpte);
 static int pmap_remove_pte(pmap_t pmap, pt_entry_t *ptq, vm_offset_t sva,
     vm_page_t *free);
 static void pmap_remove_page(struct pmap *pmap, vm_offset_t va,
     vm_page_t *free);
 static void pmap_remove_entry(struct pmap *pmap, vm_page_t m,
 					vm_offset_t va);
 static boolean_t pmap_try_insert_pv_entry(pmap_t pmap, vm_offset_t va,
     vm_page_t m);
 
 static vm_page_t pmap_allocpte(pmap_t pmap, vm_offset_t va, int flags);
 
 static vm_page_t _pmap_allocpte(pmap_t pmap, unsigned ptepindex, int flags);
 static int _pmap_unwire_pte_hold(pmap_t pmap, vm_page_t m, vm_page_t *free);
 static pt_entry_t *pmap_pte_quick(pmap_t pmap, vm_offset_t va);
 static void pmap_pte_release(pt_entry_t *pte);
 static int pmap_unuse_pt(pmap_t, vm_offset_t, vm_page_t *);
 static vm_offset_t pmap_kmem_choose(vm_offset_t addr);
 static boolean_t pmap_is_prefaultable_locked(pmap_t pmap, vm_offset_t addr);
 static void pmap_kenter_attr(vm_offset_t va, vm_paddr_t pa, int mode);
 
 static __inline void pagezero(void *page);
 
 CTASSERT(1 << PDESHIFT == sizeof(pd_entry_t));
 CTASSERT(1 << PTESHIFT == sizeof(pt_entry_t));
 
 /*
  * If you get an error here, then you set KVA_PAGES wrong! See the
  * description of KVA_PAGES in sys/i386/include/pmap.h. It must be
  * multiple of 4 for a normal kernel, or a multiple of 8 for a PAE.
  */
 CTASSERT(KERNBASE % (1 << 24) == 0);
 
 
 
 void 
 pd_set(struct pmap *pmap, int ptepindex, vm_paddr_t val, int type)
 {
 	vm_paddr_t pdir_ma = vtomach(&pmap->pm_pdir[ptepindex]);
 	
 	switch (type) {
 	case SH_PD_SET_VA:
 #if 0		
 		xen_queue_pt_update(shadow_pdir_ma,
 				    xpmap_ptom(val & ~(PG_RW)));
 #endif		
 		xen_queue_pt_update(pdir_ma,
 				    xpmap_ptom(val)); 	
 		break;
 	case SH_PD_SET_VA_MA:
 #if 0		
 		xen_queue_pt_update(shadow_pdir_ma,
 				    val & ~(PG_RW));
 #endif		
 		xen_queue_pt_update(pdir_ma, val); 	
 		break;
 	case SH_PD_SET_VA_CLEAR:
 #if 0
 		xen_queue_pt_update(shadow_pdir_ma, 0);
 #endif		
 		xen_queue_pt_update(pdir_ma, 0); 	
 		break;
 	}
 }
 
 /*
  * Move the kernel virtual free pointer to the next
  * 4MB.  This is used to help improve performance
  * by using a large (4MB) page for much of the kernel
  * (.text, .data, .bss)
  */
 static vm_offset_t
 pmap_kmem_choose(vm_offset_t addr)
 {
 	vm_offset_t newaddr = addr;
 
 #ifndef DISABLE_PSE
 	if (cpu_feature & CPUID_PSE)
 		newaddr = (addr + PDRMASK) & ~PDRMASK;
 #endif
 	return newaddr;
 }
 
 /*
  *	Bootstrap the system enough to run with virtual memory.
  *
  *	On the i386 this is called after mapping has already been enabled
  *	and just syncs the pmap module with what has already been done.
  *	[We can't call it easily with mapping off since the kernel is not
  *	mapped with PA == VA, hence we would have to relocate every address
  *	from the linked base (virtual) address "KERNBASE" to the actual
  *	(physical) address starting relative to 0]
  */
 void
 pmap_bootstrap(vm_paddr_t firstaddr)
 {
 	vm_offset_t va;
 	pt_entry_t *pte, *unused;
 	struct sysmaps *sysmaps;
 	int i;
 
 	/*
 	 * XXX The calculation of virtual_avail is wrong. It's NKPT*PAGE_SIZE too
 	 * large. It should instead be correctly calculated in locore.s and
 	 * not based on 'first' (which is a physical address, not a virtual
 	 * address, for the start of unused physical memory). The kernel
 	 * page tables are NOT double mapped and thus should not be included
 	 * in this calculation.
 	 */
 	virtual_avail = (vm_offset_t) KERNBASE + firstaddr;
 	virtual_avail = pmap_kmem_choose(virtual_avail);
 
 	virtual_end = VM_MAX_KERNEL_ADDRESS;
 
 	/*
 	 * Initialize the kernel pmap (which is statically allocated).
 	 */
 	PMAP_LOCK_INIT(kernel_pmap);
 	kernel_pmap->pm_pdir = (pd_entry_t *) (KERNBASE + (u_int)IdlePTD);
 #ifdef PAE
 	kernel_pmap->pm_pdpt = (pdpt_entry_t *) (KERNBASE + (u_int)IdlePDPT);
 #endif
-	kernel_pmap->pm_active = -1;	/* don't allow deactivation */
+	CPU_FILL(&kernel_pmap->pm_active);	/* don't allow deactivation */
 	TAILQ_INIT(&kernel_pmap->pm_pvchunk);
 	LIST_INIT(&allpmaps);
 	mtx_init(&allpmaps_lock, "allpmaps", NULL, MTX_SPIN);
 	mtx_lock_spin(&allpmaps_lock);
 	LIST_INSERT_HEAD(&allpmaps, kernel_pmap, pm_list);
 	mtx_unlock_spin(&allpmaps_lock);
 	if (nkpt == 0)
 		nkpt = NKPT;
 
 	/*
 	 * Reserve some special page table entries/VA space for temporary
 	 * mapping of pages.
 	 */
 #define	SYSMAP(c, p, v, n)	\
 	v = (c)va; va += ((n)*PAGE_SIZE); p = pte; pte += (n);
 
 	va = virtual_avail;
 	pte = vtopte(va);
 
 	/*
 	 * CMAP1/CMAP2 are used for zeroing and copying pages.
 	 * CMAP3 is used for the idle process page zeroing.
 	 */
 	for (i = 0; i < MAXCPU; i++) {
 		sysmaps = &sysmaps_pcpu[i];
 		mtx_init(&sysmaps->lock, "SYSMAPS", NULL, MTX_DEF);
 		SYSMAP(caddr_t, sysmaps->CMAP1, sysmaps->CADDR1, 1)
 		SYSMAP(caddr_t, sysmaps->CMAP2, sysmaps->CADDR2, 1)
 		PT_SET_MA(sysmaps->CADDR1, 0);
 		PT_SET_MA(sysmaps->CADDR2, 0);
 	}
 	SYSMAP(caddr_t, CMAP3, CADDR3, 1)
 	PT_SET_MA(CADDR3, 0);
 
 	/*
 	 * Crashdump maps.
 	 */
 	SYSMAP(caddr_t, unused, crashdumpmap, MAXDUMPPGS)
 
 	/*
 	 * ptvmmap is used for reading arbitrary physical pages via /dev/mem.
 	 */
 	SYSMAP(caddr_t, unused, ptvmmap, 1)
 
 	/*
 	 * msgbufp is used to map the system message buffer.
 	 */
 	SYSMAP(struct msgbuf *, unused, msgbufp, atop(round_page(msgbufsize)))
 
 	/*
 	 * ptemap is used for pmap_pte_quick
 	 */
 	SYSMAP(pt_entry_t *, PMAP1, PADDR1, 1);
 	SYSMAP(pt_entry_t *, PMAP2, PADDR2, 1);
 
 	mtx_init(&PMAP2mutex, "PMAP2", NULL, MTX_DEF);
 
 	virtual_avail = va;
 
 	/*
 	 * Leave in place an identity mapping (virt == phys) for the low 1 MB
 	 * physical memory region that is used by the ACPI wakeup code.  This
 	 * mapping must not have PG_G set. 
 	 */
 #ifndef XEN
 	/*
 	 * leave here deliberately to show that this is not supported
 	 */
 #ifdef XBOX
 	/* FIXME: This is gross, but needed for the XBOX. Since we are at such
 	 * an early stage, we cannot yet neatly map video memory ... :-(
 	 * Better fixes are very welcome! */
 	if (!arch_i386_is_xbox)
 #endif
 	for (i = 1; i < NKPT; i++)
 		PTD[i] = 0;
 
 	/* Initialize the PAT MSR if present. */
 	pmap_init_pat();
 
 	/* Turn on PG_G on kernel page(s) */
 	pmap_set_pg();
 #endif
 
 #ifdef HAMFISTED_LOCKING
 	mtx_init(&createdelete_lock, "pmap create/delete", NULL, MTX_DEF);
 #endif
 }
 
 /*
  * Setup the PAT MSR.
  */
 void
 pmap_init_pat(void)
 {
 	uint64_t pat_msr;
 
 	/* Bail if this CPU doesn't implement PAT. */
 	if (!(cpu_feature & CPUID_PAT))
 		return;
 
 	if (cpu_vendor_id != CPU_VENDOR_INTEL ||
 	    (CPUID_TO_FAMILY(cpu_id) == 6 && CPUID_TO_MODEL(cpu_id) >= 0xe)) {
 		/*
 		 * Leave the indices 0-3 at the default of WB, WT, UC, and UC-.
 		 * Program 4 and 5 as WP and WC.
 		 * Leave 6 and 7 as UC and UC-.
 		 */
 		pat_msr = rdmsr(MSR_PAT);
 		pat_msr &= ~(PAT_MASK(4) | PAT_MASK(5));
 		pat_msr |= PAT_VALUE(4, PAT_WRITE_PROTECTED) |
 		    PAT_VALUE(5, PAT_WRITE_COMBINING);
 		pat_works = 1;
 	} else {
 		/*
 		 * Due to some Intel errata, we can only safely use the lower 4
 		 * PAT entries.  Thus, just replace PAT Index 2 with WC instead
 		 * of UC-.
 		 *
 		 *   Intel Pentium III Processor Specification Update
 		 * Errata E.27 (Upper Four PAT Entries Not Usable With Mode B
 		 * or Mode C Paging)
 		 *
 		 *   Intel Pentium IV  Processor Specification Update
 		 * Errata N46 (PAT Index MSB May Be Calculated Incorrectly)
 		 */
 		pat_msr = rdmsr(MSR_PAT);
 		pat_msr &= ~PAT_MASK(2);
 		pat_msr |= PAT_VALUE(2, PAT_WRITE_COMBINING);
 		pat_works = 0;
 	}
 	wrmsr(MSR_PAT, pat_msr);
 }
 
 /*
  * Initialize a vm_page's machine-dependent fields.
  */
 void
 pmap_page_init(vm_page_t m)
 {
 
 	TAILQ_INIT(&m->md.pv_list);
 	m->md.pat_mode = PAT_WRITE_BACK;
 }
 
 /*
  * Abuse the pte nodes for unmapped kva to thread a kva freelist through.
  * Requirements:
  *  - Must deal with pages in order to ensure that none of the PG_* bits
  *    are ever set, PG_V in particular.
  *  - Assumes we can write to ptes without pte_store() atomic ops, even
  *    on PAE systems.  This should be ok.
  *  - Assumes nothing will ever test these addresses for 0 to indicate
  *    no mapping instead of correctly checking PG_V.
  *  - Assumes a vm_offset_t will fit in a pte (true for i386).
  * Because PG_V is never set, there can be no mappings to invalidate.
  */
 static int ptelist_count = 0;
 static vm_offset_t
 pmap_ptelist_alloc(vm_offset_t *head)
 {
 	vm_offset_t va;
 	vm_offset_t *phead = (vm_offset_t *)*head;
 	
 	if (ptelist_count == 0) {
 		printf("out of memory!!!!!!\n");
 		return (0);	/* Out of memory */
 	}
 	ptelist_count--;
 	va = phead[ptelist_count];
 	return (va);
 }
 
 static void
 pmap_ptelist_free(vm_offset_t *head, vm_offset_t va)
 {
 	vm_offset_t *phead = (vm_offset_t *)*head;
 
 	phead[ptelist_count++] = va;
 }
 
 static void
 pmap_ptelist_init(vm_offset_t *head, void *base, int npages)
 {
 	int i, nstackpages;
 	vm_offset_t va;
 	vm_page_t m;
 	
 	nstackpages = (npages + PAGE_SIZE/sizeof(vm_offset_t) - 1)/ (PAGE_SIZE/sizeof(vm_offset_t));
 	for (i = 0; i < nstackpages; i++) {
 		va = (vm_offset_t)base + i * PAGE_SIZE;
 		m = vm_page_alloc(NULL, i,
 		    VM_ALLOC_NORMAL | VM_ALLOC_NOOBJ | VM_ALLOC_WIRED |
 		    VM_ALLOC_ZERO);
 		pmap_qenter(va, &m, 1);
 	}
 
 	*head = (vm_offset_t)base;
 	for (i = npages - 1; i >= nstackpages; i--) {
 		va = (vm_offset_t)base + i * PAGE_SIZE;
 		pmap_ptelist_free(head, va);
 	}
 }
 
 
 /*
  *	Initialize the pmap module.
  *	Called by vm_init, to initialize any structures that the pmap
  *	system needs to map virtual memory.
  */
 void
 pmap_init(void)
 {
 	vm_page_t mpte;
 	vm_size_t s;
 	int i, pv_npg;
 
 	/*
 	 * Initialize the vm page array entries for the kernel pmap's
 	 * page table pages.
 	 */ 
 	for (i = 0; i < nkpt; i++) {
 		mpte = PHYS_TO_VM_PAGE(xpmap_mtop(PTD[i + KPTDI] & PG_FRAME));
 		KASSERT(mpte >= vm_page_array &&
 		    mpte < &vm_page_array[vm_page_array_size],
 		    ("pmap_init: page table page is out of range"));
 		mpte->pindex = i + KPTDI;
 		mpte->phys_addr = xpmap_mtop(PTD[i + KPTDI] & PG_FRAME);
 	}
 
 	/*
 	 * Initialize the address space (zone) for the pv entries.  Set a
 	 * high water mark so that the system can recover from excessive
 	 * numbers of pv entries.
 	 */
 	TUNABLE_INT_FETCH("vm.pmap.shpgperproc", &shpgperproc);
 	pv_entry_max = shpgperproc * maxproc + cnt.v_page_count;
 	TUNABLE_INT_FETCH("vm.pmap.pv_entries", &pv_entry_max);
 	pv_entry_max = roundup(pv_entry_max, _NPCPV);
 	pv_entry_high_water = 9 * (pv_entry_max / 10);
 
 	/*
 	 * Are large page mappings enabled?
 	 */
 	TUNABLE_INT_FETCH("vm.pmap.pg_ps_enabled", &pg_ps_enabled);
 
 	/*
 	 * Calculate the size of the pv head table for superpages.
 	 */
 	for (i = 0; phys_avail[i + 1]; i += 2);
 	pv_npg = round_4mpage(phys_avail[(i - 2) + 1]) / NBPDR;
 
 	/*
 	 * Allocate memory for the pv head table for superpages.
 	 */
 	s = (vm_size_t)(pv_npg * sizeof(struct md_page));
 	s = round_page(s);
 	pv_table = (struct md_page *)kmem_alloc(kernel_map, s);
 	for (i = 0; i < pv_npg; i++)
 		TAILQ_INIT(&pv_table[i].pv_list);
 
 	pv_maxchunks = MAX(pv_entry_max / _NPCPV, maxproc);
 	pv_chunkbase = (struct pv_chunk *)kmem_alloc_nofault(kernel_map,
 	    PAGE_SIZE * pv_maxchunks);
 	if (pv_chunkbase == NULL)
 		panic("pmap_init: not enough kvm for pv chunks");
 	pmap_ptelist_init(&pv_vafree, pv_chunkbase, pv_maxchunks);
 }
 
 
 /***************************************************
  * Low level helper routines.....
  ***************************************************/
 
 /*
  * Determine the appropriate bits to set in a PTE or PDE for a specified
  * caching mode.
  */
 int
 pmap_cache_bits(int mode, boolean_t is_pde)
 {
 	int pat_flag, pat_index, cache_bits;
 
 	/* The PAT bit is different for PTE's and PDE's. */
 	pat_flag = is_pde ? PG_PDE_PAT : PG_PTE_PAT;
 
 	/* If we don't support PAT, map extended modes to older ones. */
 	if (!(cpu_feature & CPUID_PAT)) {
 		switch (mode) {
 		case PAT_UNCACHEABLE:
 		case PAT_WRITE_THROUGH:
 		case PAT_WRITE_BACK:
 			break;
 		case PAT_UNCACHED:
 		case PAT_WRITE_COMBINING:
 		case PAT_WRITE_PROTECTED:
 			mode = PAT_UNCACHEABLE;
 			break;
 		}
 	}
 	
 	/* Map the caching mode to a PAT index. */
 	if (pat_works) {
 		switch (mode) {
 			case PAT_UNCACHEABLE:
 				pat_index = 3;
 				break;
 			case PAT_WRITE_THROUGH:
 				pat_index = 1;
 				break;
 			case PAT_WRITE_BACK:
 				pat_index = 0;
 				break;
 			case PAT_UNCACHED:
 				pat_index = 2;
 				break;
 			case PAT_WRITE_COMBINING:
 				pat_index = 5;
 				break;
 			case PAT_WRITE_PROTECTED:
 				pat_index = 4;
 				break;
 			default:
 				panic("Unknown caching mode %d\n", mode);
 		}
 	} else {
 		switch (mode) {
 			case PAT_UNCACHED:
 			case PAT_UNCACHEABLE:
 			case PAT_WRITE_PROTECTED:
 				pat_index = 3;
 				break;
 			case PAT_WRITE_THROUGH:
 				pat_index = 1;
 				break;
 			case PAT_WRITE_BACK:
 				pat_index = 0;
 				break;
 			case PAT_WRITE_COMBINING:
 				pat_index = 2;
 				break;
 			default:
 				panic("Unknown caching mode %d\n", mode);
 		}
 	}	
 
 	/* Map the 3-bit index value into the PAT, PCD, and PWT bits. */
 	cache_bits = 0;
 	if (pat_index & 0x4)
 		cache_bits |= pat_flag;
 	if (pat_index & 0x2)
 		cache_bits |= PG_NC_PCD;
 	if (pat_index & 0x1)
 		cache_bits |= PG_NC_PWT;
 	return (cache_bits);
 }
 #ifdef SMP
 /*
  * For SMP, these functions have to use the IPI mechanism for coherence.
  *
  * N.B.: Before calling any of the following TLB invalidation functions,
  * the calling processor must ensure that all stores updating a non-
  * kernel page table are globally performed.  Otherwise, another
  * processor could cache an old, pre-update entry without being
  * invalidated.  This can happen one of two ways: (1) The pmap becomes
  * active on another processor after its pm_active field is checked by
  * one of the following functions but before a store updating the page
  * table is globally performed. (2) The pmap becomes active on another
  * processor before its pm_active field is checked but due to
  * speculative loads one of the following functions still reads the
  * pmap as inactive on the other processor.
  * 
  * The kernel page table is exempt because its pm_active field is
  * immutable.  The kernel page table is always active on every
  * processor.
  */
 void
 pmap_invalidate_page(pmap_t pmap, vm_offset_t va)
 {
-	cpumask_t cpumask, other_cpus;
+	cpuset_t cpumask, other_cpus;
 
 	CTR2(KTR_PMAP, "pmap_invalidate_page: pmap=%p va=0x%x",
 	    pmap, va);
 	
 	sched_pin();
-	if (pmap == kernel_pmap || pmap->pm_active == all_cpus) {
+	if (pmap == kernel_pmap || !CPU_CMP(&pmap->pm_active, &all_cpus)) {
 		invlpg(va);
 		smp_invlpg(va);
 	} else {
 		cpumask = PCPU_GET(cpumask);
 		other_cpus = PCPU_GET(other_cpus);
-		if (pmap->pm_active & cpumask)
+		if (CPU_OVERLAP(&pmap->pm_active, &cpumask))
 			invlpg(va);
-		if (pmap->pm_active & other_cpus)
-			smp_masked_invlpg(pmap->pm_active & other_cpus, va);
+		CPU_AND(&other_cpus, &pmap->pm_active);
+		if (!CPU_EMPTY(&other_cpus))
+			smp_masked_invlpg(other_cpus, va);
 	}
 	sched_unpin();
 	PT_UPDATES_FLUSH();
 }
 
 void
 pmap_invalidate_range(pmap_t pmap, vm_offset_t sva, vm_offset_t eva)
 {
-	cpumask_t cpumask, other_cpus;
+	cpuset_t cpumask, other_cpus;
 	vm_offset_t addr;
 
 	CTR3(KTR_PMAP, "pmap_invalidate_range: pmap=%p sva=0x%x eva=0x%x",
 	    pmap, sva, eva);
 
 	sched_pin();
-	if (pmap == kernel_pmap || pmap->pm_active == all_cpus) {
+	if (pmap == kernel_pmap || !CPU_CMP(&pmap->pm_active, &all_cpus)) {
 		for (addr = sva; addr < eva; addr += PAGE_SIZE)
 			invlpg(addr);
 		smp_invlpg_range(sva, eva);
 	} else {
 		cpumask = PCPU_GET(cpumask);
 		other_cpus = PCPU_GET(other_cpus);
-		if (pmap->pm_active & cpumask)
+		if (CPU_OVERLAP(&pmap->pm_active, &cpumask))
 			for (addr = sva; addr < eva; addr += PAGE_SIZE)
 				invlpg(addr);
-		if (pmap->pm_active & other_cpus)
-			smp_masked_invlpg_range(pmap->pm_active & other_cpus,
-			    sva, eva);
+		CPU_AND(&other_cpus, &pmap->pm_active);
+		if (!CPU_EMPTY(&other_cpus))
+			smp_masked_invlpg_range(other_cpus, sva, eva);
 	}
 	sched_unpin();
 	PT_UPDATES_FLUSH();
 }
 
 void
 pmap_invalidate_all(pmap_t pmap)
 {
-	cpumask_t cpumask, other_cpus;
+	cpuset_t cpumask, other_cpus;
 
 	CTR1(KTR_PMAP, "pmap_invalidate_all: pmap=%p", pmap);
 
 	sched_pin();
-	if (pmap == kernel_pmap || pmap->pm_active == all_cpus) {
+	if (pmap == kernel_pmap || !CPU_CMP(&pmap->pm_active, &all_cpus)) {
 		invltlb();
 		smp_invltlb();
 	} else {
 		cpumask = PCPU_GET(cpumask);
 		other_cpus = PCPU_GET(other_cpus);
-		if (pmap->pm_active & cpumask)
+		if (CPU_OVERLAP(&pmap->pm_active, &cpumask))
 			invltlb();
-		if (pmap->pm_active & other_cpus)
-			smp_masked_invltlb(pmap->pm_active & other_cpus);
+		CPU_AND(&other_cpus, &pmap->pm_active);
+		if (!CPU_EMPTY(&other_cpus))
+			smp_masked_invltlb(other_cpus);
 	}
 	sched_unpin();
 }
 
 void
 pmap_invalidate_cache(void)
 {
 
 	sched_pin();
 	wbinvd();
 	smp_cache_flush();
 	sched_unpin();
 }
 #else /* !SMP */
 /*
  * Normal, non-SMP, 486+ invalidation functions.
  * We inline these within pmap.c for speed.
  */
 PMAP_INLINE void
 pmap_invalidate_page(pmap_t pmap, vm_offset_t va)
 {
 	CTR2(KTR_PMAP, "pmap_invalidate_page: pmap=%p va=0x%x",
 	    pmap, va);
 
-	if (pmap == kernel_pmap || pmap->pm_active)
+	if (pmap == kernel_pmap || !CPU_EMPTY(&pmap->pm_active))
 		invlpg(va);
 	PT_UPDATES_FLUSH();
 }
 
 PMAP_INLINE void
 pmap_invalidate_range(pmap_t pmap, vm_offset_t sva, vm_offset_t eva)
 {
 	vm_offset_t addr;
 
 	if (eva - sva > PAGE_SIZE)
 		CTR3(KTR_PMAP, "pmap_invalidate_range: pmap=%p sva=0x%x eva=0x%x",
 		    pmap, sva, eva);
 
-	if (pmap == kernel_pmap || pmap->pm_active)
+	if (pmap == kernel_pmap || !CPU_EMPTY(&pmap->pm_active))
 		for (addr = sva; addr < eva; addr += PAGE_SIZE)
 			invlpg(addr);
 	PT_UPDATES_FLUSH();
 }
 
 PMAP_INLINE void
 pmap_invalidate_all(pmap_t pmap)
 {
 
 	CTR1(KTR_PMAP, "pmap_invalidate_all: pmap=%p", pmap);
 	
-	if (pmap == kernel_pmap || pmap->pm_active)
+	if (pmap == kernel_pmap || !CPU_EMPTY(&pmap->pm_active))
 		invltlb();
 }
 
 PMAP_INLINE void
 pmap_invalidate_cache(void)
 {
 
 	wbinvd();
 }
 #endif /* !SMP */
 
 void
 pmap_invalidate_cache_range(vm_offset_t sva, vm_offset_t eva)
 {
 
 	KASSERT((sva & PAGE_MASK) == 0,
 	    ("pmap_invalidate_cache_range: sva not page-aligned"));
 	KASSERT((eva & PAGE_MASK) == 0,
 	    ("pmap_invalidate_cache_range: eva not page-aligned"));
 
 	if (cpu_feature & CPUID_SS)
 		; /* If "Self Snoop" is supported, do nothing. */
 	else if (cpu_feature & CPUID_CLFSH) {
 
 		/*
 		 * Otherwise, do per-cache line flush.  Use the mfence
 		 * instruction to ensure that previous stores are
 		 * included in the write-back.  The processor
 		 * propagates flush to other processors in the cache
 		 * coherence domain.
 		 */
 		mfence();
 		for (; sva < eva; sva += cpu_clflush_line_size)
 			clflush(sva);
 		mfence();
 	} else {
 
 		/*
 		 * No targeted cache flush methods are supported by the CPU;
 		 * globally invalidate the cache as a last resort.
 		 */
 		pmap_invalidate_cache();
 	}
 }
 
 /*
  * Are we current address space or kernel?  N.B. We return FALSE when
  * a pmap's page table is in use because a kernel thread is borrowing
  * it.  The borrowed page table can change spontaneously, making any
  * dependence on its continued use subject to a race condition.
  */
 static __inline int
 pmap_is_current(pmap_t pmap)
 {
 
 	return (pmap == kernel_pmap ||
 	    (pmap == vmspace_pmap(curthread->td_proc->p_vmspace) &&
 		(pmap->pm_pdir[PTDPTDI] & PG_FRAME) == (PTDpde[0] & PG_FRAME)));
 }
 
 /*
  * If the given pmap is not the current or kernel pmap, the returned pte must
  * be released by passing it to pmap_pte_release().
  */
 pt_entry_t *
 pmap_pte(pmap_t pmap, vm_offset_t va)
 {
 	pd_entry_t newpf;
 	pd_entry_t *pde;
 
 	pde = pmap_pde(pmap, va);
 	if (*pde & PG_PS)
 		return (pde);
 	if (*pde != 0) {
 		/* are we current address space or kernel? */
 		if (pmap_is_current(pmap))
 			return (vtopte(va));
 		mtx_lock(&PMAP2mutex);
 		newpf = *pde & PG_FRAME;
 		if ((*PMAP2 & PG_FRAME) != newpf) {
 			vm_page_lock_queues();
 			PT_SET_MA(PADDR2, newpf | PG_V | PG_A | PG_M);
 			vm_page_unlock_queues();
 			CTR3(KTR_PMAP, "pmap_pte: pmap=%p va=0x%x newpte=0x%08x",
 			    pmap, va, (*PMAP2 & 0xffffffff));
 		}
 		
 		return (PADDR2 + (i386_btop(va) & (NPTEPG - 1)));
 	}
 	return (0);
 }
 
 /*
  * Releases a pte that was obtained from pmap_pte().  Be prepared for the pte
  * being NULL.
  */
 static __inline void
 pmap_pte_release(pt_entry_t *pte)
 {
 
 	if ((pt_entry_t *)((vm_offset_t)pte & ~PAGE_MASK) == PADDR2) {
 		CTR1(KTR_PMAP, "pmap_pte_release: pte=0x%jx",
 		    *PMAP2);
 		vm_page_lock_queues();
 		PT_SET_VA(PMAP2, 0, TRUE);
 		vm_page_unlock_queues();
 		mtx_unlock(&PMAP2mutex);
 	}
 }
 
 static __inline void
 invlcaddr(void *caddr)
 {
 
 	invlpg((u_int)caddr);
 	PT_UPDATES_FLUSH();
 }
 
 /*
  * Super fast pmap_pte routine best used when scanning
  * the pv lists.  This eliminates many coarse-grained
  * invltlb calls.  Note that many of the pv list
  * scans are across different pmaps.  It is very wasteful
  * to do an entire invltlb for checking a single mapping.
  *
  * If the given pmap is not the current pmap, vm_page_queue_mtx
  * must be held and curthread pinned to a CPU.
  */
 static pt_entry_t *
 pmap_pte_quick(pmap_t pmap, vm_offset_t va)
 {
 	pd_entry_t newpf;
 	pd_entry_t *pde;
 
 	pde = pmap_pde(pmap, va);
 	if (*pde & PG_PS)
 		return (pde);
 	if (*pde != 0) {
 		/* are we current address space or kernel? */
 		if (pmap_is_current(pmap))
 			return (vtopte(va));
 		mtx_assert(&vm_page_queue_mtx, MA_OWNED);
 		KASSERT(curthread->td_pinned > 0, ("curthread not pinned"));
 		newpf = *pde & PG_FRAME;
 		if ((*PMAP1 & PG_FRAME) != newpf) {
 			PT_SET_MA(PADDR1, newpf | PG_V | PG_A | PG_M);
 			CTR3(KTR_PMAP, "pmap_pte_quick: pmap=%p va=0x%x newpte=0x%08x",
 			    pmap, va, (u_long)*PMAP1);
 			
 #ifdef SMP
 			PMAP1cpu = PCPU_GET(cpuid);
 #endif
 			PMAP1changed++;
 		} else
 #ifdef SMP
 		if (PMAP1cpu != PCPU_GET(cpuid)) {
 			PMAP1cpu = PCPU_GET(cpuid);
 			invlcaddr(PADDR1);
 			PMAP1changedcpu++;
 		} else
 #endif
 			PMAP1unchanged++;
 		return (PADDR1 + (i386_btop(va) & (NPTEPG - 1)));
 	}
 	return (0);
 }
 
 /*
  *	Routine:	pmap_extract
  *	Function:
  *		Extract the physical page address associated
  *		with the given map/virtual_address pair.
  */
 vm_paddr_t 
 pmap_extract(pmap_t pmap, vm_offset_t va)
 {
 	vm_paddr_t rtval;
 	pt_entry_t *pte;
 	pd_entry_t pde;
 	pt_entry_t pteval;
 	
 	rtval = 0;
 	PMAP_LOCK(pmap);
 	pde = pmap->pm_pdir[va >> PDRSHIFT];
 	if (pde != 0) {
 		if ((pde & PG_PS) != 0) {
 			rtval = xpmap_mtop(pde & PG_PS_FRAME) | (va & PDRMASK);
 			PMAP_UNLOCK(pmap);
 			return rtval;
 		}
 		pte = pmap_pte(pmap, va);
 		pteval = *pte ? xpmap_mtop(*pte) : 0;
 		rtval = (pteval & PG_FRAME) | (va & PAGE_MASK);
 		pmap_pte_release(pte);
 	}
 	PMAP_UNLOCK(pmap);
 	return (rtval);
 }
 
 /*
  *	Routine:	pmap_extract_ma
  *	Function:
  *		Like pmap_extract, but returns machine address
  */
 vm_paddr_t 
 pmap_extract_ma(pmap_t pmap, vm_offset_t va)
 {
 	vm_paddr_t rtval;
 	pt_entry_t *pte;
 	pd_entry_t pde;
 
 	rtval = 0;
 	PMAP_LOCK(pmap);
 	pde = pmap->pm_pdir[va >> PDRSHIFT];
 	if (pde != 0) {
 		if ((pde & PG_PS) != 0) {
 			rtval = (pde & ~PDRMASK) | (va & PDRMASK);
 			PMAP_UNLOCK(pmap);
 			return rtval;
 		}
 		pte = pmap_pte(pmap, va);
 		rtval = (*pte & PG_FRAME) | (va & PAGE_MASK);
 		pmap_pte_release(pte);
 	}
 	PMAP_UNLOCK(pmap);
 	return (rtval);
 }
 
 /*
  *	Routine:	pmap_extract_and_hold
  *	Function:
  *		Atomically extract and hold the physical page
  *		with the given pmap and virtual address pair
  *		if that mapping permits the given protection.
  */
 vm_page_t
 pmap_extract_and_hold(pmap_t pmap, vm_offset_t va, vm_prot_t prot)
 {
 	pd_entry_t pde;
 	pt_entry_t pte;
 	vm_page_t m;
 	vm_paddr_t pa;
 
 	pa = 0;
 	m = NULL;
 	PMAP_LOCK(pmap);
 retry:
 	pde = PT_GET(pmap_pde(pmap, va));
 	if (pde != 0) {
 		if (pde & PG_PS) {
 			if ((pde & PG_RW) || (prot & VM_PROT_WRITE) == 0) {
 				if (vm_page_pa_tryrelock(pmap, (pde & PG_PS_FRAME) |
 				       (va & PDRMASK), &pa))
 					goto retry;
 				m = PHYS_TO_VM_PAGE((pde & PG_PS_FRAME) |
 				    (va & PDRMASK));
 				vm_page_hold(m);
 			}
 		} else {
 			sched_pin();
 			pte = PT_GET(pmap_pte_quick(pmap, va));
 			if (*PMAP1)
 				PT_SET_MA(PADDR1, 0);
 			if ((pte & PG_V) &&
 			    ((pte & PG_RW) || (prot & VM_PROT_WRITE) == 0)) {
 				if (vm_page_pa_tryrelock(pmap, pte & PG_FRAME, &pa))
 					goto retry;
 				m = PHYS_TO_VM_PAGE(pte & PG_FRAME);
 				vm_page_hold(m);
 			}
 			sched_unpin();
 		}
 	}
 	PA_UNLOCK_COND(pa);
 	PMAP_UNLOCK(pmap);
 	return (m);
 }
 
 /***************************************************
  * Low level mapping routines.....
  ***************************************************/
 
 /*
  * Add a wired page to the kva.
  * Note: not SMP coherent.
  */
 void 
 pmap_kenter(vm_offset_t va, vm_paddr_t pa)
 {
 	PT_SET_MA(va, xpmap_ptom(pa)| PG_RW | PG_V | pgeflag);
 }
 
 void 
 pmap_kenter_ma(vm_offset_t va, vm_paddr_t ma)
 {
 	pt_entry_t *pte;
 
 	pte = vtopte(va);
 	pte_store_ma(pte, ma | PG_RW | PG_V | pgeflag);
 }
 
 
 static __inline void 
 pmap_kenter_attr(vm_offset_t va, vm_paddr_t pa, int mode)
 {
 	PT_SET_MA(va, pa | PG_RW | PG_V | pgeflag | pmap_cache_bits(mode, 0));
 }
 
 /*
  * Remove a page from the kernel pagetables.
  * Note: not SMP coherent.
  */
 PMAP_INLINE void
 pmap_kremove(vm_offset_t va)
 {
 	pt_entry_t *pte;
 
 	pte = vtopte(va);
 	PT_CLEAR_VA(pte, FALSE);
 }
 
 /*
  *	Used to map a range of physical addresses into kernel
  *	virtual address space.
  *
  *	The value passed in '*virt' is a suggested virtual address for
  *	the mapping. Architectures which can support a direct-mapped
  *	physical to virtual region can return the appropriate address
  *	within that region, leaving '*virt' unchanged. Other
  *	architectures should map the pages starting at '*virt' and
  *	update '*virt' with the first usable address after the mapped
  *	region.
  */
 vm_offset_t
 pmap_map(vm_offset_t *virt, vm_paddr_t start, vm_paddr_t end, int prot)
 {
 	vm_offset_t va, sva;
 
 	va = sva = *virt;
 	CTR4(KTR_PMAP, "pmap_map: va=0x%x start=0x%jx end=0x%jx prot=0x%x",
 	    va, start, end, prot);
 	while (start < end) {
 		pmap_kenter(va, start);
 		va += PAGE_SIZE;
 		start += PAGE_SIZE;
 	}
 	pmap_invalidate_range(kernel_pmap, sva, va);
 	*virt = va;
 	return (sva);
 }
 
 
 /*
  * Add a list of wired pages to the kva.
  * This routine is only used for temporary
  * kernel mappings that do not need to have
  * page modification or references recorded.
  * Note that old mappings are simply written
  * over.  The page *must* be wired.
  * Note: SMP coherent.  Uses a ranged shootdown IPI.
  */
 void
 pmap_qenter(vm_offset_t sva, vm_page_t *ma, int count)
 {
 	pt_entry_t *endpte, *pte;
 	vm_paddr_t pa;
 	vm_offset_t va = sva;
 	int mclcount = 0;
 	multicall_entry_t mcl[16];
 	multicall_entry_t *mclp = mcl;
 	int error;
 
 	CTR2(KTR_PMAP, "pmap_qenter:sva=0x%x count=%d", va, count);
 	pte = vtopte(sva);
 	endpte = pte + count;
 	while (pte < endpte) {
 		pa = VM_PAGE_TO_MACH(*ma) | pgeflag | PG_RW | PG_V | PG_M | PG_A;
 
 		mclp->op = __HYPERVISOR_update_va_mapping;
 		mclp->args[0] = va;
 		mclp->args[1] = (uint32_t)(pa & 0xffffffff);
 		mclp->args[2] = (uint32_t)(pa >> 32);
 		mclp->args[3] = (*pte & PG_V) ? UVMF_INVLPG|UVMF_ALL : 0;
 	
 		va += PAGE_SIZE;
 		pte++;
 		ma++;
 		mclp++;
 		mclcount++;
 		if (mclcount == 16) {
 			error = HYPERVISOR_multicall(mcl, mclcount);
 			mclp = mcl;
 			mclcount = 0;
 			KASSERT(error == 0, ("bad multicall %d", error));
 		}		
 	}
 	if (mclcount) {
 		error = HYPERVISOR_multicall(mcl, mclcount);
 		KASSERT(error == 0, ("bad multicall %d", error));
 	}
 	
 #ifdef INVARIANTS
 	for (pte = vtopte(sva), mclcount = 0; mclcount < count; mclcount++, pte++)
 		KASSERT(*pte, ("pte not set for va=0x%x", sva + mclcount*PAGE_SIZE));
 #endif	
 }
 
 
 /*
  * This routine tears out page mappings from the
  * kernel -- it is meant only for temporary mappings.
  * Note: SMP coherent.  Uses a ranged shootdown IPI.
  */
 void
 pmap_qremove(vm_offset_t sva, int count)
 {
 	vm_offset_t va;
 
 	CTR2(KTR_PMAP, "pmap_qremove: sva=0x%x count=%d", sva, count);
 	va = sva;
 	vm_page_lock_queues();
 	critical_enter();
 	while (count-- > 0) {
 		pmap_kremove(va);
 		va += PAGE_SIZE;
 	}
 	PT_UPDATES_FLUSH();
 	pmap_invalidate_range(kernel_pmap, sva, va);
 	critical_exit();
 	vm_page_unlock_queues();
 }
 
 /***************************************************
  * Page table page management routines.....
  ***************************************************/
 static __inline void
 pmap_free_zero_pages(vm_page_t free)
 {
 	vm_page_t m;
 
 	while (free != NULL) {
 		m = free;
 		free = m->right;
 		vm_page_free_zero(m);
 	}
 }
 
 /*
  * This routine unholds page table pages, and if the hold count
  * drops to zero, then it decrements the wire count.
  */
 static __inline int
 pmap_unwire_pte_hold(pmap_t pmap, vm_page_t m, vm_page_t *free)
 {
 
 	--m->wire_count;
 	if (m->wire_count == 0)
 		return _pmap_unwire_pte_hold(pmap, m, free);
 	else
 		return 0;
 }
 
 static int 
 _pmap_unwire_pte_hold(pmap_t pmap, vm_page_t m, vm_page_t *free)
 {
 	vm_offset_t pteva;
 
 	PT_UPDATES_FLUSH();
 	/*
 	 * unmap the page table page
 	 */
 	xen_pt_unpin(pmap->pm_pdir[m->pindex]);
 	/*
 	 * page *might* contain residual mapping :-/  
 	 */
 	PD_CLEAR_VA(pmap, m->pindex, TRUE);
 	pmap_zero_page(m);
 	--pmap->pm_stats.resident_count;
 
 	/*
 	 * This is a release store so that the ordinary store unmapping
 	 * the page table page is globally performed before TLB shoot-
 	 * down is begun.
 	 */
 	atomic_subtract_rel_int(&cnt.v_wire_count, 1);
 
 	/*
 	 * Invalidate the TLB entry for the page table page's mapping
 	 * so that the unmapping takes effect immediately.
 	 */
 	pteva = VM_MAXUSER_ADDRESS + i386_ptob(m->pindex);
 	pmap_invalidate_page(pmap, pteva);
 
 	/* 
 	 * Put page on a list so that it is released after
 	 * *ALL* TLB shootdown is done
 	 */
 	m->right = *free;
 	*free = m;
 
 	return 1;
 }
 
 /*
  * After removing a page table entry, this routine is used to
  * conditionally free the page, and manage the hold/wire counts.
  */
 static int
 pmap_unuse_pt(pmap_t pmap, vm_offset_t va, vm_page_t *free)
 {
 	pd_entry_t ptepde;
 	vm_page_t mpte;
 
 	if (va >= VM_MAXUSER_ADDRESS)
 		return 0;
 	ptepde = PT_GET(pmap_pde(pmap, va));
 	mpte = PHYS_TO_VM_PAGE(ptepde & PG_FRAME);
 	return pmap_unwire_pte_hold(pmap, mpte, free);
 }
 
 void
 pmap_pinit0(pmap_t pmap)
 {
 
 	PMAP_LOCK_INIT(pmap);
 	pmap->pm_pdir = (pd_entry_t *)(KERNBASE + (vm_offset_t)IdlePTD);
 #ifdef PAE
 	pmap->pm_pdpt = (pdpt_entry_t *)(KERNBASE + (vm_offset_t)IdlePDPT);
 #endif
-	pmap->pm_active = 0;
+	CPU_ZERO(&pmap->pm_active);
 	PCPU_SET(curpmap, pmap);
 	TAILQ_INIT(&pmap->pm_pvchunk);
 	bzero(&pmap->pm_stats, sizeof pmap->pm_stats);
 	mtx_lock_spin(&allpmaps_lock);
 	LIST_INSERT_HEAD(&allpmaps, pmap, pm_list);
 	mtx_unlock_spin(&allpmaps_lock);
 }
 
 /*
  * Initialize a preallocated and zeroed pmap structure,
  * such as one in a vmspace structure.
  */
 int
 pmap_pinit(pmap_t pmap)
 {
 	vm_page_t m, ptdpg[NPGPTD + 1];
 	int npgptd = NPGPTD + 1;
 	static int color;
 	int i;
 
 #ifdef HAMFISTED_LOCKING
 	mtx_lock(&createdelete_lock);
 #endif
 
 	PMAP_LOCK_INIT(pmap);
 
 	/*
 	 * No need to allocate page table space yet but we do need a valid
 	 * page directory table.
 	 */
 	if (pmap->pm_pdir == NULL) {
 		pmap->pm_pdir = (pd_entry_t *)kmem_alloc_nofault(kernel_map,
 		    NBPTD);
 		if (pmap->pm_pdir == NULL) {
 			PMAP_LOCK_DESTROY(pmap);
 #ifdef HAMFISTED_LOCKING
 			mtx_unlock(&createdelete_lock);
 #endif
 			return (0);
 		}
 #ifdef PAE
 		pmap->pm_pdpt = (pd_entry_t *)kmem_alloc_nofault(kernel_map, 1);
 #endif
 	}
 
 	/*
 	 * allocate the page directory page(s)
 	 */
 	for (i = 0; i < npgptd;) {
 		m = vm_page_alloc(NULL, color++,
 		    VM_ALLOC_NORMAL | VM_ALLOC_NOOBJ | VM_ALLOC_WIRED |
 		    VM_ALLOC_ZERO);
 		if (m == NULL)
 			VM_WAIT;
 		else {
 			ptdpg[i++] = m;
 		}
 	}
 	pmap_qenter((vm_offset_t)pmap->pm_pdir, ptdpg, NPGPTD);
 	for (i = 0; i < NPGPTD; i++) {
 		if ((ptdpg[i]->flags & PG_ZERO) == 0)
 			pagezero(&pmap->pm_pdir[i*NPTEPG]);
 	}
 
 	mtx_lock_spin(&allpmaps_lock);
 	LIST_INSERT_HEAD(&allpmaps, pmap, pm_list);
 	mtx_unlock_spin(&allpmaps_lock);
 	/* Wire in kernel global address entries. */
 
 	bcopy(PTD + KPTDI, pmap->pm_pdir + KPTDI, nkpt * sizeof(pd_entry_t));
 #ifdef PAE
 	pmap_qenter((vm_offset_t)pmap->pm_pdpt, &ptdpg[NPGPTD], 1);
 	if ((ptdpg[NPGPTD]->flags & PG_ZERO) == 0)
 		bzero(pmap->pm_pdpt, PAGE_SIZE);
 	for (i = 0; i < NPGPTD; i++) {
 		vm_paddr_t ma;
 		
 		ma = VM_PAGE_TO_MACH(ptdpg[i]);
 		pmap->pm_pdpt[i] = ma | PG_V;
 
 	}
 #endif	
 	for (i = 0; i < NPGPTD; i++) {
 		pt_entry_t *pd;
 		vm_paddr_t ma;
 		
 		ma = VM_PAGE_TO_MACH(ptdpg[i]);
 		pd = pmap->pm_pdir + (i * NPDEPG);
 		PT_SET_MA(pd, *vtopte((vm_offset_t)pd) & ~(PG_M|PG_A|PG_U|PG_RW));
 #if 0		
 		xen_pgd_pin(ma);
 #endif		
 	}
 	
 #ifdef PAE	
 	PT_SET_MA(pmap->pm_pdpt, *vtopte((vm_offset_t)pmap->pm_pdpt) & ~PG_RW);
 #endif
 	vm_page_lock_queues();
 	xen_flush_queue();
 	xen_pgdpt_pin(VM_PAGE_TO_MACH(ptdpg[NPGPTD]));
 	for (i = 0; i < NPGPTD; i++) {
 		vm_paddr_t ma = VM_PAGE_TO_MACH(ptdpg[i]);
 		PT_SET_VA_MA(&pmap->pm_pdir[PTDPTDI + i], ma | PG_V | PG_A, FALSE);
 	}
 	xen_flush_queue();
 	vm_page_unlock_queues();
-	pmap->pm_active = 0;
+	CPU_ZERO(&pmap->pm_active);
 	TAILQ_INIT(&pmap->pm_pvchunk);
 	bzero(&pmap->pm_stats, sizeof pmap->pm_stats);
 
 #ifdef HAMFISTED_LOCKING
 	mtx_unlock(&createdelete_lock);
 #endif
 	return (1);
 }
 
 /*
  * this routine is called if the page table page is not
  * mapped correctly.
  */
 static vm_page_t
 _pmap_allocpte(pmap_t pmap, unsigned int ptepindex, int flags)
 {
 	vm_paddr_t ptema;
 	vm_page_t m;
 
 	KASSERT((flags & (M_NOWAIT | M_WAITOK)) == M_NOWAIT ||
 	    (flags & (M_NOWAIT | M_WAITOK)) == M_WAITOK,
 	    ("_pmap_allocpte: flags is neither M_NOWAIT nor M_WAITOK"));
 
 	/*
 	 * Allocate a page table page.
 	 */
 	if ((m = vm_page_alloc(NULL, ptepindex, VM_ALLOC_NOOBJ |
 	    VM_ALLOC_WIRED | VM_ALLOC_ZERO)) == NULL) {
 		if (flags & M_WAITOK) {
 			PMAP_UNLOCK(pmap);
 			vm_page_unlock_queues();
 			VM_WAIT;
 			vm_page_lock_queues();
 			PMAP_LOCK(pmap);
 		}
 
 		/*
 		 * Indicate the need to retry.  While waiting, the page table
 		 * page may have been allocated.
 		 */
 		return (NULL);
 	}
 	if ((m->flags & PG_ZERO) == 0)
 		pmap_zero_page(m);
 
 	/*
 	 * Map the pagetable page into the process address space, if
 	 * it isn't already there.
 	 */
 	pmap->pm_stats.resident_count++;
 
 	ptema = VM_PAGE_TO_MACH(m);
 	xen_pt_pin(ptema);
 	PT_SET_VA_MA(&pmap->pm_pdir[ptepindex],
 		(ptema | PG_U | PG_RW | PG_V | PG_A | PG_M), TRUE);
 	
 	KASSERT(pmap->pm_pdir[ptepindex],
 	    ("_pmap_allocpte: ptepindex=%d did not get mapped", ptepindex));
 	return (m);
 }
 
 static vm_page_t
 pmap_allocpte(pmap_t pmap, vm_offset_t va, int flags)
 {
 	unsigned ptepindex;
 	pd_entry_t ptema;
 	vm_page_t m;
 
 	KASSERT((flags & (M_NOWAIT | M_WAITOK)) == M_NOWAIT ||
 	    (flags & (M_NOWAIT | M_WAITOK)) == M_WAITOK,
 	    ("pmap_allocpte: flags is neither M_NOWAIT nor M_WAITOK"));
 
 	/*
 	 * Calculate pagetable page index
 	 */
 	ptepindex = va >> PDRSHIFT;
 retry:
 	/*
 	 * Get the page directory entry
 	 */
 	ptema = pmap->pm_pdir[ptepindex];
 
 	/*
 	 * This supports switching from a 4MB page to a
 	 * normal 4K page.
 	 */
 	if (ptema & PG_PS) {
 		/*
 		 * XXX 
 		 */
 		pmap->pm_pdir[ptepindex] = 0;
 		ptema = 0;
 		pmap->pm_stats.resident_count -= NBPDR / PAGE_SIZE;
 		pmap_invalidate_all(kernel_pmap);
 	}
 
 	/*
 	 * If the page table page is mapped, we just increment the
 	 * hold count, and activate it.
 	 */
 	if (ptema & PG_V) {
 		m = PHYS_TO_VM_PAGE(xpmap_mtop(ptema) & PG_FRAME);
 		m->wire_count++;
 	} else {
 		/*
 		 * Here if the pte page isn't mapped, or if it has
 		 * been deallocated. 
 		 */
 		CTR3(KTR_PMAP, "pmap_allocpte: pmap=%p va=0x%08x flags=0x%x",
 		    pmap, va, flags);
 		m = _pmap_allocpte(pmap, ptepindex, flags);
 		if (m == NULL && (flags & M_WAITOK))
 			goto retry;
 
 		KASSERT(pmap->pm_pdir[ptepindex], ("ptepindex=%d did not get mapped", ptepindex));
 	}
 	return (m);
 }
 
 
 /***************************************************
  * Pmap allocation/deallocation routines.
  ***************************************************/
 
 #ifdef SMP
 /*
  * Deal with an SMP shootdown of other users of the pmap that we are
  * trying to dispose of.  This can be a bit hairy.
  */
-static cpumask_t *lazymask;
+static cpuset_t *lazymask;
 static u_int lazyptd;
 static volatile u_int lazywait;
 
 void pmap_lazyfix_action(void);
 
 void
 pmap_lazyfix_action(void)
 {
-	cpumask_t mymask = PCPU_GET(cpumask);
 
 #ifdef COUNT_IPIS
 	(*ipi_lazypmap_counts[PCPU_GET(cpuid)])++;
 #endif
 	if (rcr3() == lazyptd)
 		load_cr3(PCPU_GET(curpcb)->pcb_cr3);
-	atomic_clear_int(lazymask, mymask);
+	CPU_CLR_ATOMIC(PCPU_GET(cpuid), lazymask);
 	atomic_store_rel_int(&lazywait, 1);
 }
 
 static void
-pmap_lazyfix_self(cpumask_t mymask)
+pmap_lazyfix_self(cpuset_t mymask)
 {
 
 	if (rcr3() == lazyptd)
 		load_cr3(PCPU_GET(curpcb)->pcb_cr3);
-	atomic_clear_int(lazymask, mymask);
+	CPU_NAND_ATOMIC(lazymask, &mymask);
 }
 
 
 static void
 pmap_lazyfix(pmap_t pmap)
 {
-	cpumask_t mymask, mask;
+	cpuset_t mymask, mask;
 	u_int spins;
+	int lsb;
 
-	while ((mask = pmap->pm_active) != 0) {
+	mask = pmap->pm_active;
+	while (!CPU_EMPTY(&mask)) {
 		spins = 50000000;
-		mask = mask & -mask;	/* Find least significant set bit */
+
+		/* Find least significant set bit. */
+		lsb = cpusetobj_ffs(&mask);
+		MPASS(lsb != 0);
+		lsb--;
+		CPU_SETOF(lsb, &mask);
 		mtx_lock_spin(&smp_ipi_mtx);
 #ifdef PAE
 		lazyptd = vtophys(pmap->pm_pdpt);
 #else
 		lazyptd = vtophys(pmap->pm_pdir);
 #endif
 		mymask = PCPU_GET(cpumask);
-		if (mask == mymask) {
+		if (!CPU_CMP(&mask, &mymask)) {
 			lazymask = &pmap->pm_active;
 			pmap_lazyfix_self(mymask);
 		} else {
 			atomic_store_rel_int((u_int *)&lazymask,
 			    (u_int)&pmap->pm_active);
 			atomic_store_rel_int(&lazywait, 0);
 			ipi_selected(mask, IPI_LAZYPMAP);
 			while (lazywait == 0) {
 				ia32_pause();
 				if (--spins == 0)
 					break;
 			}
 		}
 		mtx_unlock_spin(&smp_ipi_mtx);
 		if (spins == 0)
 			printf("pmap_lazyfix: spun for 50000000\n");
+		mask = pmap->pm_active;
 	}
 }
 
 #else	/* SMP */
 
 /*
  * Cleaning up on uniprocessor is easy.  For various reasons, we're
  * unlikely to have to even execute this code, including the fact
  * that the cleanup is deferred until the parent does a wait(2), which
  * means that another userland process has run.
  */
 static void
 pmap_lazyfix(pmap_t pmap)
 {
 	u_int cr3;
 
 	cr3 = vtophys(pmap->pm_pdir);
 	if (cr3 == rcr3()) {
 		load_cr3(PCPU_GET(curpcb)->pcb_cr3);
-		pmap->pm_active &= ~(PCPU_GET(cpumask));
+		CPU_CLR(PCPU_GET(cpuid), &pmap->pm_active);
 	}
 }
 #endif	/* SMP */
 
 /*
  * Release any resources held by the given physical map.
  * Called when a pmap initialized by pmap_pinit is being released.
  * Should only be called if the map contains no valid mappings.
  */
 void
 pmap_release(pmap_t pmap)
 {
 	vm_page_t m, ptdpg[2*NPGPTD+1];
 	vm_paddr_t ma;
 	int i;
 #ifdef PAE	
 	int npgptd = NPGPTD + 1;
 #else
 	int npgptd = NPGPTD;
 #endif
 	KASSERT(pmap->pm_stats.resident_count == 0,
 	    ("pmap_release: pmap resident count %ld != 0",
 	    pmap->pm_stats.resident_count));
 	PT_UPDATES_FLUSH();
 
 #ifdef HAMFISTED_LOCKING
 	mtx_lock(&createdelete_lock);
 #endif
 
 	pmap_lazyfix(pmap);
 	mtx_lock_spin(&allpmaps_lock);
 	LIST_REMOVE(pmap, pm_list);
 	mtx_unlock_spin(&allpmaps_lock);
 
 	for (i = 0; i < NPGPTD; i++)
 		ptdpg[i] = PHYS_TO_VM_PAGE(vtophys(pmap->pm_pdir + (i*NPDEPG)) & PG_FRAME);
 	pmap_qremove((vm_offset_t)pmap->pm_pdir, NPGPTD);
 #ifdef PAE
 	ptdpg[NPGPTD] = PHYS_TO_VM_PAGE(vtophys(pmap->pm_pdpt));
 #endif	
 
 	for (i = 0; i < npgptd; i++) {
 		m = ptdpg[i];
 		ma = VM_PAGE_TO_MACH(m);
 		/* unpinning L1 and L2 treated the same */
 #if 0
                 xen_pgd_unpin(ma);
 #else
 		if (i == NPGPTD)
 	                xen_pgd_unpin(ma);
 #endif
 #ifdef PAE
 		if (i < NPGPTD)
 			KASSERT(VM_PAGE_TO_MACH(m) == (pmap->pm_pdpt[i] & PG_FRAME),
 			    ("pmap_release: got wrong ptd page"));
 #endif
 		m->wire_count--;
 		atomic_subtract_int(&cnt.v_wire_count, 1);
 		vm_page_free(m);
 	}
 #ifdef PAE
 	pmap_qremove((vm_offset_t)pmap->pm_pdpt, 1);
 #endif
 	PMAP_LOCK_DESTROY(pmap);
 
 #ifdef HAMFISTED_LOCKING
 	mtx_unlock(&createdelete_lock);
 #endif
 }
 
 static int
 kvm_size(SYSCTL_HANDLER_ARGS)
 {
 	unsigned long ksize = VM_MAX_KERNEL_ADDRESS - KERNBASE;
 
 	return sysctl_handle_long(oidp, &ksize, 0, req);
 }
 SYSCTL_PROC(_vm, OID_AUTO, kvm_size, CTLTYPE_LONG|CTLFLAG_RD, 
     0, 0, kvm_size, "IU", "Size of KVM");
 
 static int
 kvm_free(SYSCTL_HANDLER_ARGS)
 {
 	unsigned long kfree = VM_MAX_KERNEL_ADDRESS - kernel_vm_end;
 
 	return sysctl_handle_long(oidp, &kfree, 0, req);
 }
 SYSCTL_PROC(_vm, OID_AUTO, kvm_free, CTLTYPE_LONG|CTLFLAG_RD, 
     0, 0, kvm_free, "IU", "Amount of KVM free");
 
 /*
  * grow the number of kernel page table entries, if needed
  */
 void
 pmap_growkernel(vm_offset_t addr)
 {
 	struct pmap *pmap;
 	vm_paddr_t ptppaddr;
 	vm_page_t nkpg;
 	pd_entry_t newpdir;
 
 	mtx_assert(&kernel_map->system_mtx, MA_OWNED);
 	if (kernel_vm_end == 0) {
 		kernel_vm_end = KERNBASE;
 		nkpt = 0;
 		while (pdir_pde(PTD, kernel_vm_end)) {
 			kernel_vm_end = (kernel_vm_end + PAGE_SIZE * NPTEPG) & ~(PAGE_SIZE * NPTEPG - 1);
 			nkpt++;
 			if (kernel_vm_end - 1 >= kernel_map->max_offset) {
 				kernel_vm_end = kernel_map->max_offset;
 				break;
 			}
 		}
 	}
 	addr = roundup2(addr, PAGE_SIZE * NPTEPG);
 	if (addr - 1 >= kernel_map->max_offset)
 		addr = kernel_map->max_offset;
 	while (kernel_vm_end < addr) {
 		if (pdir_pde(PTD, kernel_vm_end)) {
 			kernel_vm_end = (kernel_vm_end + PAGE_SIZE * NPTEPG) & ~(PAGE_SIZE * NPTEPG - 1);
 			if (kernel_vm_end - 1 >= kernel_map->max_offset) {
 				kernel_vm_end = kernel_map->max_offset;
 				break;
 			}
 			continue;
 		}
 
 		/*
 		 * This index is bogus, but out of the way
 		 */
 		nkpg = vm_page_alloc(NULL, nkpt,
 		    VM_ALLOC_NOOBJ | VM_ALLOC_SYSTEM | VM_ALLOC_WIRED);
 		if (!nkpg)
 			panic("pmap_growkernel: no memory to grow kernel");
 
 		nkpt++;
 
 		pmap_zero_page(nkpg);
 		ptppaddr = VM_PAGE_TO_PHYS(nkpg);
 		newpdir = (pd_entry_t) (ptppaddr | PG_V | PG_RW | PG_A | PG_M);
 		vm_page_lock_queues();
 		PD_SET_VA(kernel_pmap, (kernel_vm_end >> PDRSHIFT), newpdir, TRUE);
 		mtx_lock_spin(&allpmaps_lock);
 		LIST_FOREACH(pmap, &allpmaps, pm_list)
 			PD_SET_VA(pmap, (kernel_vm_end >> PDRSHIFT), newpdir, TRUE);
 
 		mtx_unlock_spin(&allpmaps_lock);
 		vm_page_unlock_queues();
 
 		kernel_vm_end = (kernel_vm_end + PAGE_SIZE * NPTEPG) & ~(PAGE_SIZE * NPTEPG - 1);
 		if (kernel_vm_end - 1 >= kernel_map->max_offset) {
 			kernel_vm_end = kernel_map->max_offset;
 			break;
 		}
 	}
 }
 
 
 /***************************************************
  * page management routines.
  ***************************************************/
 
 CTASSERT(sizeof(struct pv_chunk) == PAGE_SIZE);
 CTASSERT(_NPCM == 11);
 
 static __inline struct pv_chunk *
 pv_to_chunk(pv_entry_t pv)
 {
 
 	return (struct pv_chunk *)((uintptr_t)pv & ~(uintptr_t)PAGE_MASK);
 }
 
 #define PV_PMAP(pv) (pv_to_chunk(pv)->pc_pmap)
 
 #define	PC_FREE0_9	0xfffffffful	/* Free values for index 0 through 9 */
 #define	PC_FREE10	0x0000fffful	/* Free values for index 10 */
 
 static uint32_t pc_freemask[11] = {
 	PC_FREE0_9, PC_FREE0_9, PC_FREE0_9,
 	PC_FREE0_9, PC_FREE0_9, PC_FREE0_9,
 	PC_FREE0_9, PC_FREE0_9, PC_FREE0_9,
 	PC_FREE0_9, PC_FREE10
 };
 
 SYSCTL_INT(_vm_pmap, OID_AUTO, pv_entry_count, CTLFLAG_RD, &pv_entry_count, 0,
 	"Current number of pv entries");
 
 #ifdef PV_STATS
 static int pc_chunk_count, pc_chunk_allocs, pc_chunk_frees, pc_chunk_tryfail;
 
 SYSCTL_INT(_vm_pmap, OID_AUTO, pc_chunk_count, CTLFLAG_RD, &pc_chunk_count, 0,
 	"Current number of pv entry chunks");
 SYSCTL_INT(_vm_pmap, OID_AUTO, pc_chunk_allocs, CTLFLAG_RD, &pc_chunk_allocs, 0,
 	"Current number of pv entry chunks allocated");
 SYSCTL_INT(_vm_pmap, OID_AUTO, pc_chunk_frees, CTLFLAG_RD, &pc_chunk_frees, 0,
 	"Current number of pv entry chunks frees");
 SYSCTL_INT(_vm_pmap, OID_AUTO, pc_chunk_tryfail, CTLFLAG_RD, &pc_chunk_tryfail, 0,
 	"Number of times tried to get a chunk page but failed.");
 
 static long pv_entry_frees, pv_entry_allocs;
 static int pv_entry_spare;
 
 SYSCTL_LONG(_vm_pmap, OID_AUTO, pv_entry_frees, CTLFLAG_RD, &pv_entry_frees, 0,
 	"Current number of pv entry frees");
 SYSCTL_LONG(_vm_pmap, OID_AUTO, pv_entry_allocs, CTLFLAG_RD, &pv_entry_allocs, 0,
 	"Current number of pv entry allocs");
 SYSCTL_INT(_vm_pmap, OID_AUTO, pv_entry_spare, CTLFLAG_RD, &pv_entry_spare, 0,
 	"Current number of spare pv entries");
 
 static int pmap_collect_inactive, pmap_collect_active;
 
 SYSCTL_INT(_vm_pmap, OID_AUTO, pmap_collect_inactive, CTLFLAG_RD, &pmap_collect_inactive, 0,
 	"Current number times pmap_collect called on inactive queue");
 SYSCTL_INT(_vm_pmap, OID_AUTO, pmap_collect_active, CTLFLAG_RD, &pmap_collect_active, 0,
 	"Current number times pmap_collect called on active queue");
 #endif
 
 /*
  * We are in a serious low memory condition.  Resort to
  * drastic measures to free some pages so we can allocate
  * another pv entry chunk.  This is normally called to
  * unmap inactive pages, and if necessary, active pages.
  */
 static void
 pmap_collect(pmap_t locked_pmap, struct vpgqueues *vpq)
 {
 	pmap_t pmap;
 	pt_entry_t *pte, tpte;
 	pv_entry_t next_pv, pv;
 	vm_offset_t va;
 	vm_page_t m, free;
 
 	sched_pin();
 	TAILQ_FOREACH(m, &vpq->pl, pageq) {
 		if (m->hold_count || m->busy)
 			continue;
 		TAILQ_FOREACH_SAFE(pv, &m->md.pv_list, pv_list, next_pv) {
 			va = pv->pv_va;
 			pmap = PV_PMAP(pv);
 			/* Avoid deadlock and lock recursion. */
 			if (pmap > locked_pmap)
 				PMAP_LOCK(pmap);
 			else if (pmap != locked_pmap && !PMAP_TRYLOCK(pmap))
 				continue;
 			pmap->pm_stats.resident_count--;
 			pte = pmap_pte_quick(pmap, va);
 			tpte = pte_load_clear(pte);
 			KASSERT((tpte & PG_W) == 0,
 			    ("pmap_collect: wired pte %#jx", (uintmax_t)tpte));
 			if (tpte & PG_A)
 				vm_page_flag_set(m, PG_REFERENCED);
 			if ((tpte & (PG_M | PG_RW)) == (PG_M | PG_RW))
 				vm_page_dirty(m);
 			free = NULL;
 			pmap_unuse_pt(pmap, va, &free);
 			pmap_invalidate_page(pmap, va);
 			pmap_free_zero_pages(free);
 			TAILQ_REMOVE(&m->md.pv_list, pv, pv_list);
 			free_pv_entry(pmap, pv);
 			if (pmap != locked_pmap)
 				PMAP_UNLOCK(pmap);
 		}
 		if (TAILQ_EMPTY(&m->md.pv_list))
 			vm_page_flag_clear(m, PG_WRITEABLE);
 	}
 	sched_unpin();
 }
 
 
 /*
  * free the pv_entry back to the free list
  */
 static void
 free_pv_entry(pmap_t pmap, pv_entry_t pv)
 {
 	vm_page_t m;
 	struct pv_chunk *pc;
 	int idx, field, bit;
 
 	mtx_assert(&vm_page_queue_mtx, MA_OWNED);
 	PMAP_LOCK_ASSERT(pmap, MA_OWNED);
 	PV_STAT(pv_entry_frees++);
 	PV_STAT(pv_entry_spare++);
 	pv_entry_count--;
 	pc = pv_to_chunk(pv);
 	idx = pv - &pc->pc_pventry[0];
 	field = idx / 32;
 	bit = idx % 32;
 	pc->pc_map[field] |= 1ul << bit;
 	/* move to head of list */
 	TAILQ_REMOVE(&pmap->pm_pvchunk, pc, pc_list);
 	TAILQ_INSERT_HEAD(&pmap->pm_pvchunk, pc, pc_list);
 	for (idx = 0; idx < _NPCM; idx++)
 		if (pc->pc_map[idx] != pc_freemask[idx])
 			return;
 	PV_STAT(pv_entry_spare -= _NPCPV);
 	PV_STAT(pc_chunk_count--);
 	PV_STAT(pc_chunk_frees++);
 	/* entire chunk is free, return it */
 	TAILQ_REMOVE(&pmap->pm_pvchunk, pc, pc_list);
 	m = PHYS_TO_VM_PAGE(pmap_kextract((vm_offset_t)pc));
 	pmap_qremove((vm_offset_t)pc, 1);
 	vm_page_unwire(m, 0);
 	vm_page_free(m);
 	pmap_ptelist_free(&pv_vafree, (vm_offset_t)pc);
 }
 
 /*
  * get a new pv_entry, allocating a block from the system
  * when needed.
  */
 static pv_entry_t
 get_pv_entry(pmap_t pmap, int try)
 {
 	static const struct timeval printinterval = { 60, 0 };
 	static struct timeval lastprint;
 	static vm_pindex_t colour;
 	struct vpgqueues *pq;
 	int bit, field;
 	pv_entry_t pv;
 	struct pv_chunk *pc;
 	vm_page_t m;
 
 	PMAP_LOCK_ASSERT(pmap, MA_OWNED);
 	mtx_assert(&vm_page_queue_mtx, MA_OWNED);
 	PV_STAT(pv_entry_allocs++);
 	pv_entry_count++;
 	if (pv_entry_count > pv_entry_high_water)
 		if (ratecheck(&lastprint, &printinterval))
 			printf("Approaching the limit on PV entries, consider "
 			    "increasing either the vm.pmap.shpgperproc or the "
 			    "vm.pmap.pv_entry_max tunable.\n");
 	pq = NULL;
 retry:
 	pc = TAILQ_FIRST(&pmap->pm_pvchunk);
 	if (pc != NULL) {
 		for (field = 0; field < _NPCM; field++) {
 			if (pc->pc_map[field]) {
 				bit = bsfl(pc->pc_map[field]);
 				break;
 			}
 		}
 		if (field < _NPCM) {
 			pv = &pc->pc_pventry[field * 32 + bit];
 			pc->pc_map[field] &= ~(1ul << bit);
 			/* If this was the last item, move it to tail */
 			for (field = 0; field < _NPCM; field++)
 				if (pc->pc_map[field] != 0) {
 					PV_STAT(pv_entry_spare--);
 					return (pv);	/* not full, return */
 				}
 			TAILQ_REMOVE(&pmap->pm_pvchunk, pc, pc_list);
 			TAILQ_INSERT_TAIL(&pmap->pm_pvchunk, pc, pc_list);
 			PV_STAT(pv_entry_spare--);
 			return (pv);
 		}
 	}
 	/*
 	 * Access to the ptelist "pv_vafree" is synchronized by the page
 	 * queues lock.  If "pv_vafree" is currently non-empty, it will
 	 * remain non-empty until pmap_ptelist_alloc() completes.
 	 */
 	if (pv_vafree == 0 || (m = vm_page_alloc(NULL, colour, (pq ==
 	    &vm_page_queues[PQ_ACTIVE] ? VM_ALLOC_SYSTEM : VM_ALLOC_NORMAL) |
 	    VM_ALLOC_NOOBJ | VM_ALLOC_WIRED)) == NULL) {
 		if (try) {
 			pv_entry_count--;
 			PV_STAT(pc_chunk_tryfail++);
 			return (NULL);
 		}
 		/*
 		 * Reclaim pv entries: At first, destroy mappings to
 		 * inactive pages.  After that, if a pv chunk entry
 		 * is still needed, destroy mappings to active pages.
 		 */
 		if (pq == NULL) {
 			PV_STAT(pmap_collect_inactive++);
 			pq = &vm_page_queues[PQ_INACTIVE];
 		} else if (pq == &vm_page_queues[PQ_INACTIVE]) {
 			PV_STAT(pmap_collect_active++);
 			pq = &vm_page_queues[PQ_ACTIVE];
 		} else
 			panic("get_pv_entry: increase vm.pmap.shpgperproc");
 		pmap_collect(pmap, pq);
 		goto retry;
 	}
 	PV_STAT(pc_chunk_count++);
 	PV_STAT(pc_chunk_allocs++);
 	colour++;
 	pc = (struct pv_chunk *)pmap_ptelist_alloc(&pv_vafree);
 	pmap_qenter((vm_offset_t)pc, &m, 1);
 	if ((m->flags & PG_ZERO) == 0)
 		pagezero(pc);
 	pc->pc_pmap = pmap;
 	pc->pc_map[0] = pc_freemask[0] & ~1ul;	/* preallocated bit 0 */
 	for (field = 1; field < _NPCM; field++)
 		pc->pc_map[field] = pc_freemask[field];
 	pv = &pc->pc_pventry[0];
 	TAILQ_INSERT_HEAD(&pmap->pm_pvchunk, pc, pc_list);
 	PV_STAT(pv_entry_spare += _NPCPV - 1);
 	return (pv);
 }
 
 static __inline pv_entry_t
 pmap_pvh_remove(struct md_page *pvh, pmap_t pmap, vm_offset_t va)
 {
 	pv_entry_t pv;
 
 	mtx_assert(&vm_page_queue_mtx, MA_OWNED);
 	TAILQ_FOREACH(pv, &pvh->pv_list, pv_list) {
 		if (pmap == PV_PMAP(pv) && va == pv->pv_va) {
 			TAILQ_REMOVE(&pvh->pv_list, pv, pv_list);
 			break;
 		}
 	}
 	return (pv);
 }
 
 static void
 pmap_pvh_free(struct md_page *pvh, pmap_t pmap, vm_offset_t va)
 {
 	pv_entry_t pv;
 
 	pv = pmap_pvh_remove(pvh, pmap, va);
 	KASSERT(pv != NULL, ("pmap_pvh_free: pv not found"));
 	free_pv_entry(pmap, pv);
 }
 
 static void
 pmap_remove_entry(pmap_t pmap, vm_page_t m, vm_offset_t va)
 {
 
 	mtx_assert(&vm_page_queue_mtx, MA_OWNED);
 	pmap_pvh_free(&m->md, pmap, va);
 	if (TAILQ_EMPTY(&m->md.pv_list))
 		vm_page_flag_clear(m, PG_WRITEABLE);
 }
 
 /*
  * Conditionally create a pv entry.
  */
 static boolean_t
 pmap_try_insert_pv_entry(pmap_t pmap, vm_offset_t va, vm_page_t m)
 {
 	pv_entry_t pv;
 
 	PMAP_LOCK_ASSERT(pmap, MA_OWNED);
 	mtx_assert(&vm_page_queue_mtx, MA_OWNED);
 	if (pv_entry_count < pv_entry_high_water && 
 	    (pv = get_pv_entry(pmap, TRUE)) != NULL) {
 		pv->pv_va = va;
 		TAILQ_INSERT_TAIL(&m->md.pv_list, pv, pv_list);
 		return (TRUE);
 	} else
 		return (FALSE);
 }
 
 /*
  * pmap_remove_pte: do the things to unmap a page in a process
  */
 static int
 pmap_remove_pte(pmap_t pmap, pt_entry_t *ptq, vm_offset_t va, vm_page_t *free)
 {
 	pt_entry_t oldpte;
 	vm_page_t m;
 
 	CTR3(KTR_PMAP, "pmap_remove_pte: pmap=%p *ptq=0x%x va=0x%x",
 	    pmap, (u_long)*ptq, va);
 	
 	mtx_assert(&vm_page_queue_mtx, MA_OWNED);
 	PMAP_LOCK_ASSERT(pmap, MA_OWNED);
 	oldpte = *ptq;
 	PT_SET_VA_MA(ptq, 0, TRUE);
 	if (oldpte & PG_W)
 		pmap->pm_stats.wired_count -= 1;
 	/*
 	 * Machines that don't support invlpg, also don't support
 	 * PG_G.
 	 */
 	if (oldpte & PG_G)
 		pmap_invalidate_page(kernel_pmap, va);
 	pmap->pm_stats.resident_count -= 1;
 	if (oldpte & PG_MANAGED) {
 		m = PHYS_TO_VM_PAGE(xpmap_mtop(oldpte) & PG_FRAME);
 		if ((oldpte & (PG_M | PG_RW)) == (PG_M | PG_RW))
 			vm_page_dirty(m);
 		if (oldpte & PG_A)
 			vm_page_flag_set(m, PG_REFERENCED);
 		pmap_remove_entry(pmap, m, va);
 	}
 	return (pmap_unuse_pt(pmap, va, free));
 }
 
 /*
  * Remove a single page from a process address space
  */
 static void
 pmap_remove_page(pmap_t pmap, vm_offset_t va, vm_page_t *free)
 {
 	pt_entry_t *pte;
 
 	CTR2(KTR_PMAP, "pmap_remove_page: pmap=%p va=0x%x",
 	    pmap, va);
 	
 	mtx_assert(&vm_page_queue_mtx, MA_OWNED);
 	KASSERT(curthread->td_pinned > 0, ("curthread not pinned"));
 	PMAP_LOCK_ASSERT(pmap, MA_OWNED);
 	if ((pte = pmap_pte_quick(pmap, va)) == NULL || (*pte & PG_V) == 0)
 		return;
 	pmap_remove_pte(pmap, pte, va, free);
 	pmap_invalidate_page(pmap, va);
 	if (*PMAP1)
 		PT_SET_MA(PADDR1, 0);
 
 }
 
 /*
  *	Remove the given range of addresses from the specified map.
  *
  *	It is assumed that the start and end are properly
  *	rounded to the page size.
  */
 void
 pmap_remove(pmap_t pmap, vm_offset_t sva, vm_offset_t eva)
 {
 	vm_offset_t pdnxt;
 	pd_entry_t ptpaddr;
 	pt_entry_t *pte;
 	vm_page_t free = NULL;
 	int anyvalid;
 	
 	CTR3(KTR_PMAP, "pmap_remove: pmap=%p sva=0x%x eva=0x%x",
 	    pmap, sva, eva);
 	
 	/*
 	 * Perform an unsynchronized read.  This is, however, safe.
 	 */
 	if (pmap->pm_stats.resident_count == 0)
 		return;
 
 	anyvalid = 0;
 
 	vm_page_lock_queues();
 	sched_pin();
 	PMAP_LOCK(pmap);
 
 	/*
 	 * special handling of removing one page.  a very
 	 * common operation and easy to short circuit some
 	 * code.
 	 */
 	if ((sva + PAGE_SIZE == eva) && 
 	    ((pmap->pm_pdir[(sva >> PDRSHIFT)] & PG_PS) == 0)) {
 		pmap_remove_page(pmap, sva, &free);
 		goto out;
 	}
 
 	for (; sva < eva; sva = pdnxt) {
 		unsigned pdirindex;
 
 		/*
 		 * Calculate index for next page table.
 		 */
 		pdnxt = (sva + NBPDR) & ~PDRMASK;
 		if (pmap->pm_stats.resident_count == 0)
 			break;
 
 		pdirindex = sva >> PDRSHIFT;
 		ptpaddr = pmap->pm_pdir[pdirindex];
 
 		/*
 		 * Weed out invalid mappings. Note: we assume that the page
 		 * directory table is always allocated, and in kernel virtual.
 		 */
 		if (ptpaddr == 0)
 			continue;
 
 		/*
 		 * Check for large page.
 		 */
 		if ((ptpaddr & PG_PS) != 0) {
 			PD_CLEAR_VA(pmap, pdirindex, TRUE);
 			pmap->pm_stats.resident_count -= NBPDR / PAGE_SIZE;
 			anyvalid = 1;
 			continue;
 		}
 
 		/*
 		 * Limit our scan to either the end of the va represented
 		 * by the current page table page, or to the end of the
 		 * range being removed.
 		 */
 		if (pdnxt > eva)
 			pdnxt = eva;
 
 		for (pte = pmap_pte_quick(pmap, sva); sva != pdnxt; pte++,
 		    sva += PAGE_SIZE) {
 			if ((*pte & PG_V) == 0)
 				continue;
 
 			/*
 			 * The TLB entry for a PG_G mapping is invalidated
 			 * by pmap_remove_pte().
 			 */
 			if ((*pte & PG_G) == 0)
 				anyvalid = 1;
 			if (pmap_remove_pte(pmap, pte, sva, &free))
 				break;
 		}
 	}
 	PT_UPDATES_FLUSH();
 	if (*PMAP1)
 		PT_SET_VA_MA(PMAP1, 0, TRUE);
 out:
 	if (anyvalid)
 		pmap_invalidate_all(pmap);
 	sched_unpin();
 	vm_page_unlock_queues();
 	PMAP_UNLOCK(pmap);
 	pmap_free_zero_pages(free);
 }
 
 /*
  *	Routine:	pmap_remove_all
  *	Function:
  *		Removes this physical page from
  *		all physical maps in which it resides.
  *		Reflects back modify bits to the pager.
  *
  *	Notes:
  *		Original versions of this routine were very
  *		inefficient because they iteratively called
  *		pmap_remove (slow...)
  */
 
 void
 pmap_remove_all(vm_page_t m)
 {
 	pv_entry_t pv;
 	pmap_t pmap;
 	pt_entry_t *pte, tpte;
 	vm_page_t free;
 
 	KASSERT((m->flags & PG_FICTITIOUS) == 0,
 	    ("pmap_remove_all: page %p is fictitious", m));
 	free = NULL;
 	vm_page_lock_queues();
 	sched_pin();
 	while ((pv = TAILQ_FIRST(&m->md.pv_list)) != NULL) {
 		pmap = PV_PMAP(pv);
 		PMAP_LOCK(pmap);
 		pmap->pm_stats.resident_count--;
 		pte = pmap_pte_quick(pmap, pv->pv_va);
 
 		tpte = *pte;
 		PT_SET_VA_MA(pte, 0, TRUE);
 		if (tpte & PG_W)
 			pmap->pm_stats.wired_count--;
 		if (tpte & PG_A)
 			vm_page_flag_set(m, PG_REFERENCED);
 
 		/*
 		 * Update the vm_page_t clean and reference bits.
 		 */
 		if ((tpte & (PG_M | PG_RW)) == (PG_M | PG_RW))
 			vm_page_dirty(m);
 		pmap_unuse_pt(pmap, pv->pv_va, &free);
 		pmap_invalidate_page(pmap, pv->pv_va);
 		TAILQ_REMOVE(&m->md.pv_list, pv, pv_list);
 		free_pv_entry(pmap, pv);
 		PMAP_UNLOCK(pmap);
 	}
 	vm_page_flag_clear(m, PG_WRITEABLE);
 	PT_UPDATES_FLUSH();
 	if (*PMAP1)
 		PT_SET_MA(PADDR1, 0);
 	sched_unpin();
 	vm_page_unlock_queues();
 	pmap_free_zero_pages(free);
 }
 
 /*
  *	Set the physical protection on the
  *	specified range of this map as requested.
  */
 void
 pmap_protect(pmap_t pmap, vm_offset_t sva, vm_offset_t eva, vm_prot_t prot)
 {
 	vm_offset_t pdnxt;
 	pd_entry_t ptpaddr;
 	pt_entry_t *pte;
 	int anychanged;
 
 	CTR4(KTR_PMAP, "pmap_protect: pmap=%p sva=0x%x eva=0x%x prot=0x%x",
 	    pmap, sva, eva, prot);
 	
 	if ((prot & VM_PROT_READ) == VM_PROT_NONE) {
 		pmap_remove(pmap, sva, eva);
 		return;
 	}
 
 #ifdef PAE
 	if ((prot & (VM_PROT_WRITE|VM_PROT_EXECUTE)) ==
 	    (VM_PROT_WRITE|VM_PROT_EXECUTE))
 		return;
 #else
 	if (prot & VM_PROT_WRITE)
 		return;
 #endif
 
 	anychanged = 0;
 
 	vm_page_lock_queues();
 	sched_pin();
 	PMAP_LOCK(pmap);
 	for (; sva < eva; sva = pdnxt) {
 		pt_entry_t obits, pbits;
 		unsigned pdirindex;
 
 		pdnxt = (sva + NBPDR) & ~PDRMASK;
 
 		pdirindex = sva >> PDRSHIFT;
 		ptpaddr = pmap->pm_pdir[pdirindex];
 
 		/*
 		 * Weed out invalid mappings. Note: we assume that the page
 		 * directory table is always allocated, and in kernel virtual.
 		 */
 		if (ptpaddr == 0)
 			continue;
 
 		/*
 		 * Check for large page.
 		 */
 		if ((ptpaddr & PG_PS) != 0) {
 			if ((prot & VM_PROT_WRITE) == 0)
 				pmap->pm_pdir[pdirindex] &= ~(PG_M|PG_RW);
 #ifdef PAE
 			if ((prot & VM_PROT_EXECUTE) == 0)
 				pmap->pm_pdir[pdirindex] |= pg_nx;
 #endif
 			anychanged = 1;
 			continue;
 		}
 
 		if (pdnxt > eva)
 			pdnxt = eva;
 
 		for (pte = pmap_pte_quick(pmap, sva); sva != pdnxt; pte++,
 		    sva += PAGE_SIZE) {
 			vm_page_t m;
 
 retry:
 			/*
 			 * Regardless of whether a pte is 32 or 64 bits in
 			 * size, PG_RW, PG_A, and PG_M are among the least
 			 * significant 32 bits.
 			 */
 			obits = pbits = *pte;
 			if ((pbits & PG_V) == 0)
 				continue;
 
 			if ((prot & VM_PROT_WRITE) == 0) {
 				if ((pbits & (PG_MANAGED | PG_M | PG_RW)) ==
 				    (PG_MANAGED | PG_M | PG_RW)) {
 					m = PHYS_TO_VM_PAGE(xpmap_mtop(pbits) &
 					    PG_FRAME);
 					vm_page_dirty(m);
 				}
 				pbits &= ~(PG_RW | PG_M);
 			}
 #ifdef PAE
 			if ((prot & VM_PROT_EXECUTE) == 0)
 				pbits |= pg_nx;
 #endif
 
 			if (pbits != obits) {
 				obits = *pte;
 				PT_SET_VA_MA(pte, pbits, TRUE);
 				if (*pte != pbits)
 					goto retry;
 				if (obits & PG_G)
 					pmap_invalidate_page(pmap, sva);
 				else
 					anychanged = 1;
 			}
 		}
 	}
 	PT_UPDATES_FLUSH();
 	if (*PMAP1)
 		PT_SET_VA_MA(PMAP1, 0, TRUE);
 	if (anychanged)
 		pmap_invalidate_all(pmap);
 	sched_unpin();
 	vm_page_unlock_queues();
 	PMAP_UNLOCK(pmap);
 }
 
 /*
  *	Insert the given physical page (p) at
  *	the specified virtual address (v) in the
  *	target physical map with the protection requested.
  *
  *	If specified, the page will be wired down, meaning
  *	that the related pte can not be reclaimed.
  *
  *	NB:  This is the only routine which MAY NOT lazy-evaluate
  *	or lose information.  That is, this routine must actually
  *	insert this page into the given map NOW.
  */
 void
 pmap_enter(pmap_t pmap, vm_offset_t va, vm_prot_t access, vm_page_t m,
     vm_prot_t prot, boolean_t wired)
 {
 	pd_entry_t *pde;
 	pt_entry_t *pte;
 	pt_entry_t newpte, origpte;
 	pv_entry_t pv;
 	vm_paddr_t opa, pa;
 	vm_page_t mpte, om;
 	boolean_t invlva;
 
 	CTR6(KTR_PMAP, "pmap_enter: pmap=%08p va=0x%08x access=0x%x ma=0x%08x prot=0x%x wired=%d",
 	    pmap, va, access, VM_PAGE_TO_MACH(m), prot, wired);
 	va = trunc_page(va);
 	KASSERT(va <= VM_MAX_KERNEL_ADDRESS, ("pmap_enter: toobig"));
 	KASSERT(va < UPT_MIN_ADDRESS || va >= UPT_MAX_ADDRESS,
 	    ("pmap_enter: invalid to pmap_enter page table pages (va: 0x%x)",
 	    va));
 	KASSERT((m->flags & (PG_FICTITIOUS | PG_UNMANAGED)) != 0 ||
 	    (m->oflags & VPO_BUSY) != 0,
 	    ("pmap_enter: page %p is not busy", m));
 
 	mpte = NULL;
 
 	vm_page_lock_queues();
 	PMAP_LOCK(pmap);
 	sched_pin();
 
 	/*
 	 * In the case that a page table page is not
 	 * resident, we are creating it here.
 	 */
 	if (va < VM_MAXUSER_ADDRESS) {
 		mpte = pmap_allocpte(pmap, va, M_WAITOK);
 	}
 
 	pde = pmap_pde(pmap, va);
 	if ((*pde & PG_PS) != 0)
 		panic("pmap_enter: attempted pmap_enter on 4MB page");
 	pte = pmap_pte_quick(pmap, va);
 
 	/*
 	 * Page Directory table entry not valid, we need a new PT page
 	 */
 	if (pte == NULL) {
 		panic("pmap_enter: invalid page directory pdir=%#jx, va=%#x",
 			(uintmax_t)pmap->pm_pdir[va >> PDRSHIFT], va);
 	}
 
 	pa = VM_PAGE_TO_PHYS(m);
 	om = NULL;
 	opa = origpte = 0;
 
 #if 0
 	KASSERT((*pte & PG_V) || (*pte == 0), ("address set but not valid pte=%p *pte=0x%016jx",
 		pte, *pte));
 #endif
 	origpte = *pte;
 	if (origpte)
 		origpte = xpmap_mtop(origpte);
 	opa = origpte & PG_FRAME;
 
 	/*
 	 * Mapping has not changed, must be protection or wiring change.
 	 */
 	if (origpte && (opa == pa)) {
 		/*
 		 * Wiring change, just update stats. We don't worry about
 		 * wiring PT pages as they remain resident as long as there
 		 * are valid mappings in them. Hence, if a user page is wired,
 		 * the PT page will be also.
 		 */
 		if (wired && ((origpte & PG_W) == 0))
 			pmap->pm_stats.wired_count++;
 		else if (!wired && (origpte & PG_W))
 			pmap->pm_stats.wired_count--;
 
 		/*
 		 * Remove extra pte reference
 		 */
 		if (mpte)
 			mpte->wire_count--;
 
 		if (origpte & PG_MANAGED) {
 			om = m;
 			pa |= PG_MANAGED;
 		}
 		goto validate;
 	} 
 
 	pv = NULL;
 
 	/*
 	 * Mapping has changed, invalidate old range and fall through to
 	 * handle validating new mapping.
 	 */
 	if (opa) {
 		if (origpte & PG_W)
 			pmap->pm_stats.wired_count--;
 		if (origpte & PG_MANAGED) {
 			om = PHYS_TO_VM_PAGE(opa);
 			pv = pmap_pvh_remove(&om->md, pmap, va);
 		} else if (va < VM_MAXUSER_ADDRESS) 
 			printf("va=0x%x is unmanaged :-( \n", va);
 			
 		if (mpte != NULL) {
 			mpte->wire_count--;
 			KASSERT(mpte->wire_count > 0,
 			    ("pmap_enter: missing reference to page table page,"
 			     " va: 0x%x", va));
 		}
 	} else
 		pmap->pm_stats.resident_count++;
 
 	/*
 	 * Enter on the PV list if part of our managed memory.
 	 */
 	if ((m->flags & (PG_FICTITIOUS | PG_UNMANAGED)) == 0) {
 		KASSERT(va < kmi.clean_sva || va >= kmi.clean_eva,
 		    ("pmap_enter: managed mapping within the clean submap"));
 		if (pv == NULL)
 			pv = get_pv_entry(pmap, FALSE);
 		pv->pv_va = va;
 		TAILQ_INSERT_TAIL(&m->md.pv_list, pv, pv_list);
 		pa |= PG_MANAGED;
 	} else if (pv != NULL)
 		free_pv_entry(pmap, pv);
 
 	/*
 	 * Increment counters
 	 */
 	if (wired)
 		pmap->pm_stats.wired_count++;
 
 validate:
 	/*
 	 * Now validate mapping with desired protection/wiring.
 	 */
 	newpte = (pt_entry_t)(pa | PG_V);
 	if ((prot & VM_PROT_WRITE) != 0) {
 		newpte |= PG_RW;
 		if ((newpte & PG_MANAGED) != 0)
 			vm_page_flag_set(m, PG_WRITEABLE);
 	}
 #ifdef PAE
 	if ((prot & VM_PROT_EXECUTE) == 0)
 		newpte |= pg_nx;
 #endif
 	if (wired)
 		newpte |= PG_W;
 	if (va < VM_MAXUSER_ADDRESS)
 		newpte |= PG_U;
 	if (pmap == kernel_pmap)
 		newpte |= pgeflag;
 
 	critical_enter();
 	/*
 	 * if the mapping or permission bits are different, we need
 	 * to update the pte.
 	 */
 	if ((origpte & ~(PG_M|PG_A)) != newpte) {
 		if (origpte) {
 			invlva = FALSE;
 			origpte = *pte;
 			PT_SET_VA(pte, newpte | PG_A, FALSE);
 			if (origpte & PG_A) {
 				if (origpte & PG_MANAGED)
 					vm_page_flag_set(om, PG_REFERENCED);
 				if (opa != VM_PAGE_TO_PHYS(m))
 					invlva = TRUE;
 #ifdef PAE
 				if ((origpte & PG_NX) == 0 &&
 				    (newpte & PG_NX) != 0)
 					invlva = TRUE;
 #endif
 			}
 			if ((origpte & (PG_M | PG_RW)) == (PG_M | PG_RW)) {
 				if ((origpte & PG_MANAGED) != 0)
 					vm_page_dirty(om);
 				if ((prot & VM_PROT_WRITE) == 0)
 					invlva = TRUE;
 			}
 			if ((origpte & PG_MANAGED) != 0 &&
 			    TAILQ_EMPTY(&om->md.pv_list))
 				vm_page_flag_clear(om, PG_WRITEABLE);
 			if (invlva)
 				pmap_invalidate_page(pmap, va);
 		} else{
 			PT_SET_VA(pte, newpte | PG_A, FALSE);
 		}
 		
 	}
 	PT_UPDATES_FLUSH();
 	critical_exit();
 	if (*PMAP1)
 		PT_SET_VA_MA(PMAP1, 0, TRUE);
 	sched_unpin();
 	vm_page_unlock_queues();
 	PMAP_UNLOCK(pmap);
 }
 
 /*
  * Maps a sequence of resident pages belonging to the same object.
  * The sequence begins with the given page m_start.  This page is
  * mapped at the given virtual address start.  Each subsequent page is
  * mapped at a virtual address that is offset from start by the same
  * amount as the page is offset from m_start within the object.  The
  * last page in the sequence is the page with the largest offset from
  * m_start that can be mapped at a virtual address less than the given
  * virtual address end.  Not every virtual page between start and end
  * is mapped; only those for which a resident page exists with the
  * corresponding offset from m_start are mapped.
  */
 void
 pmap_enter_object(pmap_t pmap, vm_offset_t start, vm_offset_t end,
     vm_page_t m_start, vm_prot_t prot)
 {
 	vm_page_t m, mpte;
 	vm_pindex_t diff, psize;
 	multicall_entry_t mcl[16];
 	multicall_entry_t *mclp = mcl;
 	int error, count = 0;
 	
 	VM_OBJECT_LOCK_ASSERT(m_start->object, MA_OWNED);
 	psize = atop(end - start);
 	    
 	mpte = NULL;
 	m = m_start;
 	vm_page_lock_queues();
 	PMAP_LOCK(pmap);
 	while (m != NULL && (diff = m->pindex - m_start->pindex) < psize) {
 		mpte = pmap_enter_quick_locked(&mclp, &count, pmap, start + ptoa(diff), m,
 		    prot, mpte);
 		m = TAILQ_NEXT(m, listq);
 		if (count == 16) {
 			error = HYPERVISOR_multicall(mcl, count);
 			KASSERT(error == 0, ("bad multicall %d", error));
 			mclp = mcl;
 			count = 0;
 		}
 	}
 	if (count) {
 		error = HYPERVISOR_multicall(mcl, count);
 		KASSERT(error == 0, ("bad multicall %d", error));
 	}
 	vm_page_unlock_queues();
 	PMAP_UNLOCK(pmap);
 }
 
 /*
  * this code makes some *MAJOR* assumptions:
  * 1. Current pmap & pmap exists.
  * 2. Not wired.
  * 3. Read access.
  * 4. No page table pages.
  * but is *MUCH* faster than pmap_enter...
  */
 
 void
 pmap_enter_quick(pmap_t pmap, vm_offset_t va, vm_page_t m, vm_prot_t prot)
 {
 	multicall_entry_t mcl, *mclp;
 	int count = 0;
 	mclp = &mcl;
 	
 	CTR4(KTR_PMAP, "pmap_enter_quick: pmap=%p va=0x%x m=%p prot=0x%x",
 	    pmap, va, m, prot);
 	
 	vm_page_lock_queues();
 	PMAP_LOCK(pmap);
 	(void)pmap_enter_quick_locked(&mclp, &count, pmap, va, m, prot, NULL);
 	if (count)
 		HYPERVISOR_multicall(&mcl, count);
 	vm_page_unlock_queues();
 	PMAP_UNLOCK(pmap);
 }
 
 #ifdef notyet
 void
 pmap_enter_quick_range(pmap_t pmap, vm_offset_t *addrs, vm_page_t *pages, vm_prot_t *prots, int count)
 {
 	int i, error, index = 0;
 	multicall_entry_t mcl[16];
 	multicall_entry_t *mclp = mcl;
 		
 	PMAP_LOCK(pmap);
 	for (i = 0; i < count; i++, addrs++, pages++, prots++) {
 		if (!pmap_is_prefaultable_locked(pmap, *addrs))
 			continue;
 
 		(void) pmap_enter_quick_locked(&mclp, &index, pmap, *addrs, *pages, *prots, NULL);
 		if (index == 16) {
 			error = HYPERVISOR_multicall(mcl, index);
 			mclp = mcl;
 			index = 0;
 			KASSERT(error == 0, ("bad multicall %d", error));
 		}
 	}
 	if (index) {
 		error = HYPERVISOR_multicall(mcl, index);
 		KASSERT(error == 0, ("bad multicall %d", error));
 	}
 	
 	PMAP_UNLOCK(pmap);
 }
 #endif
 
 static vm_page_t
 pmap_enter_quick_locked(multicall_entry_t **mclpp, int *count, pmap_t pmap, vm_offset_t va, vm_page_t m,
     vm_prot_t prot, vm_page_t mpte)
 {
 	pt_entry_t *pte;
 	vm_paddr_t pa;
 	vm_page_t free;
 	multicall_entry_t *mcl = *mclpp;
 	
 	KASSERT(va < kmi.clean_sva || va >= kmi.clean_eva ||
 	    (m->flags & (PG_FICTITIOUS | PG_UNMANAGED)) != 0,
 	    ("pmap_enter_quick_locked: managed mapping within the clean submap"));
 	mtx_assert(&vm_page_queue_mtx, MA_OWNED);
 	PMAP_LOCK_ASSERT(pmap, MA_OWNED);
 
 	/*
 	 * In the case that a page table page is not
 	 * resident, we are creating it here.
 	 */
 	if (va < VM_MAXUSER_ADDRESS) {
 		unsigned ptepindex;
 		pd_entry_t ptema;
 
 		/*
 		 * Calculate pagetable page index
 		 */
 		ptepindex = va >> PDRSHIFT;
 		if (mpte && (mpte->pindex == ptepindex)) {
 			mpte->wire_count++;
 		} else {
 			/*
 			 * Get the page directory entry
 			 */
 			ptema = pmap->pm_pdir[ptepindex];
 
 			/*
 			 * If the page table page is mapped, we just increment
 			 * the hold count, and activate it.
 			 */
 			if (ptema & PG_V) {
 				if (ptema & PG_PS)
 					panic("pmap_enter_quick: unexpected mapping into 4MB page");
 				mpte = PHYS_TO_VM_PAGE(xpmap_mtop(ptema) & PG_FRAME);
 				mpte->wire_count++;
 			} else {
 				mpte = _pmap_allocpte(pmap, ptepindex,
 				    M_NOWAIT);
 				if (mpte == NULL)
 					return (mpte);
 			}
 		}
 	} else {
 		mpte = NULL;
 	}
 
 	/*
 	 * This call to vtopte makes the assumption that we are
 	 * entering the page into the current pmap.  In order to support
 	 * quick entry into any pmap, one would likely use pmap_pte_quick.
 	 * But that isn't as quick as vtopte.
 	 */
 	KASSERT(pmap_is_current(pmap), ("entering pages in non-current pmap"));
 	pte = vtopte(va);
 	if (*pte & PG_V) {
 		if (mpte != NULL) {
 			mpte->wire_count--;
 			mpte = NULL;
 		}
 		return (mpte);
 	}
 
 	/*
 	 * Enter on the PV list if part of our managed memory.
 	 */
 	if ((m->flags & (PG_FICTITIOUS | PG_UNMANAGED)) == 0 &&
 	    !pmap_try_insert_pv_entry(pmap, va, m)) {
 		if (mpte != NULL) {
 			free = NULL;
 			if (pmap_unwire_pte_hold(pmap, mpte, &free)) {
 				pmap_invalidate_page(pmap, va);
 				pmap_free_zero_pages(free);
 			}
 			
 			mpte = NULL;
 		}
 		return (mpte);
 	}
 
 	/*
 	 * Increment counters
 	 */
 	pmap->pm_stats.resident_count++;
 
 	pa = VM_PAGE_TO_PHYS(m);
 #ifdef PAE
 	if ((prot & VM_PROT_EXECUTE) == 0)
 		pa |= pg_nx;
 #endif
 
 #if 0
 	/*
 	 * Now validate mapping with RO protection
 	 */
 	if (m->flags & (PG_FICTITIOUS|PG_UNMANAGED))
 		pte_store(pte, pa | PG_V | PG_U);
 	else
 		pte_store(pte, pa | PG_V | PG_U | PG_MANAGED);
 #else
 	/*
 	 * Now validate mapping with RO protection
 	 */
 	if (m->flags & (PG_FICTITIOUS|PG_UNMANAGED))
 		pa = 	xpmap_ptom(pa | PG_V | PG_U);
 	else
 		pa = xpmap_ptom(pa | PG_V | PG_U | PG_MANAGED);
 
 	mcl->op = __HYPERVISOR_update_va_mapping;
 	mcl->args[0] = va;
 	mcl->args[1] = (uint32_t)(pa & 0xffffffff);
 	mcl->args[2] = (uint32_t)(pa >> 32);
 	mcl->args[3] = 0;
 	*mclpp = mcl + 1;
 	*count = *count + 1;
 #endif	
 	return mpte;
 }
 
 /*
  * Make a temporary mapping for a physical address.  This is only intended
  * to be used for panic dumps.
  */
 void *
 pmap_kenter_temporary(vm_paddr_t pa, int i)
 {
 	vm_offset_t va;
 	vm_paddr_t ma = xpmap_ptom(pa);
 
 	va = (vm_offset_t)crashdumpmap + (i * PAGE_SIZE);
 	PT_SET_MA(va, (ma & ~PAGE_MASK) | PG_V | pgeflag);
 	invlpg(va);
 	return ((void *)crashdumpmap);
 }
 
 /*
  * This code maps large physical mmap regions into the
  * processor address space.  Note that some shortcuts
  * are taken, but the code works.
  */
 void
 pmap_object_init_pt(pmap_t pmap, vm_offset_t addr,
 		    vm_object_t object, vm_pindex_t pindex,
 		    vm_size_t size)
 {
 	pd_entry_t *pde;
 	vm_paddr_t pa, ptepa;
 	vm_page_t p;
 	int pat_mode;
 
 	VM_OBJECT_LOCK_ASSERT(object, MA_OWNED);
 	KASSERT(object->type == OBJT_DEVICE || object->type == OBJT_SG,
 	    ("pmap_object_init_pt: non-device object"));
 	if (pseflag && 
 	    (addr & (NBPDR - 1)) == 0 && (size & (NBPDR - 1)) == 0) {
 		if (!vm_object_populate(object, pindex, pindex + atop(size)))
 			return;
 		p = vm_page_lookup(object, pindex);
 		KASSERT(p->valid == VM_PAGE_BITS_ALL,
 		    ("pmap_object_init_pt: invalid page %p", p));
 		pat_mode = p->md.pat_mode;
 		/*
 		 * Abort the mapping if the first page is not physically
 		 * aligned to a 2/4MB page boundary.
 		 */
 		ptepa = VM_PAGE_TO_PHYS(p);
 		if (ptepa & (NBPDR - 1))
 			return;
 		/*
 		 * Skip the first page.  Abort the mapping if the rest of
 		 * the pages are not physically contiguous or have differing
 		 * memory attributes.
 		 */
 		p = TAILQ_NEXT(p, listq);
 		for (pa = ptepa + PAGE_SIZE; pa < ptepa + size;
 		    pa += PAGE_SIZE) {
 			KASSERT(p->valid == VM_PAGE_BITS_ALL,
 			    ("pmap_object_init_pt: invalid page %p", p));
 			if (pa != VM_PAGE_TO_PHYS(p) ||
 			    pat_mode != p->md.pat_mode)
 				return;
 			p = TAILQ_NEXT(p, listq);
 		}
 		/* Map using 2/4MB pages. */
 		PMAP_LOCK(pmap);
 		for (pa = ptepa | pmap_cache_bits(pat_mode, 1); pa < ptepa +
 		    size; pa += NBPDR) {
 			pde = pmap_pde(pmap, addr);
 			if (*pde == 0) {
 				pde_store(pde, pa | PG_PS | PG_M | PG_A |
 				    PG_U | PG_RW | PG_V);
 				pmap->pm_stats.resident_count += NBPDR /
 				    PAGE_SIZE;
 				pmap_pde_mappings++;
 			}
 			/* Else continue on if the PDE is already valid. */
 			addr += NBPDR;
 		}
 		PMAP_UNLOCK(pmap);
 	}
 }
 
 /*
  *	Routine:	pmap_change_wiring
  *	Function:	Change the wiring attribute for a map/virtual-address
  *			pair.
  *	In/out conditions:
  *			The mapping must already exist in the pmap.
  */
 void
 pmap_change_wiring(pmap_t pmap, vm_offset_t va, boolean_t wired)
 {
 	pt_entry_t *pte;
 
 	vm_page_lock_queues();
 	PMAP_LOCK(pmap);
 	pte = pmap_pte(pmap, va);
 
 	if (wired && !pmap_pte_w(pte)) {
 		PT_SET_VA_MA((pte), *(pte) | PG_W, TRUE);
 		pmap->pm_stats.wired_count++;
 	} else if (!wired && pmap_pte_w(pte)) {
 		PT_SET_VA_MA((pte), *(pte) & ~PG_W, TRUE);
 		pmap->pm_stats.wired_count--;
 	}
 	
 	/*
 	 * Wiring is not a hardware characteristic so there is no need to
 	 * invalidate TLB.
 	 */
 	pmap_pte_release(pte);
 	PMAP_UNLOCK(pmap);
 	vm_page_unlock_queues();
 }
 
 
 
 /*
  *	Copy the range specified by src_addr/len
  *	from the source map to the range dst_addr/len
  *	in the destination map.
  *
  *	This routine is only advisory and need not do anything.
  */
 
 void
 pmap_copy(pmap_t dst_pmap, pmap_t src_pmap, vm_offset_t dst_addr, vm_size_t len,
 	  vm_offset_t src_addr)
 {
 	vm_page_t   free;
 	vm_offset_t addr;
 	vm_offset_t end_addr = src_addr + len;
 	vm_offset_t pdnxt;
 
 	if (dst_addr != src_addr)
 		return;
 
 	if (!pmap_is_current(src_pmap)) {
 		CTR2(KTR_PMAP,
 		    "pmap_copy, skipping: pdir[PTDPTDI]=0x%jx PTDpde[0]=0x%jx",
 		    (src_pmap->pm_pdir[PTDPTDI] & PG_FRAME), (PTDpde[0] & PG_FRAME));
 		
 		return;
 	}
 	CTR5(KTR_PMAP, "pmap_copy:  dst_pmap=%p src_pmap=%p dst_addr=0x%x len=%d src_addr=0x%x",
 	    dst_pmap, src_pmap, dst_addr, len, src_addr);
 	
 #ifdef HAMFISTED_LOCKING
 	mtx_lock(&createdelete_lock);
 #endif
 
 	vm_page_lock_queues();
 	if (dst_pmap < src_pmap) {
 		PMAP_LOCK(dst_pmap);
 		PMAP_LOCK(src_pmap);
 	} else {
 		PMAP_LOCK(src_pmap);
 		PMAP_LOCK(dst_pmap);
 	}
 	sched_pin();
 	for (addr = src_addr; addr < end_addr; addr = pdnxt) {
 		pt_entry_t *src_pte, *dst_pte;
 		vm_page_t dstmpte, srcmpte;
 		pd_entry_t srcptepaddr;
 		unsigned ptepindex;
 
 		KASSERT(addr < UPT_MIN_ADDRESS,
 		    ("pmap_copy: invalid to pmap_copy page tables"));
 
 		pdnxt = (addr + NBPDR) & ~PDRMASK;
 		ptepindex = addr >> PDRSHIFT;
 
 		srcptepaddr = PT_GET(&src_pmap->pm_pdir[ptepindex]);
 		if (srcptepaddr == 0)
 			continue;
 			
 		if (srcptepaddr & PG_PS) {
 			if (dst_pmap->pm_pdir[ptepindex] == 0) {
 				PD_SET_VA(dst_pmap, ptepindex, srcptepaddr & ~PG_W, TRUE);
 				dst_pmap->pm_stats.resident_count +=
 				    NBPDR / PAGE_SIZE;
 			}
 			continue;
 		}
 
 		srcmpte = PHYS_TO_VM_PAGE(srcptepaddr & PG_FRAME);
 		KASSERT(srcmpte->wire_count > 0,
 		    ("pmap_copy: source page table page is unused"));
 
 		if (pdnxt > end_addr)
 			pdnxt = end_addr;
 
 		src_pte = vtopte(addr);
 		while (addr < pdnxt) {
 			pt_entry_t ptetemp;
 			ptetemp = *src_pte;
 			/*
 			 * we only virtual copy managed pages
 			 */
 			if ((ptetemp & PG_MANAGED) != 0) {
 				dstmpte = pmap_allocpte(dst_pmap, addr,
 				    M_NOWAIT);
 				if (dstmpte == NULL)
 					break;
 				dst_pte = pmap_pte_quick(dst_pmap, addr);
 				if (*dst_pte == 0 &&
 				    pmap_try_insert_pv_entry(dst_pmap, addr,
 				    PHYS_TO_VM_PAGE(xpmap_mtop(ptetemp) & PG_FRAME))) {
 					/*
 					 * Clear the wired, modified, and
 					 * accessed (referenced) bits
 					 * during the copy.
 					 */
 					KASSERT(ptetemp != 0, ("src_pte not set"));
 					PT_SET_VA_MA(dst_pte, ptetemp & ~(PG_W | PG_M | PG_A), TRUE /* XXX debug */);
 					KASSERT(*dst_pte == (ptetemp & ~(PG_W | PG_M | PG_A)),
 					    ("no pmap copy expected: 0x%jx saw: 0x%jx",
 						ptetemp &  ~(PG_W | PG_M | PG_A), *dst_pte));
 					dst_pmap->pm_stats.resident_count++;
 	 			} else {
 					free = NULL;
 					if (pmap_unwire_pte_hold(dst_pmap,
 					    dstmpte, &free)) {
 						pmap_invalidate_page(dst_pmap,
 						    addr);
 						pmap_free_zero_pages(free);
 					}
 				}
 				if (dstmpte->wire_count >= srcmpte->wire_count)
 					break;
 			}
 			addr += PAGE_SIZE;
 			src_pte++;
 		}
 	}
 	PT_UPDATES_FLUSH();
 	sched_unpin();
 	vm_page_unlock_queues();
 	PMAP_UNLOCK(src_pmap);
 	PMAP_UNLOCK(dst_pmap);
 
 #ifdef HAMFISTED_LOCKING
 	mtx_unlock(&createdelete_lock);
 #endif
 }	
 
 static __inline void
 pagezero(void *page)
 {
 #if defined(I686_CPU)
 	if (cpu_class == CPUCLASS_686) {
 #if defined(CPU_ENABLE_SSE)
 		if (cpu_feature & CPUID_SSE2)
 			sse2_pagezero(page);
 		else
 #endif
 			i686_pagezero(page);
 	} else
 #endif
 		bzero(page, PAGE_SIZE);
 }
 
 /*
  *	pmap_zero_page zeros the specified hardware page by mapping 
  *	the page into KVM and using bzero to clear its contents.
  */
 void
 pmap_zero_page(vm_page_t m)
 {
 	struct sysmaps *sysmaps;
 
 	sysmaps = &sysmaps_pcpu[PCPU_GET(cpuid)];
 	mtx_lock(&sysmaps->lock);
 	if (*sysmaps->CMAP2)
 		panic("pmap_zero_page: CMAP2 busy");
 	sched_pin();
 	PT_SET_MA(sysmaps->CADDR2, PG_V | PG_RW | VM_PAGE_TO_MACH(m) | PG_A | PG_M);
 	pagezero(sysmaps->CADDR2);
 	PT_SET_MA(sysmaps->CADDR2, 0);
 	sched_unpin();
 	mtx_unlock(&sysmaps->lock);
 }
 
 /*
  *	pmap_zero_page_area zeros the specified hardware page by mapping 
  *	the page into KVM and using bzero to clear its contents.
  *
  *	off and size may not cover an area beyond a single hardware page.
  */
 void
 pmap_zero_page_area(vm_page_t m, int off, int size)
 {
 	struct sysmaps *sysmaps;
 
 	sysmaps = &sysmaps_pcpu[PCPU_GET(cpuid)];
 	mtx_lock(&sysmaps->lock);
 	if (*sysmaps->CMAP2)
 		panic("pmap_zero_page_area: CMAP2 busy");
 	sched_pin();
 	PT_SET_MA(sysmaps->CADDR2, PG_V | PG_RW | VM_PAGE_TO_MACH(m) | PG_A | PG_M);
 
 	if (off == 0 && size == PAGE_SIZE) 
 		pagezero(sysmaps->CADDR2);
 	else
 		bzero((char *)sysmaps->CADDR2 + off, size);
 	PT_SET_MA(sysmaps->CADDR2, 0);
 	sched_unpin();
 	mtx_unlock(&sysmaps->lock);
 }
 
 /*
  *	pmap_zero_page_idle zeros the specified hardware page by mapping 
  *	the page into KVM and using bzero to clear its contents.  This
  *	is intended to be called from the vm_pagezero process only and
  *	outside of Giant.
  */
 void
 pmap_zero_page_idle(vm_page_t m)
 {
 
 	if (*CMAP3)
 		panic("pmap_zero_page_idle: CMAP3 busy");
 	sched_pin();
 	PT_SET_MA(CADDR3, PG_V | PG_RW | VM_PAGE_TO_MACH(m) | PG_A | PG_M);
 	pagezero(CADDR3);
 	PT_SET_MA(CADDR3, 0);
 	sched_unpin();
 }
 
 /*
  *	pmap_copy_page copies the specified (machine independent)
  *	page by mapping the page into virtual memory and using
  *	bcopy to copy the page, one machine dependent page at a
  *	time.
  */
 void
 pmap_copy_page(vm_page_t src, vm_page_t dst)
 {
 	struct sysmaps *sysmaps;
 
 	sysmaps = &sysmaps_pcpu[PCPU_GET(cpuid)];
 	mtx_lock(&sysmaps->lock);
 	if (*sysmaps->CMAP1)
 		panic("pmap_copy_page: CMAP1 busy");
 	if (*sysmaps->CMAP2)
 		panic("pmap_copy_page: CMAP2 busy");
 	sched_pin();
 	PT_SET_MA(sysmaps->CADDR1, PG_V | VM_PAGE_TO_MACH(src) | PG_A);
 	PT_SET_MA(sysmaps->CADDR2, PG_V | PG_RW | VM_PAGE_TO_MACH(dst) | PG_A | PG_M);
 	bcopy(sysmaps->CADDR1, sysmaps->CADDR2, PAGE_SIZE);
 	PT_SET_MA(sysmaps->CADDR1, 0);
 	PT_SET_MA(sysmaps->CADDR2, 0);
 	sched_unpin();
 	mtx_unlock(&sysmaps->lock);
 }
 
 /*
  * Returns true if the pmap's pv is one of the first
  * 16 pvs linked to from this page.  This count may
  * be changed upwards or downwards in the future; it
  * is only necessary that true be returned for a small
  * subset of pmaps for proper page aging.
  */
 boolean_t
 pmap_page_exists_quick(pmap_t pmap, vm_page_t m)
 {
 	pv_entry_t pv;
 	int loops = 0;
 	boolean_t rv;
 
 	KASSERT((m->flags & (PG_FICTITIOUS | PG_UNMANAGED)) == 0,
 	    ("pmap_page_exists_quick: page %p is not managed", m));
 	rv = FALSE;
 	vm_page_lock_queues();
 	TAILQ_FOREACH(pv, &m->md.pv_list, pv_list) {
 		if (PV_PMAP(pv) == pmap) {
 			rv = TRUE;
 			break;
 		}
 		loops++;
 		if (loops >= 16)
 			break;
 	}
 	vm_page_unlock_queues();
 	return (rv);
 }
 
 /*
  *	pmap_page_wired_mappings:
  *
  *	Return the number of managed mappings to the given physical page
  *	that are wired.
  */
 int
 pmap_page_wired_mappings(vm_page_t m)
 {
 	pv_entry_t pv;
 	pt_entry_t *pte;
 	pmap_t pmap;
 	int count;
 
 	count = 0;
 	if ((m->flags & PG_FICTITIOUS) != 0)
 		return (count);
 	vm_page_lock_queues();
 	sched_pin();
 	TAILQ_FOREACH(pv, &m->md.pv_list, pv_list) {
 		pmap = PV_PMAP(pv);
 		PMAP_LOCK(pmap);
 		pte = pmap_pte_quick(pmap, pv->pv_va);
 		if ((*pte & PG_W) != 0)
 			count++;
 		PMAP_UNLOCK(pmap);
 	}
 	sched_unpin();
 	vm_page_unlock_queues();
 	return (count);
 }
 
 /*
  * Returns TRUE if the given page is mapped individually or as part of
  * a 4mpage.  Otherwise, returns FALSE.
  */
 boolean_t
 pmap_page_is_mapped(vm_page_t m)
 {
 	boolean_t rv;
 
 	if ((m->flags & (PG_FICTITIOUS | PG_UNMANAGED)) != 0)
 		return (FALSE);
 	vm_page_lock_queues();
 	rv = !TAILQ_EMPTY(&m->md.pv_list) ||
 	    !TAILQ_EMPTY(&pa_to_pvh(VM_PAGE_TO_PHYS(m))->pv_list);
 	vm_page_unlock_queues();
 	return (rv);
 }
 
 /*
  * Remove all pages from the specified address space;
  * this speeds up process exit.  Also, this code
  * is special cased for the current process only, but
  * can have the more generic (and slightly slower)
  * mode enabled.  This is much faster than pmap_remove
  * in the case of running down an entire address space.
  */
 void
 pmap_remove_pages(pmap_t pmap)
 {
 	pt_entry_t *pte, tpte;
 	vm_page_t m, free = NULL;
 	pv_entry_t pv;
 	struct pv_chunk *pc, *npc;
 	int field, idx;
 	int32_t bit;
 	uint32_t inuse, bitmask;
 	int allfree;
 
 	CTR1(KTR_PMAP, "pmap_remove_pages: pmap=%p", pmap);
 	
 	if (pmap != vmspace_pmap(curthread->td_proc->p_vmspace)) {
 		printf("warning: pmap_remove_pages called with non-current pmap\n");
 		return;
 	}
 	vm_page_lock_queues();
 	KASSERT(pmap_is_current(pmap), ("removing pages from non-current pmap"));
 	PMAP_LOCK(pmap);
 	sched_pin();
 	TAILQ_FOREACH_SAFE(pc, &pmap->pm_pvchunk, pc_list, npc) {
 		allfree = 1;
 		for (field = 0; field < _NPCM; field++) {
 			inuse = (~(pc->pc_map[field])) & pc_freemask[field];
 			while (inuse != 0) {
 				bit = bsfl(inuse);
 				bitmask = 1UL << bit;
 				idx = field * 32 + bit;
 				pv = &pc->pc_pventry[idx];
 				inuse &= ~bitmask;
 
 				pte = vtopte(pv->pv_va);
 				tpte = *pte ? xpmap_mtop(*pte) : 0;
 
 				if (tpte == 0) {
 					printf(
 					    "TPTE at %p  IS ZERO @ VA %08x\n",
 					    pte, pv->pv_va);
 					panic("bad pte");
 				}
 
 /*
  * We cannot remove wired pages from a process' mapping at this time
  */
 				if (tpte & PG_W) {
 					allfree = 0;
 					continue;
 				}
 
 				m = PHYS_TO_VM_PAGE(tpte & PG_FRAME);
 				KASSERT(m->phys_addr == (tpte & PG_FRAME),
 				    ("vm_page_t %p phys_addr mismatch %016jx %016jx",
 				    m, (uintmax_t)m->phys_addr,
 				    (uintmax_t)tpte));
 
 				KASSERT(m < &vm_page_array[vm_page_array_size],
 					("pmap_remove_pages: bad tpte %#jx",
 					(uintmax_t)tpte));
 
 
 				PT_CLEAR_VA(pte, FALSE);
 				
 				/*
 				 * Update the vm_page_t clean/reference bits.
 				 */
 				if (tpte & PG_M)
 					vm_page_dirty(m);
 
 				TAILQ_REMOVE(&m->md.pv_list, pv, pv_list);
 				if (TAILQ_EMPTY(&m->md.pv_list))
 					vm_page_flag_clear(m, PG_WRITEABLE);
 
 				pmap_unuse_pt(pmap, pv->pv_va, &free);
 
 				/* Mark free */
 				PV_STAT(pv_entry_frees++);
 				PV_STAT(pv_entry_spare++);
 				pv_entry_count--;
 				pc->pc_map[field] |= bitmask;
 				pmap->pm_stats.resident_count--;			
 			}
 		}
 		PT_UPDATES_FLUSH();
 		if (allfree) {
 			PV_STAT(pv_entry_spare -= _NPCPV);
 			PV_STAT(pc_chunk_count--);
 			PV_STAT(pc_chunk_frees++);
 			TAILQ_REMOVE(&pmap->pm_pvchunk, pc, pc_list);
 			m = PHYS_TO_VM_PAGE(pmap_kextract((vm_offset_t)pc));
 			pmap_qremove((vm_offset_t)pc, 1);
 			vm_page_unwire(m, 0);
 			vm_page_free(m);
 			pmap_ptelist_free(&pv_vafree, (vm_offset_t)pc);
 		}
 	}
 	PT_UPDATES_FLUSH();
 	if (*PMAP1)
 		PT_SET_MA(PADDR1, 0);
 
 	sched_unpin();
 	pmap_invalidate_all(pmap);
 	vm_page_unlock_queues();
 	PMAP_UNLOCK(pmap);
 	pmap_free_zero_pages(free);
 }
 
 /*
  *	pmap_is_modified:
  *
  *	Return whether or not the specified physical page was modified
  *	in any physical maps.
  */
 boolean_t
 pmap_is_modified(vm_page_t m)
 {
 	pv_entry_t pv;
 	pt_entry_t *pte;
 	pmap_t pmap;
 	boolean_t rv;
 
 	KASSERT((m->flags & (PG_FICTITIOUS | PG_UNMANAGED)) == 0,
 	    ("pmap_is_modified: page %p is not managed", m));
 	rv = FALSE;
 
 	/*
 	 * If the page is not VPO_BUSY, then PG_WRITEABLE cannot be
 	 * concurrently set while the object is locked.  Thus, if PG_WRITEABLE
 	 * is clear, no PTEs can have PG_M set.
 	 */
 	VM_OBJECT_LOCK_ASSERT(m->object, MA_OWNED);
 	if ((m->oflags & VPO_BUSY) == 0 &&
 	    (m->flags & PG_WRITEABLE) == 0)
 		return (rv);
 	vm_page_lock_queues();
 	sched_pin();
 	TAILQ_FOREACH(pv, &m->md.pv_list, pv_list) {
 		pmap = PV_PMAP(pv);
 		PMAP_LOCK(pmap);
 		pte = pmap_pte_quick(pmap, pv->pv_va);
 		rv = (*pte & PG_M) != 0;
 		PMAP_UNLOCK(pmap);
 		if (rv)
 			break;
 	}
 	if (*PMAP1)
 		PT_SET_MA(PADDR1, 0);
 	sched_unpin();
 	vm_page_unlock_queues();
 	return (rv);
 }
 
 /*
  *	pmap_is_prefaultable:
  *
  *	Return whether or not the specified virtual address is eligible
  *	for prefault.
  */
 static boolean_t
 pmap_is_prefaultable_locked(pmap_t pmap, vm_offset_t addr)
 {
 	pt_entry_t *pte;
 	boolean_t rv = FALSE;
 
 	return (rv);
 	
 	if (pmap_is_current(pmap) && *pmap_pde(pmap, addr)) {
 		pte = vtopte(addr);
 		rv = (*pte == 0);
 	}
 	return (rv);
 }
 
 boolean_t
 pmap_is_prefaultable(pmap_t pmap, vm_offset_t addr)
 {
 	boolean_t rv;
 	
 	PMAP_LOCK(pmap);
 	rv = pmap_is_prefaultable_locked(pmap, addr);
 	PMAP_UNLOCK(pmap);
 	return (rv);
 }
 
 boolean_t
 pmap_is_referenced(vm_page_t m)
 {
 	pv_entry_t pv;
 	pt_entry_t *pte;
 	pmap_t pmap;
 	boolean_t rv;
 
 	KASSERT((m->flags & (PG_FICTITIOUS | PG_UNMANAGED)) == 0,
 	    ("pmap_is_referenced: page %p is not managed", m));
 	rv = FALSE;
 	vm_page_lock_queues();
 	sched_pin();
 	TAILQ_FOREACH(pv, &m->md.pv_list, pv_list) {
 		pmap = PV_PMAP(pv);
 		PMAP_LOCK(pmap);
 		pte = pmap_pte_quick(pmap, pv->pv_va);
 		rv = (*pte & (PG_A | PG_V)) == (PG_A | PG_V);
 		PMAP_UNLOCK(pmap);
 		if (rv)
 			break;
 	}
 	if (*PMAP1)
 		PT_SET_MA(PADDR1, 0);
 	sched_unpin();
 	vm_page_unlock_queues();
 	return (rv);
 }
 
 void
 pmap_map_readonly(pmap_t pmap, vm_offset_t va, int len)
 {
 	int i, npages = round_page(len) >> PAGE_SHIFT;
 	for (i = 0; i < npages; i++) {
 		pt_entry_t *pte;
 		pte = pmap_pte(pmap, (vm_offset_t)(va + i*PAGE_SIZE));
 		vm_page_lock_queues();
 		pte_store(pte, xpmap_mtop(*pte & ~(PG_RW|PG_M)));
 		vm_page_unlock_queues();
 		PMAP_MARK_PRIV(xpmap_mtop(*pte));
 		pmap_pte_release(pte);
 	}
 }
 
 void
 pmap_map_readwrite(pmap_t pmap, vm_offset_t va, int len)
 {
 	int i, npages = round_page(len) >> PAGE_SHIFT;
 	for (i = 0; i < npages; i++) {
 		pt_entry_t *pte;
 		pte = pmap_pte(pmap, (vm_offset_t)(va + i*PAGE_SIZE));
 		PMAP_MARK_UNPRIV(xpmap_mtop(*pte));
 		vm_page_lock_queues();
 		pte_store(pte, xpmap_mtop(*pte) | (PG_RW|PG_M));
 		vm_page_unlock_queues();
 		pmap_pte_release(pte);
 	}
 }
 
 /*
  * Clear the write and modified bits in each of the given page's mappings.
  */
 void
 pmap_remove_write(vm_page_t m)
 {
 	pv_entry_t pv;
 	pmap_t pmap;
 	pt_entry_t oldpte, *pte;
 
 	KASSERT((m->flags & (PG_FICTITIOUS | PG_UNMANAGED)) == 0,
 	    ("pmap_remove_write: page %p is not managed", m));
 
 	/*
 	 * If the page is not VPO_BUSY, then PG_WRITEABLE cannot be set by
 	 * another thread while the object is locked.  Thus, if PG_WRITEABLE
 	 * is clear, no page table entries need updating.
 	 */
 	VM_OBJECT_LOCK_ASSERT(m->object, MA_OWNED);
 	if ((m->oflags & VPO_BUSY) == 0 &&
 	    (m->flags & PG_WRITEABLE) == 0)
 		return;
 	vm_page_lock_queues();
 	sched_pin();
 	TAILQ_FOREACH(pv, &m->md.pv_list, pv_list) {
 		pmap = PV_PMAP(pv);
 		PMAP_LOCK(pmap);
 		pte = pmap_pte_quick(pmap, pv->pv_va);
 retry:
 		oldpte = *pte;
 		if ((oldpte & PG_RW) != 0) {
 			vm_paddr_t newpte = oldpte & ~(PG_RW | PG_M);
 			
 			/*
 			 * Regardless of whether a pte is 32 or 64 bits
 			 * in size, PG_RW and PG_M are among the least
 			 * significant 32 bits.
 			 */
 			PT_SET_VA_MA(pte, newpte, TRUE);
 			if (*pte != newpte)
 				goto retry;
 			
 			if ((oldpte & PG_M) != 0)
 				vm_page_dirty(m);
 			pmap_invalidate_page(pmap, pv->pv_va);
 		}
 		PMAP_UNLOCK(pmap);
 	}
 	vm_page_flag_clear(m, PG_WRITEABLE);
 	PT_UPDATES_FLUSH();
 	if (*PMAP1)
 		PT_SET_MA(PADDR1, 0);
 	sched_unpin();
 	vm_page_unlock_queues();
 }
 
 /*
  *	pmap_ts_referenced:
  *
  *	Return a count of reference bits for a page, clearing those bits.
  *	It is not necessary for every reference bit to be cleared, but it
  *	is necessary that 0 only be returned when there are truly no
  *	reference bits set.
  *
  *	XXX: The exact number of bits to check and clear is a matter that
  *	should be tested and standardized at some point in the future for
  *	optimal aging of shared pages.
  */
 int
 pmap_ts_referenced(vm_page_t m)
 {
 	pv_entry_t pv, pvf, pvn;
 	pmap_t pmap;
 	pt_entry_t *pte;
 	int rtval = 0;
 
 	KASSERT((m->flags & (PG_FICTITIOUS | PG_UNMANAGED)) == 0,
 	    ("pmap_ts_referenced: page %p is not managed", m));
 	vm_page_lock_queues();
 	sched_pin();
 	if ((pv = TAILQ_FIRST(&m->md.pv_list)) != NULL) {
 		pvf = pv;
 		do {
 			pvn = TAILQ_NEXT(pv, pv_list);
 			TAILQ_REMOVE(&m->md.pv_list, pv, pv_list);
 			TAILQ_INSERT_TAIL(&m->md.pv_list, pv, pv_list);
 			pmap = PV_PMAP(pv);
 			PMAP_LOCK(pmap);
 			pte = pmap_pte_quick(pmap, pv->pv_va);
 			if ((*pte & PG_A) != 0) {
 				PT_SET_VA_MA(pte, *pte & ~PG_A, FALSE);
 				pmap_invalidate_page(pmap, pv->pv_va);
 				rtval++;
 				if (rtval > 4)
 					pvn = NULL;
 			}
 			PMAP_UNLOCK(pmap);
 		} while ((pv = pvn) != NULL && pv != pvf);
 	}
 	PT_UPDATES_FLUSH();
 	if (*PMAP1)
 		PT_SET_MA(PADDR1, 0);
 
 	sched_unpin();
 	vm_page_unlock_queues();
 	return (rtval);
 }
 
 /*
  *	Clear the modify bits on the specified physical page.
  */
 void
 pmap_clear_modify(vm_page_t m)
 {
 	pv_entry_t pv;
 	pmap_t pmap;
 	pt_entry_t *pte;
 
 	KASSERT((m->flags & (PG_FICTITIOUS | PG_UNMANAGED)) == 0,
 	    ("pmap_clear_modify: page %p is not managed", m));
 	VM_OBJECT_LOCK_ASSERT(m->object, MA_OWNED);
 	KASSERT((m->oflags & VPO_BUSY) == 0,
 	    ("pmap_clear_modify: page %p is busy", m));
 
 	/*
 	 * If the page is not PG_WRITEABLE, then no PTEs can have PG_M set.
 	 * If the object containing the page is locked and the page is not
 	 * VPO_BUSY, then PG_WRITEABLE cannot be concurrently set.
 	 */
 	if ((m->flags & PG_WRITEABLE) == 0)
 		return;
 	vm_page_lock_queues();
 	sched_pin();
 	TAILQ_FOREACH(pv, &m->md.pv_list, pv_list) {
 		pmap = PV_PMAP(pv);
 		PMAP_LOCK(pmap);
 		pte = pmap_pte_quick(pmap, pv->pv_va);
 		if ((*pte & PG_M) != 0) {
 			/*
 			 * Regardless of whether a pte is 32 or 64 bits
 			 * in size, PG_M is among the least significant
 			 * 32 bits. 
 			 */
 			PT_SET_VA_MA(pte, *pte & ~PG_M, FALSE);
 			pmap_invalidate_page(pmap, pv->pv_va);
 		}
 		PMAP_UNLOCK(pmap);
 	}
 	sched_unpin();
 	vm_page_unlock_queues();
 }
 
 /*
  *	pmap_clear_reference:
  *
  *	Clear the reference bit on the specified physical page.
  */
 void
 pmap_clear_reference(vm_page_t m)
 {
 	pv_entry_t pv;
 	pmap_t pmap;
 	pt_entry_t *pte;
 
 	KASSERT((m->flags & (PG_FICTITIOUS | PG_UNMANAGED)) == 0,
 	    ("pmap_clear_reference: page %p is not managed", m));
 	vm_page_lock_queues();
 	sched_pin();
 	TAILQ_FOREACH(pv, &m->md.pv_list, pv_list) {
 		pmap = PV_PMAP(pv);
 		PMAP_LOCK(pmap);
 		pte = pmap_pte_quick(pmap, pv->pv_va);
 		if ((*pte & PG_A) != 0) {
 			/*
 			 * Regardless of whether a pte is 32 or 64 bits
 			 * in size, PG_A is among the least significant
 			 * 32 bits. 
 			 */
 			PT_SET_VA_MA(pte, *pte & ~PG_A, FALSE);
 			pmap_invalidate_page(pmap, pv->pv_va);
 		}
 		PMAP_UNLOCK(pmap);
 	}
 	sched_unpin();
 	vm_page_unlock_queues();
 }
 
 /*
  * Miscellaneous support routines follow
  */
 
 /*
  * Map a set of physical memory pages into the kernel virtual
  * address space. Return a pointer to where it is mapped. This
  * routine is intended to be used for mapping device memory,
  * NOT real memory.
  */
 void *
 pmap_mapdev_attr(vm_paddr_t pa, vm_size_t size, int mode)
 {
 	vm_offset_t va, offset;
 	vm_size_t tmpsize;
 
 	offset = pa & PAGE_MASK;
 	size = roundup(offset + size, PAGE_SIZE);
 	pa = pa & PG_FRAME;
 
 	if (pa < KERNLOAD && pa + size <= KERNLOAD)
 		va = KERNBASE + pa;
 	else
 		va = kmem_alloc_nofault(kernel_map, size);
 	if (!va)
 		panic("pmap_mapdev: Couldn't alloc kernel virtual memory");
 
 	for (tmpsize = 0; tmpsize < size; tmpsize += PAGE_SIZE)
 		pmap_kenter_attr(va + tmpsize, pa + tmpsize, mode);
 	pmap_invalidate_range(kernel_pmap, va, va + tmpsize);
 	pmap_invalidate_cache_range(va, va + size);
 	return ((void *)(va + offset));
 }
 
 void *
 pmap_mapdev(vm_paddr_t pa, vm_size_t size)
 {
 
 	return (pmap_mapdev_attr(pa, size, PAT_UNCACHEABLE));
 }
 
 void *
 pmap_mapbios(vm_paddr_t pa, vm_size_t size)
 {
 
 	return (pmap_mapdev_attr(pa, size, PAT_WRITE_BACK));
 }
 
 void
 pmap_unmapdev(vm_offset_t va, vm_size_t size)
 {
 	vm_offset_t base, offset, tmpva;
 
 	if (va >= KERNBASE && va + size <= KERNBASE + KERNLOAD)
 		return;
 	base = trunc_page(va);
 	offset = va & PAGE_MASK;
 	size = roundup(offset + size, PAGE_SIZE);
 	critical_enter();
 	for (tmpva = base; tmpva < (base + size); tmpva += PAGE_SIZE)
 		pmap_kremove(tmpva);
 	pmap_invalidate_range(kernel_pmap, va, tmpva);
 	critical_exit();
 	kmem_free(kernel_map, base, size);
 }
 
 /*
  * Sets the memory attribute for the specified page.
  */
 void
 pmap_page_set_memattr(vm_page_t m, vm_memattr_t ma)
 {
 	struct sysmaps *sysmaps;
 	vm_offset_t sva, eva;
 
 	m->md.pat_mode = ma;
 	if ((m->flags & PG_FICTITIOUS) != 0)
 		return;
 
 	/*
 	 * If "m" is a normal page, flush it from the cache.
 	 * See pmap_invalidate_cache_range().
 	 *
 	 * First, try to find an existing mapping of the page by sf
 	 * buffer. sf_buf_invalidate_cache() modifies mapping and
 	 * flushes the cache.
 	 */    
 	if (sf_buf_invalidate_cache(m))
 		return;
 
 	/*
 	 * If page is not mapped by sf buffer, but CPU does not
 	 * support self snoop, map the page transient and do
 	 * invalidation. In the worst case, whole cache is flushed by
 	 * pmap_invalidate_cache_range().
 	 */
 	if ((cpu_feature & (CPUID_SS|CPUID_CLFSH)) == CPUID_CLFSH) {
 		sysmaps = &sysmaps_pcpu[PCPU_GET(cpuid)];
 		mtx_lock(&sysmaps->lock);
 		if (*sysmaps->CMAP2)
 			panic("pmap_page_set_memattr: CMAP2 busy");
 		sched_pin();
 		PT_SET_MA(sysmaps->CADDR2, PG_V | PG_RW |
 		    VM_PAGE_TO_MACH(m) | PG_A | PG_M |
 		    pmap_cache_bits(m->md.pat_mode, 0));
 		invlcaddr(sysmaps->CADDR2);
 		sva = (vm_offset_t)sysmaps->CADDR2;
 		eva = sva + PAGE_SIZE;
 	} else
 		sva = eva = 0; /* gcc */
 	pmap_invalidate_cache_range(sva, eva);
 	if (sva != 0) {
 		PT_SET_MA(sysmaps->CADDR2, 0);
 		sched_unpin();
 		mtx_unlock(&sysmaps->lock);
 	}
 }
 
 int
 pmap_change_attr(va, size, mode)
 	vm_offset_t va;
 	vm_size_t size;
 	int mode;
 {
 	vm_offset_t base, offset, tmpva;
 	pt_entry_t *pte;
 	u_int opte, npte;
 	pd_entry_t *pde;
 	boolean_t changed;
 
 	base = trunc_page(va);
 	offset = va & PAGE_MASK;
 	size = roundup(offset + size, PAGE_SIZE);
 
 	/* Only supported on kernel virtual addresses. */
 	if (base <= VM_MAXUSER_ADDRESS)
 		return (EINVAL);
 
 	/* 4MB pages and pages that aren't mapped aren't supported. */
 	for (tmpva = base; tmpva < (base + size); tmpva += PAGE_SIZE) {
 		pde = pmap_pde(kernel_pmap, tmpva);
 		if (*pde & PG_PS)
 			return (EINVAL);
 		if ((*pde & PG_V) == 0)
 			return (EINVAL);
 		pte = vtopte(tmpva);
 		if ((*pte & PG_V) == 0)
 			return (EINVAL);
 	}
 
 	changed = FALSE;
 
 	/*
 	 * Ok, all the pages exist and are 4k, so run through them updating
 	 * their cache mode.
 	 */
 	for (tmpva = base; size > 0; ) {
 		pte = vtopte(tmpva);
 
 		/*
 		 * The cache mode bits are all in the low 32-bits of the
 		 * PTE, so we can just spin on updating the low 32-bits.
 		 */
 		do {
 			opte = *(u_int *)pte;
 			npte = opte & ~(PG_PTE_PAT | PG_NC_PCD | PG_NC_PWT);
 			npte |= pmap_cache_bits(mode, 0);
 			PT_SET_VA_MA(pte, npte, TRUE);
 		} while (npte != opte && (*pte != npte));
 		if (npte != opte)
 			changed = TRUE;
 		tmpva += PAGE_SIZE;
 		size -= PAGE_SIZE;
 	}
 
 	/*
 	 * Flush CPU caches to make sure any data isn't cached that shouldn't
 	 * be, etc.
 	 */
 	if (changed) {
 		pmap_invalidate_range(kernel_pmap, base, tmpva);
 		pmap_invalidate_cache_range(base, tmpva);
 	}
 	return (0);
 }
 
 /*
  * perform the pmap work for mincore
  */
 int
 pmap_mincore(pmap_t pmap, vm_offset_t addr, vm_paddr_t *locked_pa)
 {
 	pt_entry_t *ptep, pte;
 	vm_paddr_t pa;
 	int val;
 	
 	PMAP_LOCK(pmap);
 retry:
 	ptep = pmap_pte(pmap, addr);
 	pte = (ptep != NULL) ? PT_GET(ptep) : 0;
 	pmap_pte_release(ptep);
 	val = 0;
 	if ((pte & PG_V) != 0) {
 		val |= MINCORE_INCORE;
 		if ((pte & (PG_M | PG_RW)) == (PG_M | PG_RW))
 			val |= MINCORE_MODIFIED | MINCORE_MODIFIED_OTHER;
 		if ((pte & PG_A) != 0)
 			val |= MINCORE_REFERENCED | MINCORE_REFERENCED_OTHER;
 	}
 	if ((val & (MINCORE_MODIFIED_OTHER | MINCORE_REFERENCED_OTHER)) !=
 	    (MINCORE_MODIFIED_OTHER | MINCORE_REFERENCED_OTHER) &&
 	    (pte & (PG_MANAGED | PG_V)) == (PG_MANAGED | PG_V)) {
 		pa = pte & PG_FRAME;
 		/* Ensure that "PHYS_TO_VM_PAGE(pa)->object" doesn't change. */
 		if (vm_page_pa_tryrelock(pmap, pa, locked_pa))
 			goto retry;
 	} else
 		PA_UNLOCK_COND(*locked_pa);
 	PMAP_UNLOCK(pmap);
 	return (val);
 }
 
 void
 pmap_activate(struct thread *td)
 {
 	pmap_t	pmap, oldpmap;
 	u_int32_t  cr3;
 
 	critical_enter();
 	pmap = vmspace_pmap(td->td_proc->p_vmspace);
 	oldpmap = PCPU_GET(curpmap);
 #if defined(SMP)
-	atomic_clear_int(&oldpmap->pm_active, PCPU_GET(cpumask));
-	atomic_set_int(&pmap->pm_active, PCPU_GET(cpumask));
+	CPU_NAND_ATOMIC(&oldpmap->pm_active, PCPU_PTR(cpumask));
+	CPU_OR_ATOMIC(&pmap->pm_active, PCPU_PTR(cpumask));
 #else
-	oldpmap->pm_active &= ~1;
-	pmap->pm_active |= 1;
+	CPU_NAND(&oldpmap->pm_active, PCPU_PTR(cpumask));
+	CPU_OR(&pmap->pm_active, PCPU_PTR(cpumask));
 #endif
 #ifdef PAE
 	cr3 = vtophys(pmap->pm_pdpt);
 #else
 	cr3 = vtophys(pmap->pm_pdir);
 #endif
 	/*
 	 * pmap_activate is for the current thread on the current cpu
 	 */
 	td->td_pcb->pcb_cr3 = cr3;
 	PT_UPDATES_FLUSH();
 	load_cr3(cr3);
 	PCPU_SET(curpmap, pmap);
 	critical_exit();
 }
 
 void
 pmap_sync_icache(pmap_t pm, vm_offset_t va, vm_size_t sz)
 {
 }
 
 /*
  *	Increase the starting virtual address of the given mapping if a
  *	different alignment might result in more superpage mappings.
  */
 void
 pmap_align_superpage(vm_object_t object, vm_ooffset_t offset,
     vm_offset_t *addr, vm_size_t size)
 {
 	vm_offset_t superpage_offset;
 
 	if (size < NBPDR)
 		return;
 	if (object != NULL && (object->flags & OBJ_COLORED) != 0)
 		offset += ptoa(object->pg_color);
 	superpage_offset = offset & PDRMASK;
 	if (size - ((NBPDR - superpage_offset) & PDRMASK) < NBPDR ||
 	    (*addr & PDRMASK) == superpage_offset)
 		return;
 	if ((*addr & PDRMASK) < superpage_offset)
 		*addr = (*addr & ~PDRMASK) + superpage_offset;
 	else
 		*addr = ((*addr + PDRMASK) & ~PDRMASK) + superpage_offset;
 }
 
 void
 pmap_suspend()
 {
 	pmap_t pmap;
 	int i, pdir, offset;
 	vm_paddr_t pdirma;
 	mmu_update_t mu[4];
 
 	/*
 	 * We need to remove the recursive mapping structure from all
 	 * our pmaps so that Xen doesn't get confused when it restores
 	 * the page tables. The recursive map lives at page directory
 	 * index PTDPTDI. We assume that the suspend code has stopped
 	 * the other vcpus (if any).
 	 */
 	LIST_FOREACH(pmap, &allpmaps, pm_list) {
 		for (i = 0; i < 4; i++) {
 			/*
 			 * Figure out which page directory (L2) page
 			 * contains this bit of the recursive map and
 			 * the offset within that page of the map
 			 * entry
 			 */
 			pdir = (PTDPTDI + i) / NPDEPG;
 			offset = (PTDPTDI + i) % NPDEPG;
 			pdirma = pmap->pm_pdpt[pdir] & PG_FRAME;
 			mu[i].ptr = pdirma + offset * sizeof(pd_entry_t);
 			mu[i].val = 0;
 		}
 		HYPERVISOR_mmu_update(mu, 4, NULL, DOMID_SELF);
 	}
 }
 
 void
 pmap_resume()
 {
 	pmap_t pmap;
 	int i, pdir, offset;
 	vm_paddr_t pdirma;
 	mmu_update_t mu[4];
 
 	/*
 	 * Restore the recursive map that we removed on suspend.
 	 */
 	LIST_FOREACH(pmap, &allpmaps, pm_list) {
 		for (i = 0; i < 4; i++) {
 			/*
 			 * Figure out which page directory (L2) page
 			 * contains this bit of the recursive map and
 			 * the offset within that page of the map
 			 * entry
 			 */
 			pdir = (PTDPTDI + i) / NPDEPG;
 			offset = (PTDPTDI + i) % NPDEPG;
 			pdirma = pmap->pm_pdpt[pdir] & PG_FRAME;
 			mu[i].ptr = pdirma + offset * sizeof(pd_entry_t);
 			mu[i].val = (pmap->pm_pdpt[i] & PG_FRAME) | PG_V;
 		}
 		HYPERVISOR_mmu_update(mu, 4, NULL, DOMID_SELF);
 	}
 }
 
 #if defined(PMAP_DEBUG)
 pmap_pid_dump(int pid)
 {
 	pmap_t pmap;
 	struct proc *p;
 	int npte = 0;
 	int index;
 
 	sx_slock(&allproc_lock);
 	FOREACH_PROC_IN_SYSTEM(p) {
 		if (p->p_pid != pid)
 			continue;
 
 		if (p->p_vmspace) {
 			int i,j;
 			index = 0;
 			pmap = vmspace_pmap(p->p_vmspace);
 			for (i = 0; i < NPDEPTD; i++) {
 				pd_entry_t *pde;
 				pt_entry_t *pte;
 				vm_offset_t base = i << PDRSHIFT;
 				
 				pde = &pmap->pm_pdir[i];
 				if (pde && pmap_pde_v(pde)) {
 					for (j = 0; j < NPTEPG; j++) {
 						vm_offset_t va = base + (j << PAGE_SHIFT);
 						if (va >= (vm_offset_t) VM_MIN_KERNEL_ADDRESS) {
 							if (index) {
 								index = 0;
 								printf("\n");
 							}
 							sx_sunlock(&allproc_lock);
 							return npte;
 						}
 						pte = pmap_pte(pmap, va);
 						if (pte && pmap_pte_v(pte)) {
 							pt_entry_t pa;
 							vm_page_t m;
 							pa = PT_GET(pte);
 							m = PHYS_TO_VM_PAGE(pa & PG_FRAME);
 							printf("va: 0x%x, pt: 0x%x, h: %d, w: %d, f: 0x%x",
 								va, pa, m->hold_count, m->wire_count, m->flags);
 							npte++;
 							index++;
 							if (index >= 2) {
 								index = 0;
 								printf("\n");
 							} else {
 								printf(" ");
 							}
 						}
 					}
 				}
 			}
 		}
 	}
 	sx_sunlock(&allproc_lock);
 	return npte;
 }
 #endif
 
 #if defined(DEBUG)
 
 static void	pads(pmap_t pm);
 void		pmap_pvdump(vm_paddr_t pa);
 
 /* print address space of pmap*/
 static void
 pads(pmap_t pm)
 {
 	int i, j;
 	vm_paddr_t va;
 	pt_entry_t *ptep;
 
 	if (pm == kernel_pmap)
 		return;
 	for (i = 0; i < NPDEPTD; i++)
 		if (pm->pm_pdir[i])
 			for (j = 0; j < NPTEPG; j++) {
 				va = (i << PDRSHIFT) + (j << PAGE_SHIFT);
 				if (pm == kernel_pmap && va < KERNBASE)
 					continue;
 				if (pm != kernel_pmap && va > UPT_MAX_ADDRESS)
 					continue;
 				ptep = pmap_pte(pm, va);
 				if (pmap_pte_v(ptep))
 					printf("%x:%x ", va, *ptep);
 			};
 
 }
 
 void
 pmap_pvdump(vm_paddr_t pa)
 {
 	pv_entry_t pv;
 	pmap_t pmap;
 	vm_page_t m;
 
 	printf("pa %x", pa);
 	m = PHYS_TO_VM_PAGE(pa);
 	TAILQ_FOREACH(pv, &m->md.pv_list, pv_list) {
 		pmap = PV_PMAP(pv);
 		printf(" -> pmap %p, va %x", (void *)pmap, pv->pv_va);
 		pads(pmap);
 	}
 	printf(" ");
 }
 #endif
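Reviewer note on the pmap_activate() hunk above: the two removed atomic_clear_int()/atomic_set_int() calls acted on a single integer mask, while the replacements take pointers to whole sets. The following is a minimal userland sketch of how per-word atomic set operations of this kind could look, assuming C11 <stdatomic.h>. Every name in it (sk_aset, sk_set, sk_or_atomic, sk_nand_atomic, SK_MAXCPU, and so on) is hypothetical and chosen for illustration; it is not the kernel's definition of CPU_OR_ATOMIC or CPU_NAND_ATOMIC.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-ins; not the kernel's cpuset_t or its macros. */
#define	SK_MAXCPU	128
#define	SK_WBITS	(sizeof(unsigned long) * CHAR_BIT)
#define	SK_NWORDS	((SK_MAXCPU + SK_WBITS - 1) / SK_WBITS)

/* A set that other CPUs may update concurrently. */
struct sk_aset {
	_Atomic unsigned long bits[SK_NWORDS];
};

/* A private set, such as a per-CPU "this CPU only" mask. */
struct sk_set {
	unsigned long bits[SK_NWORDS];
};

/* Make 's' the set containing only 'cpu'. */
static void
sk_setof(struct sk_set *s, unsigned int cpu)
{
	size_t i;

	assert(cpu < SK_MAXCPU);
	for (i = 0; i < SK_NWORDS; i++)
		s->bits[i] = 0;
	s->bits[cpu / SK_WBITS] = 1UL << (cpu % SK_WBITS);
}

/* dst |= src, one atomic read-modify-write per word. */
static void
sk_or_atomic(struct sk_aset *dst, const struct sk_set *src)
{
	size_t i;

	for (i = 0; i < SK_NWORDS; i++)
		atomic_fetch_or(&dst->bits[i], src->bits[i]);
}

/* dst &= ~src, one atomic read-modify-write per word. */
static void
sk_nand_atomic(struct sk_aset *dst, const struct sk_set *src)
{
	size_t i;

	for (i = 0; i < SK_NWORDS; i++)
		atomic_fetch_and(&dst->bits[i], ~src->bits[i]);
}

/* Test one bit with an atomic load of the containing word. */
static int
sk_isset_atomic(struct sk_aset *s, unsigned int cpu)
{

	assert(cpu < SK_MAXCPU);
	return ((atomic_load(&s->bits[cpu / SK_WBITS]) >>
	    (cpu % SK_WBITS)) & 1UL) != 0;
}
```

In this sketch each word is updated atomically, but the set as a whole is not. That is enough for the activate pattern, where the source mask has a single bit set and therefore only one word of the destination actually changes.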
Index: head/sys/ia64/ia64/mp_machdep.c
===================================================================
--- head/sys/ia64/ia64/mp_machdep.c	(revision 222812)
+++ head/sys/ia64/ia64/mp_machdep.c	(revision 222813)
@@ -1,513 +1,514 @@
 /*-
  * Copyright (c) 2001-2005 Marcel Moolenaar
  * Copyright (c) 2000 Doug Rabson
  * All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
  * 1. Redistributions of source code must retain the above copyright
  *    notice, this list of conditions and the following disclaimer.
  * 2. Redistributions in binary form must reproduce the above copyright
  *    notice, this list of conditions and the following disclaimer in the
  *    documentation and/or other materials provided with the distribution.
  *
  * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
  * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
  * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
  * ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
  * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
  * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
  * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
  * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
  * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
  * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
  * SUCH DAMAGE.
  */
 
 #include <sys/cdefs.h>
 __FBSDID("$FreeBSD$");
 
 #include "opt_kstack_pages.h"
 
 #include <sys/param.h>
 #include <sys/systm.h>
 #include <sys/ktr.h>
 #include <sys/proc.h>
 #include <sys/bus.h>
 #include <sys/kthread.h>
 #include <sys/lock.h>
 #include <sys/malloc.h>
 #include <sys/mutex.h>
 #include <sys/kernel.h>
 #include <sys/pcpu.h>
 #include <sys/sched.h>
 #include <sys/smp.h>
 #include <sys/sysctl.h>
 #include <sys/uuid.h>
 
 #include <machine/atomic.h>
 #include <machine/bootinfo.h>
 #include <machine/cpu.h>
 #include <machine/fpu.h>
 #include <machine/intr.h>
 #include <machine/mca.h>
 #include <machine/md_var.h>
 #include <machine/pal.h>
 #include <machine/pcb.h>
 #include <machine/sal.h>
 #include <machine/smp.h>
 
 #include <vm/vm.h>
 #include <vm/pmap.h>
 #include <vm/vm_extern.h>
 #include <vm/vm_kern.h>
 
 extern uint64_t bdata[];
 
 MALLOC_DEFINE(M_SMP, "SMP", "SMP related allocations");
 
 void ia64_ap_startup(void);
 
 #define	SAPIC_ID_GET_ID(x)	((u_int)((x) >> 8) & 0xff)
 #define	SAPIC_ID_GET_EID(x)	((u_int)(x) & 0xff)
 #define	SAPIC_ID_SET(id, eid)	((u_int)(((id) & 0xff) << 8) | ((eid) & 0xff))
 
 /* State used to wake and bootstrap APs. */
 struct ia64_ap_state ia64_ap_state;
 
 int ia64_ipi_ast;
 int ia64_ipi_highfp;
 int ia64_ipi_nmi;
 int ia64_ipi_preempt;
 int ia64_ipi_rndzvs;
 int ia64_ipi_stop;
 
 static u_int
 sz2shft(uint64_t sz)
 {
 	uint64_t s;
 	u_int shft;
 
 	shft = 12;      /* Start with 4K */
 	s = 1 << shft;
 	while (s < sz) {
 		shft++;
 		s <<= 1;
 	}
 	return (shft);
 }
 
 static u_int
 ia64_ih_ast(struct thread *td, u_int xiv, struct trapframe *tf)
 {
 
 	PCPU_INC(md.stats.pcs_nasts);
 	CTR1(KTR_SMP, "IPI_AST, cpuid=%d", PCPU_GET(cpuid));
 	return (0);
 }
 
 static u_int
 ia64_ih_highfp(struct thread *td, u_int xiv, struct trapframe *tf)
 {
 
 	PCPU_INC(md.stats.pcs_nhighfps);
 	ia64_highfp_save_ipi();
 	return (0);
 }
 
 static u_int
 ia64_ih_preempt(struct thread *td, u_int xiv, struct trapframe *tf)
 {
 
 	PCPU_INC(md.stats.pcs_npreempts);
 	CTR1(KTR_SMP, "IPI_PREEMPT, cpuid=%d", PCPU_GET(cpuid));
 	sched_preempt(curthread);
 	return (0);
 }
 
 static u_int
 ia64_ih_rndzvs(struct thread *td, u_int xiv, struct trapframe *tf)
 {
 
 	PCPU_INC(md.stats.pcs_nrdvs);
 	CTR1(KTR_SMP, "IPI_RENDEZVOUS, cpuid=%d", PCPU_GET(cpuid));
 	smp_rendezvous_action();
 	return (0);
 }
 
 static u_int
 ia64_ih_stop(struct thread *td, u_int xiv, struct trapframe *tf)
 {
-	cpumask_t mybit;
+	cpuset_t mybit;
 
 	PCPU_INC(md.stats.pcs_nstops);
 	mybit = PCPU_GET(cpumask);
 
 	savectx(PCPU_PTR(md.pcb));
 
-	atomic_set_int(&stopped_cpus, mybit);
-	while ((started_cpus & mybit) == 0)
+	CPU_OR_ATOMIC(&stopped_cpus, &mybit);
+	while (!CPU_OVERLAP(&started_cpus, &mybit))
 		cpu_spinwait();
-	atomic_clear_int(&started_cpus, mybit);
-	atomic_clear_int(&stopped_cpus, mybit);
+	CPU_NAND_ATOMIC(&started_cpus, &mybit);
+	CPU_NAND_ATOMIC(&stopped_cpus, &mybit);
 	return (0);
 }
 
 struct cpu_group *
 cpu_topo(void)
 {
 
 	return smp_topo_none();
 }
 
 static void
 ia64_store_mca_state(void* arg)
 {
 	struct pcpu *pc = arg;
 	struct thread *td = curthread;
 
 	/*
 	 * ia64_mca_save_state() is CPU-sensitive, so bind ourself to our
 	 * target CPU.
 	 */
 	thread_lock(td);
 	sched_bind(td, pc->pc_cpuid);
 	thread_unlock(td);
 
 	ia64_mca_init_ap();
 
 	/*
 	 * Get and save the CPU specific MCA records. Should we get the
 	 * MCA state for each processor, or just the CMC state?
 	 */
 	ia64_mca_save_state(SAL_INFO_MCA);
 	ia64_mca_save_state(SAL_INFO_CMC);
 
 	kproc_exit(0);
 }
 
 void
 ia64_ap_startup(void)
 {
 	uint64_t vhpt;
 
 	ia64_ap_state.as_trace = 0x100;
 
 	ia64_set_rr(IA64_RR_BASE(5), (5 << 8) | (PAGE_SHIFT << 2) | 1);
 	ia64_set_rr(IA64_RR_BASE(6), (6 << 8) | (PAGE_SHIFT << 2));
 	ia64_set_rr(IA64_RR_BASE(7), (7 << 8) | (PAGE_SHIFT << 2));
 	ia64_srlz_d();
 
 	pcpup = ia64_ap_state.as_pcpu;
 	ia64_set_k4((intptr_t)pcpup);
 
 	ia64_ap_state.as_trace = 0x108;
 
 	vhpt = PCPU_GET(md.vhpt);
 	map_vhpt(vhpt);
 	ia64_set_pta(vhpt + (1 << 8) + (pmap_vhpt_log2size << 2) + 1);
 	ia64_srlz_i();
 
 	ia64_ap_state.as_trace = 0x110;
 
 	ia64_ap_state.as_awake = 1;
 	ia64_ap_state.as_delay = 0;
 
 	map_pal_code();
 	map_gateway_page();
 
 	ia64_set_fpsr(IA64_FPSR_DEFAULT);
 
 	/* Wait until it's time for us to be unleashed */
 	while (ia64_ap_state.as_spin)
 		cpu_spinwait();
 
 	/* Initialize curthread. */
 	KASSERT(PCPU_GET(idlethread) != NULL, ("no idle thread"));
 	PCPU_SET(curthread, PCPU_GET(idlethread));
 
 	atomic_add_int(&ia64_ap_state.as_awake, 1);
 	while (!smp_started)
 		cpu_spinwait();
 
 	CTR1(KTR_SMP, "SMP: cpu%d launched", PCPU_GET(cpuid));
 
 	/* Mask interval timer interrupts on APs. */
 	ia64_set_itv(0x10000);
 	ia64_set_tpr(0);
 	ia64_srlz_d();
 	ia64_enable_intr();
 
 	sched_throw(NULL);
 	/* NOTREACHED */
 }
 
 void
 cpu_mp_setmaxid(void)
 {
 
 	/*
 	 * Count the number of processors in the system by walking the ACPI
 	 * tables. Note that we record the actual number of processors, even
 	 * if this is larger than MAXCPU. We only activate MAXCPU processors.
 	 */
 	mp_ncpus = ia64_count_cpus();
 
 	/*
 	 * Set the largest cpuid we're going to use. This is necessary for
 	 * VM initialization.
 	 */
 	mp_maxid = min(mp_ncpus, MAXCPU) - 1;
 }
 
 int
 cpu_mp_probe(void)
 {
 
 	/*
 	 * If there's only 1 processor, or we don't have a wake-up vector,
 	 * we're not going to enable SMP. Note that no wake-up vector can
 	 * also mean that the wake-up mechanism is not supported. In this
 	 * case we can have multiple processors, but we simply can't wake
 	 * them up...
 	 */
 	return (mp_ncpus > 1 && ia64_ipi_wakeup != 0);
 }
 
 void
 cpu_mp_add(u_int acpi_id, u_int id, u_int eid)
 {
 	struct pcpu *pc;
 	void *dpcpu;
 	u_int cpuid, sapic_id;
 
 	sapic_id = SAPIC_ID_SET(id, eid);
 	cpuid = (IA64_LID_GET_SAPIC_ID(ia64_get_lid()) == sapic_id)
 	    ? 0 : smp_cpus++;
 
-	KASSERT((all_cpus & (1UL << cpuid)) == 0,
+	KASSERT(!CPU_ISSET(cpuid, &all_cpus),
 	    ("%s: cpu%d already in CPU map", __func__, acpi_id));
 
 	if (cpuid != 0) {
 		pc = (struct pcpu *)malloc(sizeof(*pc), M_SMP, M_WAITOK);
 		pcpu_init(pc, cpuid, sizeof(*pc));
 		dpcpu = (void *)kmem_alloc(kernel_map, DPCPU_SIZE);
 		dpcpu_init(dpcpu, cpuid);
 	} else
 		pc = pcpup;
 
 	pc->pc_acpi_id = acpi_id;
 	pc->pc_md.lid = IA64_LID_SET_SAPIC_ID(sapic_id);
 
-	all_cpus |= (1UL << pc->pc_cpuid);
+	CPU_SET(pc->pc_cpuid, &all_cpus);
 }
 
 void
 cpu_mp_announce()
 {
 	struct pcpu *pc;
 	uint32_t sapic_id;
 	int i;
 
 	for (i = 0; i <= mp_maxid; i++) {
 		pc = pcpu_find(i);
 		if (pc != NULL) {
 			sapic_id = IA64_LID_GET_SAPIC_ID(pc->pc_md.lid);
 			printf("cpu%d: ACPI Id=%x, SAPIC Id=%x, SAPIC Eid=%x",
 			    i, pc->pc_acpi_id, SAPIC_ID_GET_ID(sapic_id),
 			    SAPIC_ID_GET_EID(sapic_id));
 			if (i == 0)
 				printf(" (BSP)\n");
 			else
 				printf("\n");
 		}
 	}
 }
 
 void
 cpu_mp_start()
 {
 	struct ia64_sal_result result;
 	struct ia64_fdesc *fd;
 	struct pcpu *pc;
 	uintptr_t state;
 	u_char *stp;
 
 	state = ia64_tpa((uintptr_t)&ia64_ap_state);
 	fd = (struct ia64_fdesc *) os_boot_rendez;
 	result = ia64_sal_entry(SAL_SET_VECTORS, SAL_OS_BOOT_RENDEZ,
 	    ia64_tpa(fd->func), state, 0, 0, 0, 0);
 
 	ia64_ap_state.as_pgtbl_pte = PTE_PRESENT | PTE_MA_WB |
 	    PTE_ACCESSED | PTE_DIRTY | PTE_PL_KERN | PTE_AR_RW |
 	    (bootinfo->bi_pbvm_pgtbl & PTE_PPN_MASK);
 	ia64_ap_state.as_pgtbl_itir = sz2shft(bootinfo->bi_pbvm_pgtblsz) << 2;
 	ia64_ap_state.as_text_va = IA64_PBVM_BASE;
 	ia64_ap_state.as_text_pte = PTE_PRESENT | PTE_MA_WB |
 	    PTE_ACCESSED | PTE_DIRTY | PTE_PL_KERN | PTE_AR_RX |
 	    (ia64_tpa(IA64_PBVM_BASE) & PTE_PPN_MASK);
 	ia64_ap_state.as_text_itir = bootinfo->bi_text_mapped << 2;
 	ia64_ap_state.as_data_va = (uintptr_t)bdata;
 	ia64_ap_state.as_data_pte = PTE_PRESENT | PTE_MA_WB |
 	    PTE_ACCESSED | PTE_DIRTY | PTE_PL_KERN | PTE_AR_RW |
 	    (ia64_tpa((uintptr_t)bdata) & PTE_PPN_MASK);
 	ia64_ap_state.as_data_itir = bootinfo->bi_data_mapped << 2;
 
 	/* Keep 'em spinning until we unleash them... */
 	ia64_ap_state.as_spin = 1;
 
 	STAILQ_FOREACH(pc, &cpuhead, pc_allcpu) {
 		pc->pc_md.current_pmap = kernel_pmap;
-		pc->pc_other_cpus = all_cpus & ~pc->pc_cpumask;
+		pc->pc_other_cpus = all_cpus;
+		CPU_NAND(&pc->pc_other_cpus, &pc->pc_cpumask);
 		/* The BSP is obviously running already. */
 		if (pc->pc_cpuid == 0) {
 			pc->pc_md.awake = 1;
 			continue;
 		}
 
 		ia64_ap_state.as_pcpu = pc;
 		pc->pc_md.vhpt = pmap_alloc_vhpt();
 		if (pc->pc_md.vhpt == 0) {
 			printf("SMP: WARNING: unable to allocate VHPT"
 			    " for cpu%d", pc->pc_cpuid);
 			continue;
 		}
 
 		stp = malloc(KSTACK_PAGES * PAGE_SIZE, M_SMP, M_WAITOK);
 		ia64_ap_state.as_kstack = stp;
 		ia64_ap_state.as_kstack_top = stp + KSTACK_PAGES * PAGE_SIZE;
 
 		ia64_ap_state.as_trace = 0;
 		ia64_ap_state.as_delay = 2000;
 		ia64_ap_state.as_awake = 0;
 
 		if (bootverbose)
 			printf("SMP: waking up cpu%d\n", pc->pc_cpuid);
 
 		/* Here she goes... */
 		ipi_send(pc, ia64_ipi_wakeup);
 		do {
 			DELAY(1000);
 		} while (--ia64_ap_state.as_delay > 0);
 
 		pc->pc_md.awake = ia64_ap_state.as_awake;
 
 		if (!ia64_ap_state.as_awake) {
 			printf("SMP: WARNING: cpu%d did not wake up (code "
 			    "%#lx)\n", pc->pc_cpuid,
 			    ia64_ap_state.as_trace - state);
 		}
 	}
 }
 
 static void
 cpu_mp_unleash(void *dummy)
 {
 	struct pcpu *pc;
 	int cpus;
 
 	if (mp_ncpus <= 1)
 		return;
 
 	/* Allocate XIVs for IPIs */
 	ia64_ipi_ast = ia64_xiv_alloc(PI_DULL, IA64_XIV_IPI, ia64_ih_ast);
 	ia64_ipi_highfp = ia64_xiv_alloc(PI_AV, IA64_XIV_IPI, ia64_ih_highfp);
 	ia64_ipi_preempt = ia64_xiv_alloc(PI_SOFT, IA64_XIV_IPI,
 	    ia64_ih_preempt);
 	ia64_ipi_rndzvs = ia64_xiv_alloc(PI_AV, IA64_XIV_IPI, ia64_ih_rndzvs);
 	ia64_ipi_stop = ia64_xiv_alloc(PI_REALTIME, IA64_XIV_IPI, ia64_ih_stop);
 
 	/* Reserve the NMI vector for IPI_STOP_HARD if possible */
 	ia64_ipi_nmi = (ia64_xiv_reserve(2, IA64_XIV_IPI, ia64_ih_stop) != 0)
 	    ? ia64_ipi_stop : 0x400;	/* DM=NMI, Vector=n/a */
 
 	cpus = 0;
 	smp_cpus = 0;
 	STAILQ_FOREACH(pc, &cpuhead, pc_allcpu) {
 		cpus++;
 		if (pc->pc_md.awake) {
 			kproc_create(ia64_store_mca_state, pc, NULL, 0, 0,
 			    "mca %u", pc->pc_cpuid);
 			smp_cpus++;
 		}
 	}
 
 	ia64_ap_state.as_awake = 1;
 	ia64_ap_state.as_spin = 0;
 
 	while (ia64_ap_state.as_awake != smp_cpus)
 		cpu_spinwait();
 
 	if (smp_cpus != cpus || cpus != mp_ncpus) {
 		printf("SMP: %d CPUs found; %d CPUs usable; %d CPUs woken\n",
 		    mp_ncpus, cpus, smp_cpus);
 	}
 
 	smp_active = 1;
 	smp_started = 1;
 
 	/*
 	 * Now that all CPUs are up and running, bind interrupts to each of
 	 * them.
 	 */
 	ia64_bind_intr();
 }
 
 /*
  * send an IPI to a set of cpus.
  */
 void
-ipi_selected(cpumask_t cpus, int ipi)
+ipi_selected(cpuset_t cpus, int ipi)
 {
 	struct pcpu *pc;
 
 	STAILQ_FOREACH(pc, &cpuhead, pc_allcpu) {
-		if (cpus & pc->pc_cpumask)
+		if (CPU_OVERLAP(&cpus, &pc->pc_cpumask))
 			ipi_send(pc, ipi);
 	}
 }
 
 /*
  * send an IPI to a specific CPU.
  */
 void
 ipi_cpu(int cpu, u_int ipi)
 {
 
 	ipi_send(cpuid_to_pcpu[cpu], ipi);
 }
 
 /*
  * send an IPI to all CPUs EXCEPT myself.
  */
 void
 ipi_all_but_self(int ipi)
 {
 	struct pcpu *pc;
 
 	STAILQ_FOREACH(pc, &cpuhead, pc_allcpu) {
 		if (pc != pcpup)
 			ipi_send(pc, ipi);
 	}
 }
 
 /*
  * Send an IPI to the specified processor.
  */
 void
 ipi_send(struct pcpu *cpu, int xiv)
 {
 	u_int sapic_id;
 
 	KASSERT(xiv != 0, ("ipi_send"));
 
 	sapic_id = IA64_LID_GET_SAPIC_ID(cpu->pc_md.lid);
 
 	ia64_mf();
 	ia64_st8(&(ia64_pib->ib_ipi[sapic_id][0]), xiv);
 	ia64_mf_a();
 	CTR3(KTR_SMP, "ipi_send(%p, %d): cpuid=%d", cpu, xiv, PCPU_GET(cpuid));
 }
 
 SYSINIT(start_aps, SI_SUB_SMP, SI_ORDER_FIRST, cpu_mp_unleash, NULL);
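Reviewer note on the mp_machdep.c changes above: the removed expressions (`1UL << cpuid`, `all_cpus & ~pc->pc_cpumask`, `cpus & pc->pc_cpumask`) only work while the mask fits in one machine word. The replacements express the same membership, removal, and intersection tests on a multi-word set. The sketch below shows those operations on a plain array of words. All names (ms_cpuset, ms_add, ms_has, ms_nand, ms_overlap, MS_MAXCPU) are hypothetical userland stand-ins for illustration, not the kernel's CPU_SET, CPU_ISSET, CPU_NAND, or CPU_OVERLAP macros.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins; not the kernel's cpuset_t or its macros. */
#define	MS_MAXCPU	128
#define	MS_WBITS	(sizeof(unsigned long) * CHAR_BIT)
#define	MS_NWORDS	((MS_MAXCPU + MS_WBITS - 1) / MS_WBITS)

struct ms_cpuset {
	unsigned long bits[MS_NWORDS];
};

/* Empty the set. */
static void
ms_zero(struct ms_cpuset *s)
{

	memset(s, 0, sizeof(*s));
}

/* Add one CPU; replaces "mask |= 1UL << cpu". */
static void
ms_add(struct ms_cpuset *s, unsigned int cpu)
{

	assert(cpu < MS_MAXCPU);
	s->bits[cpu / MS_WBITS] |= 1UL << (cpu % MS_WBITS);
}

/* Membership test; replaces "(mask & (1UL << cpu)) != 0". */
static int
ms_has(const struct ms_cpuset *s, unsigned int cpu)
{

	assert(cpu < MS_MAXCPU);
	return ((s->bits[cpu / MS_WBITS] >> (cpu % MS_WBITS)) & 1UL) != 0;
}

/* dst &= ~src; replaces "dst = dst & ~src". */
static void
ms_nand(struct ms_cpuset *dst, const struct ms_cpuset *src)
{
	size_t i;

	for (i = 0; i < MS_NWORDS; i++)
		dst->bits[i] &= ~src->bits[i];
}

/* Non-empty intersection; replaces "(a & b) != 0". */
static int
ms_overlap(const struct ms_cpuset *a, const struct ms_cpuset *b)
{
	size_t i;

	for (i = 0; i < MS_NWORDS; i++)
		if ((a->bits[i] & b->bits[i]) != 0)
			return (1);
	return (0);
}
```

With these helpers, the "other CPUs" computation from cpu_mp_start() becomes a struct copy followed by ms_nand() with the CPU's own single-bit set, which is the same two-step shape the patch uses.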
Index: head/sys/ia64/include/_types.h
===================================================================
--- head/sys/ia64/include/_types.h	(revision 222812)
+++ head/sys/ia64/include/_types.h	(revision 222813)
@@ -1,119 +1,118 @@
 /*-
  * Copyright (c) 2002 Mike Barcroft <mike@FreeBSD.org>
  * Copyright (c) 1990, 1993
  *	The Regents of the University of California.  All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
  * 1. Redistributions of source code must retain the above copyright
  *    notice, this list of conditions and the following disclaimer.
  * 2. Redistributions in binary form must reproduce the above copyright
  *    notice, this list of conditions and the following disclaimer in the
  *    documentation and/or other materials provided with the distribution.
  * 3. All advertising materials mentioning features or use of this software
  *    must display the following acknowledgement:
  *	This product includes software developed by the University of
  *	California, Berkeley and its contributors.
  * 4. Neither the name of the University nor the names of its contributors
  *    may be used to endorse or promote products derived from this software
  *    without specific prior written permission.
  *
  * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
  * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
  * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
  * ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
  * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
  * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
  * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
  * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
  * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
  * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
  * SUCH DAMAGE.
  *
  *	From: @(#)ansi.h	8.2 (Berkeley) 1/4/94
  *	From: @(#)types.h	8.3 (Berkeley) 1/5/94
  * $FreeBSD$
  */
 
 #ifndef _MACHINE__TYPES_H_
 #define	_MACHINE__TYPES_H_
 
 #ifndef _SYS_CDEFS_H_
 #error this file needs sys/cdefs.h as a prerequisite
 #endif
 
 /*
  * Basic types upon which most other types are built.
  */
 typedef	__signed char		__int8_t;
 typedef	unsigned char		__uint8_t;
 typedef	short			__int16_t;
 typedef	unsigned short		__uint16_t;
 typedef	int			__int32_t;
 typedef	unsigned int		__uint32_t;
 typedef	long			__int64_t;
 typedef	unsigned long		__uint64_t;
 
 /*
  * Standard type definitions.
  */
 typedef	__int32_t	__clock_t;		/* clock()... */
-typedef	unsigned int	__cpumask_t;
 typedef	__int64_t	__critical_t;
 typedef	double		__double_t;
 typedef	float		__float_t;
 typedef	__int64_t	__intfptr_t;
 typedef	__int64_t	__intmax_t;
 typedef	__int64_t	__intptr_t;
 typedef	__int32_t	__int_fast8_t;
 typedef	__int32_t	__int_fast16_t;
 typedef	__int32_t	__int_fast32_t;
 typedef	__int64_t	__int_fast64_t;
 typedef	__int8_t	__int_least8_t;
 typedef	__int16_t	__int_least16_t;
 typedef	__int32_t	__int_least32_t;
 typedef	__int64_t	__int_least64_t;
 typedef	__int64_t	__ptrdiff_t;		/* ptr1 - ptr2 */
 typedef	__int64_t	__register_t;
 typedef	__int64_t	__segsz_t;		/* segment size (in pages) */
 typedef	__uint64_t	__size_t;		/* sizeof() */
 typedef	__int64_t	__ssize_t;		/* byte count or error */
 typedef	__int64_t	__time_t;		/* time()... */
 typedef	__uint64_t	__uintfptr_t;
 typedef	__uint64_t	__uintmax_t;
 typedef	__uint64_t	__uintptr_t;
 typedef	__uint32_t	__uint_fast8_t;
 typedef	__uint32_t	__uint_fast16_t;
 typedef	__uint32_t	__uint_fast32_t;
 typedef	__uint64_t	__uint_fast64_t;
 typedef	__uint8_t	__uint_least8_t;
 typedef	__uint16_t	__uint_least16_t;
 typedef	__uint32_t	__uint_least32_t;
 typedef	__uint64_t	__uint_least64_t;
 typedef	__uint64_t	__u_register_t;
 typedef	__uint64_t	__vm_offset_t;
 typedef	__int64_t	__vm_ooffset_t;
 typedef	__uint64_t	__vm_paddr_t;
 typedef	__uint64_t	__vm_pindex_t;
 typedef	__uint64_t	__vm_size_t;
 
 /*
  * Unusual type definitions.
  */
 #ifdef __GNUCLIKE_BUILTIN_VARARGS
 typedef __builtin_va_list	__va_list;	/* internally known to gcc */
 #if defined(__GNUC_VA_LIST_COMPATIBILITY) && !defined(__GNUC_VA_LIST) \
     && !defined(__NO_GNUC_VA_LIST)
 #define	__GNUC_VA_LIST
 typedef	__va_list	__gnuc_va_list;		/* compat. with GNU headers */
 #endif
 #else
 #ifdef lint
 typedef char *			__va_list;	/* non-functional */
 #else
 #error Must add va_list support for this non-GCC compiler.   
 #endif /* lint */
 #endif /* __GNUCLIKE_BUILTIN_VARARGS */
 
 #endif /* !_MACHINE__TYPES_H_ */
Index: head/sys/ia64/include/smp.h
===================================================================
--- head/sys/ia64/include/smp.h	(revision 222812)
+++ head/sys/ia64/include/smp.h	(revision 222813)
@@ -1,52 +1,54 @@
 /*
  * $FreeBSD$
  */
 #ifndef _MACHINE_SMP_H_
 #define _MACHINE_SMP_H_
 
 #ifdef _KERNEL
 
 #define	IPI_AST			ia64_ipi_ast
 #define	IPI_PREEMPT		ia64_ipi_preempt
 #define	IPI_RENDEZVOUS		ia64_ipi_rndzvs
 #define	IPI_STOP		ia64_ipi_stop
 #define	IPI_STOP_HARD		ia64_ipi_nmi
 
 #ifndef LOCORE
 
+#include <sys/_cpuset.h>
+
 struct pcpu;
 
 struct ia64_ap_state {
 	uint64_t	as_trace;
 	uint64_t	as_pgtbl_pte;
 	uint64_t	as_pgtbl_itir;
 	uint64_t	as_text_va;
 	uint64_t	as_text_pte;
 	uint64_t	as_text_itir;
 	uint64_t	as_data_va;
 	uint64_t	as_data_pte;
 	uint64_t	as_data_itir;
 	void		*as_kstack;
 	void		*as_kstack_top;
 	struct pcpu	*as_pcpu;
 	volatile int	as_delay;
 	volatile u_int	as_awake;
 	volatile u_int	as_spin;
 };
 
 extern int ia64_ipi_ast;
 extern int ia64_ipi_highfp;
 extern int ia64_ipi_nmi;
 extern int ia64_ipi_preempt;
 extern int ia64_ipi_rndzvs;
 extern int ia64_ipi_stop;
 extern int ia64_ipi_wakeup;
 
 void	ipi_all_but_self(int ipi);
 void	ipi_cpu(int cpu, u_int ipi);
-void	ipi_selected(cpumask_t cpus, int ipi);
+void	ipi_selected(cpuset_t cpus, int ipi);
 void	ipi_send(struct pcpu *, int ipi);
 
 #endif /* !LOCORE */
 #endif /* _KERNEL */
 #endif /* !_MACHINE_SMP_H */
Index: head/sys/kern/kern_cpuset.c
===================================================================
--- head/sys/kern/kern_cpuset.c	(revision 222812)
+++ head/sys/kern/kern_cpuset.c	(revision 222813)
@@ -1,1097 +1,1173 @@
 /*-
  * Copyright (c) 2008,  Jeffrey Roberson <jeff@freebsd.org>
  * All rights reserved.
  * 
  * Copyright (c) 2008 Nokia Corporation
  * All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
  * 1. Redistributions of source code must retain the above copyright
  *    notice unmodified, this list of conditions, and the following
  *    disclaimer.
  * 2. Redistributions in binary form must reproduce the above copyright
  *    notice, this list of conditions and the following disclaimer in the
  *    documentation and/or other materials provided with the distribution.
  *
  * THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR
  * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
  * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
  * IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT,
  * INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT
  * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
  * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
  * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
  * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF
  * THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
  *
  */
 
 #include <sys/cdefs.h>
 __FBSDID("$FreeBSD$");
 
 #include "opt_ddb.h"
 
 #include <sys/param.h>
 #include <sys/systm.h>
 #include <sys/sysproto.h>
 #include <sys/jail.h>
 #include <sys/kernel.h>
 #include <sys/lock.h>
 #include <sys/malloc.h>
 #include <sys/mutex.h>
 #include <sys/priv.h>
 #include <sys/proc.h>
 #include <sys/refcount.h>
 #include <sys/sched.h>
 #include <sys/smp.h>
 #include <sys/syscallsubr.h>
 #include <sys/cpuset.h>
 #include <sys/sx.h>
 #include <sys/queue.h>
+#include <sys/libkern.h>
 #include <sys/limits.h>
 #include <sys/bus.h>
 #include <sys/interrupt.h>
 
 #include <vm/uma.h>
 
 #ifdef DDB
 #include <ddb/ddb.h>
 #endif /* DDB */
 
 /*
  * cpusets provide a mechanism for creating and manipulating sets of
  * processors for the purpose of constraining the scheduling of threads to
  * specific processors.
  *
  * Each process belongs to an identified set, by default this is set 1.  Each
  * thread may further restrict the cpus it may run on to a subset of this
  * named set.  This creates an anonymous set which other threads and processes
  * may not join by number.
  *
  * The named set is referred to herein as the 'base' set to avoid ambiguity.
  * This set is usually a child of a 'root' set while the anonymous set may
  * simply be referred to as a mask.  In the syscall api these are referred to
  * as the ROOT, CPUSET, and MASK levels where CPUSET is called 'base' here.
  *
  * Threads inherit their set from their creator whether it be anonymous or
  * not.  This means that anonymous sets are immutable because they may be
  * shared.  To modify an anonymous set a new set is created with the desired
  * mask and the same parent as the existing anonymous set.  This gives the
  * illusion of each thread having a private mask.
  *
  * Via the syscall apis a user may ask to retrieve or modify the root, base,
  * or mask that is discovered via a pid, tid, or setid.  Modifying a set
  * modifies all numbered and anonymous child sets to comply with the new mask.
  * Modifying a pid or tid's mask applies only to that tid but must still
  * exist within the assigned parent set.
  *
  * A thread may not be assigned to a group separate from other threads in
  * the process.  This is to remove ambiguity when the setid is queried with
  * a pid argument.  There is no other technical limitation.
  *
  * This somewhat complex arrangement is intended to make it easy for
  * applications to query available processors and bind their threads to
  * specific processors while also allowing administrators to dynamically
  * reprovision by changing sets which apply to groups of processes.
  *
  * A simple application should not concern itself with sets at all and
  * rather apply masks to its own threads via CPU_WHICH_TID and a -1 id
  * meaning 'curthread'.  It may query available cpus for that tid with a
  * getaffinity call using (CPU_LEVEL_CPUSET, CPU_WHICH_PID, -1, ...).
  */
 static uma_zone_t cpuset_zone;
 static struct mtx cpuset_lock;
 static struct setlist cpuset_ids;
 static struct unrhdr *cpuset_unr;
 static struct cpuset *cpuset_zero;
 
 /* Return the size of cpuset_t at the kernel level */
 SYSCTL_INT(_kern_sched, OID_AUTO, cpusetsize, CTLFLAG_RD,
 	0, sizeof(cpuset_t), "sizeof(cpuset_t)");
 
 cpuset_t *cpuset_root;
 
 /*
  * Acquire a reference to a cpuset, all pointers must be tracked with refs.
  */
 struct cpuset *
 cpuset_ref(struct cpuset *set)
 {
 
 	refcount_acquire(&set->cs_ref);
 	return (set);
 }
 
 /*
  * Walks up the tree from 'set' to find the root.  Returns the root
  * referenced.
  */
 static struct cpuset *
 cpuset_refroot(struct cpuset *set)
 {
 
 	for (; set->cs_parent != NULL; set = set->cs_parent)
 		if (set->cs_flags & CPU_SET_ROOT)
 			break;
 	cpuset_ref(set);
 
 	return (set);
 }
 
 /*
  * Find the first non-anonymous set starting from 'set'.  Returns this set
  * referenced.  May return the passed in set with an extra ref if it is
  * not anonymous. 
  */
 static struct cpuset *
 cpuset_refbase(struct cpuset *set)
 {
 
 	if (set->cs_id == CPUSET_INVALID)
 		set = set->cs_parent;
 	cpuset_ref(set);
 
 	return (set);
 }
 
 /*
  * Release a reference in a context where it is safe to allocate.
  */
 void
 cpuset_rel(struct cpuset *set)
 {
 	cpusetid_t id;
 
 	if (refcount_release(&set->cs_ref) == 0)
 		return;
 	mtx_lock_spin(&cpuset_lock);
 	LIST_REMOVE(set, cs_siblings);
 	id = set->cs_id;
 	if (id != CPUSET_INVALID)
 		LIST_REMOVE(set, cs_link);
 	mtx_unlock_spin(&cpuset_lock);
 	cpuset_rel(set->cs_parent);
 	uma_zfree(cpuset_zone, set);
 	if (id != CPUSET_INVALID)
 		free_unr(cpuset_unr, id);
 }
 
 /*
  * Deferred release must be used when in a context that is not safe to
  * allocate/free.  This places any unreferenced sets on the list 'head'.
  */
 static void
 cpuset_rel_defer(struct setlist *head, struct cpuset *set)
 {
 
 	if (refcount_release(&set->cs_ref) == 0)
 		return;
 	mtx_lock_spin(&cpuset_lock);
 	LIST_REMOVE(set, cs_siblings);
 	if (set->cs_id != CPUSET_INVALID)
 		LIST_REMOVE(set, cs_link);
 	LIST_INSERT_HEAD(head, set, cs_link);
 	mtx_unlock_spin(&cpuset_lock);
 }
 
 /*
  * Complete a deferred release.  Removes the set from the list provided to
  * cpuset_rel_defer.
  */
 static void
 cpuset_rel_complete(struct cpuset *set)
 {
 	LIST_REMOVE(set, cs_link);
 	cpuset_rel(set->cs_parent);
 	uma_zfree(cpuset_zone, set);
 }
 
 /*
  * Find a set based on an id.  Returns it with a ref.
  */
 static struct cpuset *
 cpuset_lookup(cpusetid_t setid, struct thread *td)
 {
 	struct cpuset *set;
 
 	if (setid == CPUSET_INVALID)
 		return (NULL);
 	mtx_lock_spin(&cpuset_lock);
 	LIST_FOREACH(set, &cpuset_ids, cs_link)
 		if (set->cs_id == setid)
 			break;
 	if (set)
 		cpuset_ref(set);
 	mtx_unlock_spin(&cpuset_lock);
 
 	KASSERT(td != NULL, ("[%s:%d] td is NULL", __func__, __LINE__));
 	if (set != NULL && jailed(td->td_ucred)) {
 		struct cpuset *jset, *tset;
 
 		jset = td->td_ucred->cr_prison->pr_cpuset;
 		for (tset = set; tset != NULL; tset = tset->cs_parent)
 			if (tset == jset)
 				break;
 		if (tset == NULL) {
 			cpuset_rel(set);
 			set = NULL;
 		}
 	}
 
 	return (set);
 }
 
 /*
  * Create a set in the space provided in 'set' with the provided parameters.
  * The set is returned with a single ref.  May return EDEADLK if the set
  * will have no valid cpu based on restrictions from the parent.
  */
 static int
 _cpuset_create(struct cpuset *set, struct cpuset *parent, const cpuset_t *mask,
     cpusetid_t id)
 {
 
 	if (!CPU_OVERLAP(&parent->cs_mask, mask))
 		return (EDEADLK);
 	CPU_COPY(mask, &set->cs_mask);
 	LIST_INIT(&set->cs_children);
 	refcount_init(&set->cs_ref, 1);
 	set->cs_flags = 0;
 	mtx_lock_spin(&cpuset_lock);
 	CPU_AND(&set->cs_mask, &parent->cs_mask);
 	set->cs_id = id;
 	set->cs_parent = cpuset_ref(parent);
 	LIST_INSERT_HEAD(&parent->cs_children, set, cs_siblings);
 	if (set->cs_id != CPUSET_INVALID)
 		LIST_INSERT_HEAD(&cpuset_ids, set, cs_link);
 	mtx_unlock_spin(&cpuset_lock);
 
 	return (0);
 }
 
 /*
  * Create a new non-anonymous set with the requested parent and mask.  May
  * return failures if the mask is invalid or a new number can not be
  * allocated.
  */
 static int
 cpuset_create(struct cpuset **setp, struct cpuset *parent, const cpuset_t *mask)
 {
 	struct cpuset *set;
 	cpusetid_t id;
 	int error;
 
 	id = alloc_unr(cpuset_unr);
 	if (id == -1)
 		return (ENFILE);
 	*setp = set = uma_zalloc(cpuset_zone, M_WAITOK);
 	error = _cpuset_create(set, parent, mask, id);
 	if (error == 0)
 		return (0);
 	free_unr(cpuset_unr, id);
 	uma_zfree(cpuset_zone, set);
 
 	return (error);
 }
 
 /*
  * Recursively check for errors that would occur from applying mask to
  * the tree of sets starting at 'set'.  Checks for sets that would become
  * empty as well as RDONLY flags.
  */
 static int
 cpuset_testupdate(struct cpuset *set, cpuset_t *mask)
 {
 	struct cpuset *nset;
 	cpuset_t newmask;
 	int error;
 
 	mtx_assert(&cpuset_lock, MA_OWNED);
 	if (set->cs_flags & CPU_SET_RDONLY)
 		return (EPERM);
 	if (!CPU_OVERLAP(&set->cs_mask, mask))
 		return (EDEADLK);
 	CPU_COPY(&set->cs_mask, &newmask);
 	CPU_AND(&newmask, mask);
 	error = 0;
 	LIST_FOREACH(nset, &set->cs_children, cs_siblings) 
 		if ((error = cpuset_testupdate(nset, &newmask)) != 0)
 			break;
 	return (error);
 }
 
 /*
  * Applies the mask 'mask' without checking for empty sets or permissions.
  */
 static void
 cpuset_update(struct cpuset *set, cpuset_t *mask)
 {
 	struct cpuset *nset;
 
 	mtx_assert(&cpuset_lock, MA_OWNED);
 	CPU_AND(&set->cs_mask, mask);
 	LIST_FOREACH(nset, &set->cs_children, cs_siblings) 
 		cpuset_update(nset, &set->cs_mask);
 
 	return;
 }
 
 /*
  * Modify the set 'set' to use a copy of the mask provided.  Apply this new
  * mask to restrict all children in the tree.  Checks for validity before
  * applying the changes.
  */
 static int
 cpuset_modify(struct cpuset *set, cpuset_t *mask)
 {
 	struct cpuset *root;
 	int error;
 
 	error = priv_check(curthread, PRIV_SCHED_CPUSET);
 	if (error)
 		return (error);
 	/*
 	 * When called from within a jail, we do not
 	 * allow modifying the dedicated root cpuset of
 	 * the jail, but still allow changing its child
 	 * sets.
 	 */
 	if (jailed(curthread->td_ucred) &&
 	    set->cs_flags & CPU_SET_ROOT)
 		return (EPERM);
 	/*
 	 * Verify that we have access to this set of
 	 * cpus.
 	 */
 	root = set->cs_parent;
 	if (root && !CPU_SUBSET(&root->cs_mask, mask))
 		return (EINVAL);
 	mtx_lock_spin(&cpuset_lock);
 	error = cpuset_testupdate(set, mask);
 	if (error)
 		goto out;
 	cpuset_update(set, mask);
 	CPU_COPY(mask, &set->cs_mask);
 out:
 	mtx_unlock_spin(&cpuset_lock);
 
 	return (error);
 }
 
 /*
  * Resolve the 'which' parameter of several cpuset apis.
  *
  * For WHICH_PID and WHICH_TID return a locked proc and valid proc/tid.  Also
  * checks for permission via p_cansched().
  *
  * For WHICH_SET returns a valid set with a new reference.
  *
  * -1 may be supplied for any argument to mean the current proc/thread or
  * the base set of the current thread.  May fail with ESRCH/EPERM.
  */
 static int
 cpuset_which(cpuwhich_t which, id_t id, struct proc **pp, struct thread **tdp,
     struct cpuset **setp)
 {
 	struct cpuset *set;
 	struct thread *td;
 	struct proc *p;
 	int error;
 
 	*pp = p = NULL;
 	*tdp = td = NULL;
 	*setp = set = NULL;
 	switch (which) {
 	case CPU_WHICH_PID:
 		if (id == -1) {
 			PROC_LOCK(curproc);
 			p = curproc;
 			break;
 		}
 		if ((p = pfind(id)) == NULL)
 			return (ESRCH);
 		break;
 	case CPU_WHICH_TID:
 		if (id == -1) {
 			PROC_LOCK(curproc);
 			p = curproc;
 			td = curthread;
 			break;
 		}
 		td = tdfind(id, -1);
 		if (td == NULL)
 			return (ESRCH);
 		p = td->td_proc;
 		break;
 	case CPU_WHICH_CPUSET:
 		if (id == -1) {
 			thread_lock(curthread);
 			set = cpuset_refbase(curthread->td_cpuset);
 			thread_unlock(curthread);
 		} else
 			set = cpuset_lookup(id, curthread);
 		if (set) {
 			*setp = set;
 			return (0);
 		}
 		return (ESRCH);
 	case CPU_WHICH_JAIL:
 	{
 		/* Find `set' for prison with given id. */
 		struct prison *pr;
 
 		sx_slock(&allprison_lock);
 		pr = prison_find_child(curthread->td_ucred->cr_prison, id);
 		sx_sunlock(&allprison_lock);
 		if (pr == NULL)
 			return (ESRCH);
 		cpuset_ref(pr->pr_cpuset);
 		*setp = pr->pr_cpuset;
 		mtx_unlock(&pr->pr_mtx);
 		return (0);
 	}
 	case CPU_WHICH_IRQ:
 		return (0);
 	default:
 		return (EINVAL);
 	}
 	error = p_cansched(curthread, p);
 	if (error) {
 		PROC_UNLOCK(p);
 		return (error);
 	}
 	if (td == NULL)
 		td = FIRST_THREAD_IN_PROC(p);
 	*pp = p;
 	*tdp = td;
 	return (0);
 }
 
 /*
  * Create an anonymous set with the provided mask in the space provided by
  * 'fset'.  If the passed in set is anonymous we use its parent otherwise
  * the new set is a child of 'set'.
  */
 static int
 cpuset_shadow(struct cpuset *set, struct cpuset *fset, const cpuset_t *mask)
 {
 	struct cpuset *parent;
 
 	if (set->cs_id == CPUSET_INVALID)
 		parent = set->cs_parent;
 	else
 		parent = set;
 	if (!CPU_SUBSET(&parent->cs_mask, mask))
 		return (EDEADLK);
 	return (_cpuset_create(fset, parent, mask, CPUSET_INVALID));
 }
 
 /*
  * Handle two cases for replacing the base set or mask of an entire process.
  *
  * 1) Set is non-null and mask is null.  This reparents all anonymous sets
  *    to the provided set and replaces all non-anonymous td_cpusets with the
  *    provided set.
  * 2) Mask is non-null and set is null.  This replaces or creates anonymous
  *    sets for every thread with the existing base as a parent.
  *
  * This is overly complicated because we can't allocate while holding a 
  * spinlock and spinlocks must be held while changing and examining thread
  * state.
  */
 static int
 cpuset_setproc(pid_t pid, struct cpuset *set, cpuset_t *mask)
 {
 	struct setlist freelist;
 	struct setlist droplist;
 	struct cpuset *tdset;
 	struct cpuset *nset;
 	struct thread *td;
 	struct proc *p;
 	int threads;
 	int nfree;
 	int error;
 	/*
 	 * The algorithm requires two passes due to locking considerations.
 	 * 
 	 * 1) Lookup the process and acquire the locks in the required order.
 	 * 2) If enough cpusets have not been allocated release the locks and
 	 *    allocate them.  Loop.
 	 */
 	LIST_INIT(&freelist);
 	LIST_INIT(&droplist);
 	nfree = 0;
 	for (;;) {
 		error = cpuset_which(CPU_WHICH_PID, pid, &p, &td, &nset);
 		if (error)
 			goto out;
 		if (nfree >= p->p_numthreads)
 			break;
 		threads = p->p_numthreads;
 		PROC_UNLOCK(p);
 		for (; nfree < threads; nfree++) {
 			nset = uma_zalloc(cpuset_zone, M_WAITOK);
 			LIST_INSERT_HEAD(&freelist, nset, cs_link);
 		}
 	}
 	PROC_LOCK_ASSERT(p, MA_OWNED);
 	/*
 	 * Now that the appropriate locks are held and we have enough cpusets,
 	 * make sure the operation will succeed before applying changes.  The
 	 * proc lock prevents td_cpuset from changing between calls.
 	 */
 	error = 0;
 	FOREACH_THREAD_IN_PROC(p, td) {
 		thread_lock(td);
 		tdset = td->td_cpuset;
 		/*
 		 * Verify that a new mask doesn't specify cpus outside of
 		 * the set the thread is a member of.
 		 */
 		if (mask) {
 			if (tdset->cs_id == CPUSET_INVALID)
 				tdset = tdset->cs_parent;
 			if (!CPU_SUBSET(&tdset->cs_mask, mask))
 				error = EDEADLK;
 		/*
 		 * Verify that a new set won't leave an existing thread
 		 * mask without a cpu to run on.  It can, however, restrict
 		 * the set.
 		 */
 		} else if (tdset->cs_id == CPUSET_INVALID) {
 			if (!CPU_OVERLAP(&set->cs_mask, &tdset->cs_mask))
 				error = EDEADLK;
 		}
 		thread_unlock(td);
 		if (error)
 			goto unlock_out;
 	}
 	/*
 	 * Replace each thread's cpuset while using deferred release.  We
 	 * must do this because the thread lock must be held while operating
 	 * on the thread and this limits the type of operations allowed.
 	 */
 	FOREACH_THREAD_IN_PROC(p, td) {
 		thread_lock(td);
 		/*
 		 * If we presently have an anonymous set or are applying a
 		 * mask we must create an anonymous shadow set.  That is
 		 * either parented to our existing base or the supplied set.
 		 *
 		 * If we have a base set with no anonymous shadow we simply
 		 * replace it outright.
 		 */
 		tdset = td->td_cpuset;
 		if (tdset->cs_id == CPUSET_INVALID || mask) {
 			nset = LIST_FIRST(&freelist);
 			LIST_REMOVE(nset, cs_link);
 			if (mask)
 				error = cpuset_shadow(tdset, nset, mask);
 			else
 				error = _cpuset_create(nset, set,
 				    &tdset->cs_mask, CPUSET_INVALID);
 			if (error) {
 				LIST_INSERT_HEAD(&freelist, nset, cs_link);
 				thread_unlock(td);
 				break;
 			}
 		} else
 			nset = cpuset_ref(set);
 		cpuset_rel_defer(&droplist, tdset);
 		td->td_cpuset = nset;
 		sched_affinity(td);
 		thread_unlock(td);
 	}
 unlock_out:
 	PROC_UNLOCK(p);
 out:
 	while ((nset = LIST_FIRST(&droplist)) != NULL)
 		cpuset_rel_complete(nset);
 	while ((nset = LIST_FIRST(&freelist)) != NULL) {
 		LIST_REMOVE(nset, cs_link);
 		uma_zfree(cpuset_zone, nset);
 	}
 	return (error);
 }
 
 /*
+ * Return the ffs()-style (1-based) index of the first set CPU, 0 if none.
+ */
+int
+cpusetobj_ffs(const cpuset_t *set)
+{
+	size_t i;
+	int cbit;
+
+	cbit = 0;
+	for (i = 0; i < _NCPUWORDS; i++) {
+		if (set->__bits[i] != 0) {
+			cbit = ffsl(set->__bits[i]);
+			cbit += i * _NCPUBITS;
+			break;
+		}
+	}
+	return (cbit);
+}
+
+/*
+ * Return a string representing a valid layout for a cpuset_t object.
+ * The caller must supply a buffer of at least CPUSETBUFSIZ bytes.
+ */
+char *
+cpusetobj_strprint(char *buf, const cpuset_t *set)
+{
+	char *tbuf;
+	size_t i, bytesp, bufsiz;
+
+	tbuf = buf;
+	bytesp = 0;
+	bufsiz = CPUSETBUFSIZ;
+
+	for (i = _NCPUWORDS - 1; i > 0; i--) {
+		bytesp = snprintf(tbuf, bufsiz, "%lx, ", set->__bits[i]);
+		bufsiz -= bytesp;
+		tbuf += bytesp;
+	}
+	snprintf(tbuf, bufsiz, "%lx", set->__bits[0]);
+	return (buf);
+}
+
+/*
+ * Build a valid cpuset_t object from a string representation.
+ * The string, including its terminating NUL, must fit in CPUSETBUFSIZ bytes.
+ */
+int
+cpusetobj_strscan(cpuset_t *set, const char *buf)
+{
+	u_int nwords;
+	int i, ret;
+
+	if (strlen(buf) > CPUSETBUFSIZ - 1)
+		return (-1);
+
+	/* Allow passing a shorter version of the mask when necessary. */
+	nwords = 1;
+	for (i = 0; buf[i] != '\0'; i++)
+		if (buf[i] == ',')
+			nwords++;
+	if (nwords > _NCPUWORDS)
+		return (-1);
+
+	CPU_ZERO(set);
+	for (i = nwords - 1; i > 0; i--) {
+		ret = sscanf(buf, "%lx, ", &set->__bits[i]);
+		if (ret == 0 || ret == -1)
+			return (-1);
+		buf = strstr(buf, " ");
+		if (buf == NULL)
+			return (-1);
+		buf++;
+	}
+	ret = sscanf(buf, "%lx", &set->__bits[0]);
+	if (ret == 0 || ret == -1)
+		return (-1);
+	return (0);
+}
+
+/*
  * Apply an anonymous mask to a single thread.
  */
 int
 cpuset_setthread(lwpid_t id, cpuset_t *mask)
 {
 	struct cpuset *nset;
 	struct cpuset *set;
 	struct thread *td;
 	struct proc *p;
 	int error;
 
 	nset = uma_zalloc(cpuset_zone, M_WAITOK);
 	error = cpuset_which(CPU_WHICH_TID, id, &p, &td, &set);
 	if (error)
 		goto out;
 	set = NULL;
 	thread_lock(td);
 	error = cpuset_shadow(td->td_cpuset, nset, mask);
 	if (error == 0) {
 		set = td->td_cpuset;
 		td->td_cpuset = nset;
 		sched_affinity(td);
 		nset = NULL;
 	}
 	thread_unlock(td);
 	PROC_UNLOCK(p);
 	if (set)
 		cpuset_rel(set);
 out:
 	if (nset)
 		uma_zfree(cpuset_zone, nset);
 	return (error);
 }
 
 /*
  * Creates the cpuset for thread0.  We make two sets:
  * 
  * 0 - The root set which should represent all valid processors in the
  *     system.  It is initially created with a mask of all processors
  *     because we don't know what processors are valid until cpuset_init()
  *     runs.  This set is immutable.
  * 1 - The default set which all processes are a member of until changed.
  *     This allows an administrator to move all threads off of given cpus to
  *     dedicate them to high priority tasks or save power etc.
  */
 struct cpuset *
 cpuset_thread0(void)
 {
 	struct cpuset *set;
 	int error;
 
 	cpuset_zone = uma_zcreate("cpuset", sizeof(struct cpuset), NULL, NULL,
 	    NULL, NULL, UMA_ALIGN_PTR, 0);
 	mtx_init(&cpuset_lock, "cpuset", NULL, MTX_SPIN | MTX_RECURSE);
 	/*
 	 * Create the root system set for the whole machine.  Doesn't use
 	 * cpuset_create() due to NULL parent.
 	 */
 	set = uma_zalloc(cpuset_zone, M_WAITOK | M_ZERO);
 	CPU_FILL(&set->cs_mask);
 	LIST_INIT(&set->cs_children);
 	LIST_INSERT_HEAD(&cpuset_ids, set, cs_link);
 	set->cs_ref = 1;
 	set->cs_flags = CPU_SET_ROOT;
 	cpuset_zero = set;
 	cpuset_root = &set->cs_mask;
 	/*
 	 * Now derive a default, modifiable set from that to give out.
 	 */
 	set = uma_zalloc(cpuset_zone, M_WAITOK);
 	error = _cpuset_create(set, cpuset_zero, &cpuset_zero->cs_mask, 1);
 	KASSERT(error == 0, ("Error creating default set: %d\n", error));
 	/*
 	 * Initialize the unit allocator. 0 and 1 are allocated above.
 	 */
 	cpuset_unr = new_unrhdr(2, INT_MAX, NULL);
 
 	return (set);
 }
 
 /*
  * Create a cpuset as cpuset_create() would, but
  * mark the new 'set' as root.
  *
  * We are not going to reparent the td to it.  Use cpuset_setproc_update_set()
  * for that.
  *
  * In case of no error, returns the set in *setp locked with a reference.
  */
 int
 cpuset_create_root(struct prison *pr, struct cpuset **setp)
 {
 	struct cpuset *set;
 	int error;
 
 	KASSERT(pr != NULL, ("[%s:%d] invalid pr", __func__, __LINE__));
 	KASSERT(setp != NULL, ("[%s:%d] invalid setp", __func__, __LINE__));
 
 	error = cpuset_create(setp, pr->pr_cpuset, &pr->pr_cpuset->cs_mask);
 	if (error)
 		return (error);
 
 	KASSERT(*setp != NULL, ("[%s:%d] cpuset_create returned invalid data",
 	    __func__, __LINE__));
 
 	/* Mark the set as root. */
 	set = *setp;
 	set->cs_flags |= CPU_SET_ROOT;
 
 	return (0);
 }
 
 int
 cpuset_setproc_update_set(struct proc *p, struct cpuset *set)
 {
 	int error;
 
 	KASSERT(p != NULL, ("[%s:%d] invalid proc", __func__, __LINE__));
 	KASSERT(set != NULL, ("[%s:%d] invalid set", __func__, __LINE__));
 
 	cpuset_ref(set);
 	error = cpuset_setproc(p->p_pid, set, NULL);
 	if (error)
 		return (error);
 	cpuset_rel(set);
 	return (0);
 }
 
 /*
  * This is called once the final set of system cpus is known.  Modifies
  * the root set and all children and marks the root read-only.
  */
 static void
 cpuset_init(void *arg)
 {
 	cpuset_t mask;
 
-	CPU_ZERO(&mask);
-#ifdef SMP
-	mask.__bits[0] = all_cpus;
-#else
-	mask.__bits[0] = 1;
-#endif
+	mask = all_cpus;
 	if (cpuset_modify(cpuset_zero, &mask))
 		panic("Can't set initial cpuset mask.\n");
 	cpuset_zero->cs_flags |= CPU_SET_RDONLY;
 }
 SYSINIT(cpuset, SI_SUB_SMP, SI_ORDER_ANY, cpuset_init, NULL);
 
 #ifndef _SYS_SYSPROTO_H_
 struct cpuset_args {
 	cpusetid_t	*setid;
 };
 #endif
 int
 cpuset(struct thread *td, struct cpuset_args *uap)
 {
 	struct cpuset *root;
 	struct cpuset *set;
 	int error;
 
 	thread_lock(td);
 	root = cpuset_refroot(td->td_cpuset);
 	thread_unlock(td);
 	error = cpuset_create(&set, root, &root->cs_mask);
 	cpuset_rel(root);
 	if (error)
 		return (error);
 	error = copyout(&set->cs_id, uap->setid, sizeof(set->cs_id));
 	if (error == 0)
 		error = cpuset_setproc(-1, set, NULL);
 	cpuset_rel(set);
 	return (error);
 }
 
 #ifndef _SYS_SYSPROTO_H_
 struct cpuset_setid_args {
 	cpuwhich_t	which;
 	id_t		id;
 	cpusetid_t	setid;
 };
 #endif
 int
 cpuset_setid(struct thread *td, struct cpuset_setid_args *uap)
 {
 	struct cpuset *set;
 	int error;
 
 	/*
 	 * Presently we only support per-process sets.
 	 */
 	if (uap->which != CPU_WHICH_PID)
 		return (EINVAL);
 	set = cpuset_lookup(uap->setid, td);
 	if (set == NULL)
 		return (ESRCH);
 	error = cpuset_setproc(uap->id, set, NULL);
 	cpuset_rel(set);
 	return (error);
 }
 
 #ifndef _SYS_SYSPROTO_H_
 struct cpuset_getid_args {
 	cpulevel_t	level;
 	cpuwhich_t	which;
 	id_t		id;
 	cpusetid_t	*setid;
 #endif
 int
 cpuset_getid(struct thread *td, struct cpuset_getid_args *uap)
 {
 	struct cpuset *nset;
 	struct cpuset *set;
 	struct thread *ttd;
 	struct proc *p;
 	cpusetid_t id;
 	int error;
 
 	if (uap->level == CPU_LEVEL_WHICH && uap->which != CPU_WHICH_CPUSET)
 		return (EINVAL);
 	error = cpuset_which(uap->which, uap->id, &p, &ttd, &set);
 	if (error)
 		return (error);
 	switch (uap->which) {
 	case CPU_WHICH_TID:
 	case CPU_WHICH_PID:
 		thread_lock(ttd);
 		set = cpuset_refbase(ttd->td_cpuset);
 		thread_unlock(ttd);
 		PROC_UNLOCK(p);
 		break;
 	case CPU_WHICH_CPUSET:
 	case CPU_WHICH_JAIL:
 		break;
 	case CPU_WHICH_IRQ:
 		return (EINVAL);
 	}
 	switch (uap->level) {
 	case CPU_LEVEL_ROOT:
 		nset = cpuset_refroot(set);
 		cpuset_rel(set);
 		set = nset;
 		break;
 	case CPU_LEVEL_CPUSET:
 		break;
 	case CPU_LEVEL_WHICH:
 		break;
 	}
 	id = set->cs_id;
 	cpuset_rel(set);
 	if (error == 0)
 		error = copyout(&id, uap->setid, sizeof(id));
 
 	return (error);
 }
 
 #ifndef _SYS_SYSPROTO_H_
 struct cpuset_getaffinity_args {
 	cpulevel_t	level;
 	cpuwhich_t	which;
 	id_t		id;
 	size_t		cpusetsize;
 	cpuset_t	*mask;
 };
 #endif
 int
 cpuset_getaffinity(struct thread *td, struct cpuset_getaffinity_args *uap)
 {
 	struct thread *ttd;
 	struct cpuset *nset;
 	struct cpuset *set;
 	struct proc *p;
 	cpuset_t *mask;
 	int error;
 	size_t size;
 
 	if (uap->cpusetsize < sizeof(cpuset_t) ||
 	    uap->cpusetsize > CPU_MAXSIZE / NBBY)
 		return (ERANGE);
 	size = uap->cpusetsize;
 	mask = malloc(size, M_TEMP, M_WAITOK | M_ZERO);
 	error = cpuset_which(uap->which, uap->id, &p, &ttd, &set);
 	if (error)
 		goto out;
 	switch (uap->level) {
 	case CPU_LEVEL_ROOT:
 	case CPU_LEVEL_CPUSET:
 		switch (uap->which) {
 		case CPU_WHICH_TID:
 		case CPU_WHICH_PID:
 			thread_lock(ttd);
 			set = cpuset_ref(ttd->td_cpuset);
 			thread_unlock(ttd);
 			break;
 		case CPU_WHICH_CPUSET:
 		case CPU_WHICH_JAIL:
 			break;
 		case CPU_WHICH_IRQ:
 			error = EINVAL;
 			goto out;
 		}
 		if (uap->level == CPU_LEVEL_ROOT)
 			nset = cpuset_refroot(set);
 		else
 			nset = cpuset_refbase(set);
 		CPU_COPY(&nset->cs_mask, mask);
 		cpuset_rel(nset);
 		break;
 	case CPU_LEVEL_WHICH:
 		switch (uap->which) {
 		case CPU_WHICH_TID:
 			thread_lock(ttd);
 			CPU_COPY(&ttd->td_cpuset->cs_mask, mask);
 			thread_unlock(ttd);
 			break;
 		case CPU_WHICH_PID:
 			FOREACH_THREAD_IN_PROC(p, ttd) {
 				thread_lock(ttd);
 				CPU_OR(mask, &ttd->td_cpuset->cs_mask);
 				thread_unlock(ttd);
 			}
 			break;
 		case CPU_WHICH_CPUSET:
 		case CPU_WHICH_JAIL:
 			CPU_COPY(&set->cs_mask, mask);
 			break;
 		case CPU_WHICH_IRQ:
 			error = intr_getaffinity(uap->id, mask);
 			break;
 		}
 		break;
 	default:
 		error = EINVAL;
 		break;
 	}
 	if (set)
 		cpuset_rel(set);
 	if (p)
 		PROC_UNLOCK(p);
 	if (error == 0)
 		error = copyout(mask, uap->mask, size);
 out:
 	free(mask, M_TEMP);
 	return (error);
 }
 
 #ifndef _SYS_SYSPROTO_H_
 struct cpuset_setaffinity_args {
 	cpulevel_t	level;
 	cpuwhich_t	which;
 	id_t		id;
 	size_t		cpusetsize;
 	const cpuset_t	*mask;
 };
 #endif
 int
 cpuset_setaffinity(struct thread *td, struct cpuset_setaffinity_args *uap)
 {
 	struct cpuset *nset;
 	struct cpuset *set;
 	struct thread *ttd;
 	struct proc *p;
 	cpuset_t *mask;
 	int error;
 
 	if (uap->cpusetsize < sizeof(cpuset_t) ||
 	    uap->cpusetsize > CPU_MAXSIZE / NBBY)
 		return (ERANGE);
 	mask = malloc(uap->cpusetsize, M_TEMP, M_WAITOK | M_ZERO);
 	error = copyin(uap->mask, mask, uap->cpusetsize);
 	if (error)
 		goto out;
 	/*
 	 * Verify that no high bits are set.
 	 */
 	if (uap->cpusetsize > sizeof(cpuset_t)) {
 		char *end;
 		char *cp;
 
 		end = cp = (char *)&mask->__bits;
 		end += uap->cpusetsize;
 		cp += sizeof(cpuset_t);
 		while (cp != end)
 			if (*cp++ != 0) {
 				error = EINVAL;
 				goto out;
 			}
 
 	}
 	switch (uap->level) {
 	case CPU_LEVEL_ROOT:
 	case CPU_LEVEL_CPUSET:
 		error = cpuset_which(uap->which, uap->id, &p, &ttd, &set);
 		if (error)
 			break;
 		switch (uap->which) {
 		case CPU_WHICH_TID:
 		case CPU_WHICH_PID:
 			thread_lock(ttd);
 			set = cpuset_ref(ttd->td_cpuset);
 			thread_unlock(ttd);
 			PROC_UNLOCK(p);
 			break;
 		case CPU_WHICH_CPUSET:
 		case CPU_WHICH_JAIL:
 			break;
 		case CPU_WHICH_IRQ:
 			error = EINVAL;
 			goto out;
 		}
 		if (uap->level == CPU_LEVEL_ROOT)
 			nset = cpuset_refroot(set);
 		else
 			nset = cpuset_refbase(set);
 		error = cpuset_modify(nset, mask);
 		cpuset_rel(nset);
 		cpuset_rel(set);
 		break;
 	case CPU_LEVEL_WHICH:
 		switch (uap->which) {
 		case CPU_WHICH_TID:
 			error = cpuset_setthread(uap->id, mask);
 			break;
 		case CPU_WHICH_PID:
 			error = cpuset_setproc(uap->id, NULL, mask);
 			break;
 		case CPU_WHICH_CPUSET:
 		case CPU_WHICH_JAIL:
 			error = cpuset_which(uap->which, uap->id, &p,
 			    &ttd, &set);
 			if (error == 0) {
 				error = cpuset_modify(set, mask);
 				cpuset_rel(set);
 			}
 			break;
 		case CPU_WHICH_IRQ:
 			error = intr_setaffinity(uap->id, mask);
 			break;
 		default:
 			error = EINVAL;
 			break;
 		}
 		break;
 	default:
 		error = EINVAL;
 		break;
 	}
 out:
 	free(mask, M_TEMP);
 	return (error);
 }
 
 #ifdef DDB
 DB_SHOW_COMMAND(cpusets, db_show_cpusets)
 {
 	struct cpuset *set;
 	int cpu, once;
 
 	LIST_FOREACH(set, &cpuset_ids, cs_link) {
 		db_printf("set=%p id=%-6u ref=%-6d flags=0x%04x parent id=%d\n",
 		    set, set->cs_id, set->cs_ref, set->cs_flags,
 		    (set->cs_parent != NULL) ? set->cs_parent->cs_id : 0);
 		db_printf("  mask=");
 		for (once = 0, cpu = 0; cpu < CPU_SETSIZE; cpu++) {
 			if (CPU_ISSET(cpu, &set->cs_mask)) {
 				if (once == 0) {
 					db_printf("%d", cpu);
 					once = 1;
 				} else  
 					db_printf(",%d", cpu);
 			}
 		}
 		db_printf("\n");
 		if (db_pager_quit)
 			break;
 	}
 }
 #endif /* DDB */
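Reviewer note on the string format used by cpusetobj_strprint() and cpusetobj_strscan() above: words are written most significant first, in hex, separated by ", ", and the scanner accepts fewer words than the full mask. The following is a minimal userland sketch of that round trip, not the kernel code itself. EX_NWORDS, EX_BUFSIZ, struct ex_mask, ex_strprint() and ex_strscan() are hypothetical stand-ins for _NCPUWORDS, CPUSETBUFSIZ, cpuset_t and the two kernel functions; the buffer-size formula is an assumption chosen so that every word plus its separator fits.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-ins; these are not the kernel's definitions. */
#define	EX_NWORDS	4
#define	EX_BUFSIZ	((2 + sizeof(long) * 2) * EX_NWORDS)

struct ex_mask {
	unsigned long bits[EX_NWORDS];
};

/* Print the words from most to least significant, separated by ", ". */
static char *
ex_strprint(char *buf, const struct ex_mask *m)
{
	char *p = buf;
	size_t left = EX_BUFSIZ;
	int i, n;

	for (i = EX_NWORDS - 1; i > 0; i--) {
		n = snprintf(p, left, "%lx, ", m->bits[i]);
		p += n;
		left -= n;
	}
	snprintf(p, left, "%lx", m->bits[0]);
	return (buf);
}

/* Parse the same layout; a string with fewer words fills the low words. */
static int
ex_strscan(struct ex_mask *m, const char *buf)
{
	int i, nwords;

	if (strlen(buf) > EX_BUFSIZ - 1)
		return (-1);
	/* One word per comma, plus the final word. */
	nwords = 1;
	for (i = 0; buf[i] != '\0'; i++)
		if (buf[i] == ',')
			nwords++;
	if (nwords > EX_NWORDS)
		return (-1);

	memset(m, 0, sizeof(*m));
	for (i = nwords - 1; i > 0; i--) {
		if (sscanf(buf, "%lx, ", &m->bits[i]) != 1)
			return (-1);
		/* Skip past the blank that follows the comma. */
		buf = strstr(buf, " ");
		if (buf == NULL)
			return (-1);
		buf++;
	}
	if (sscanf(buf, "%lx", &m->bits[0]) != 1)
		return (-1);
	return (0);
}
```

With these assumptions, a mask whose word 0 is 0x5 and word 2 is 0x80 prints as "0, 80, 0, 5", and scanning the short form "3" sets only word 0. This is why a value such as debug.ktr.cpumask can be given as a single hex word on small systems.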
Index: head/sys/kern/kern_ktr.c
===================================================================
--- head/sys/kern/kern_ktr.c	(revision 222812)
+++ head/sys/kern/kern_ktr.c	(revision 222813)
@@ -1,359 +1,400 @@
 /*-
  * Copyright (c) 2000 John Baldwin <jhb@FreeBSD.org>
  * All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
  * 1. Redistributions of source code must retain the above copyright
  *    notice, this list of conditions and the following disclaimer.
  * 2. Redistributions in binary form must reproduce the above copyright
  *    notice, this list of conditions and the following disclaimer in the
  *    documentation and/or other materials provided with the distribution.
  * 3. Neither the name of the author nor the names of any co-contributors
  *    may be used to endorse or promote products derived from this software
  *    without specific prior written permission.
  *
  * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
  * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
  * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
  * ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
  * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
  * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
  * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
  * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
  * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
  * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
  * SUCH DAMAGE.
  */
 
 /*
  * This module holds the global variables used by KTR and the ktr_tracepoint()
  * function that does the actual tracing.
  */
 
 #include <sys/cdefs.h>
 __FBSDID("$FreeBSD$");
 
 #include "opt_ddb.h"
 #include "opt_ktr.h"
 #include "opt_alq.h"
 
 #include <sys/param.h>
+#include <sys/queue.h>
 #include <sys/alq.h>
 #include <sys/cons.h>
+#include <sys/cpuset.h>
 #include <sys/kernel.h>
 #include <sys/ktr.h>
 #include <sys/libkern.h>
 #include <sys/proc.h>
 #include <sys/sysctl.h>
 #include <sys/systm.h>
 #include <sys/time.h>
 
 #include <machine/cpu.h>
 #ifdef __sparc64__
 #include <machine/ktr.h>
 #endif
 
 #ifdef DDB
 #include <ddb/ddb.h>
 #include <ddb/db_output.h>
 #endif
 
 #ifndef KTR_ENTRIES
 #define	KTR_ENTRIES	1024
 #endif
 
 #ifndef KTR_MASK
 #define	KTR_MASK	(0)
 #endif
 
-#ifndef KTR_CPUMASK
-#define	KTR_CPUMASK	(~0)
-#endif
-
 #ifndef KTR_TIME
 #define	KTR_TIME	get_cyclecount()
 #endif
 
 #ifndef KTR_CPU
 #define	KTR_CPU		PCPU_GET(cpuid)
 #endif
 
 FEATURE(ktr, "Kernel support for KTR kernel tracing facility");
 
 SYSCTL_NODE(_debug, OID_AUTO, ktr, CTLFLAG_RD, 0, "KTR options");
 
-int	ktr_cpumask = KTR_CPUMASK;
-TUNABLE_INT("debug.ktr.cpumask", &ktr_cpumask);
-SYSCTL_INT(_debug_ktr, OID_AUTO, cpumask, CTLFLAG_RW,
-    &ktr_cpumask, 0, "Bitmask of CPUs on which KTR logging is enabled");
-
 int	ktr_mask = KTR_MASK;
 TUNABLE_INT("debug.ktr.mask", &ktr_mask);
 SYSCTL_INT(_debug_ktr, OID_AUTO, mask, CTLFLAG_RW,
     &ktr_mask, 0, "Bitmask of KTR event classes for which logging is enabled");
 
 int	ktr_compile = KTR_COMPILE;
 SYSCTL_INT(_debug_ktr, OID_AUTO, compile, CTLFLAG_RD,
     &ktr_compile, 0, "Bitmask of KTR event classes compiled into the kernel");
 
 int	ktr_entries = KTR_ENTRIES;
 SYSCTL_INT(_debug_ktr, OID_AUTO, entries, CTLFLAG_RD,
     &ktr_entries, 0, "Number of entries in the KTR buffer");
 
 int	ktr_version = KTR_VERSION;
 SYSCTL_INT(_debug_ktr, OID_AUTO, version, CTLFLAG_RD,
     &ktr_version, 0, "Version of the KTR interface");
 
+cpuset_t ktr_cpumask;
+static char ktr_cpumask_str[CPUSETBUFSIZ];
+TUNABLE_STR("debug.ktr.cpumask", ktr_cpumask_str, sizeof(ktr_cpumask_str));
+
+static void
+ktr_cpumask_initializer(void *dummy __unused)
+{
+
+	CPU_FILL(&ktr_cpumask);
+#ifdef KTR_CPUMASK
+	if (cpusetobj_strscan(&ktr_cpumask, KTR_CPUMASK) == -1)
+		CPU_FILL(&ktr_cpumask);
+#endif
+
+	/*
+	 * TUNABLE_STR() runs at SI_ORDER_MIDDLE, so ktr_cpumask_str has
+	 * already been fetched from the environment, if it was set.
+	 */
+	if (ktr_cpumask_str[0] != '\0' &&
+	    cpusetobj_strscan(&ktr_cpumask, ktr_cpumask_str) == -1)
+		CPU_FILL(&ktr_cpumask);
+}
+SYSINIT(ktr_cpumask_initializer, SI_SUB_TUNABLES, SI_ORDER_ANY,
+    ktr_cpumask_initializer, NULL);
+
+static int
+sysctl_debug_ktr_cpumask(SYSCTL_HANDLER_ARGS)
+{
+	char lktr_cpumask_str[CPUSETBUFSIZ];
+	cpuset_t imask;
+	int error;
+
+	cpusetobj_strprint(lktr_cpumask_str, &ktr_cpumask);
+	error = sysctl_handle_string(oidp, lktr_cpumask_str,
+	    sizeof(lktr_cpumask_str), req);
+	if (error != 0 || req->newptr == NULL)
+		return (error);
+	if (cpusetobj_strscan(&imask, lktr_cpumask_str) == -1)
+		return (EINVAL);
+	CPU_COPY(&imask, &ktr_cpumask);
+
+	return (error);
+}
+SYSCTL_PROC(_debug_ktr, OID_AUTO, cpumask,
+    CTLFLAG_RW | CTLFLAG_MPSAFE | CTLTYPE_STRING, NULL, 0,
+    sysctl_debug_ktr_cpumask, "S",
+    "Bitmask of CPUs on which KTR logging is enabled");
+
 volatile int	ktr_idx = 0;
 struct	ktr_entry ktr_buf[KTR_ENTRIES];
 
 static int
 sysctl_debug_ktr_clear(SYSCTL_HANDLER_ARGS)
 {
 	int clear, error;
 
 	clear = 0;
 	error = sysctl_handle_int(oidp, &clear, 0, req);
 	if (error || !req->newptr)
 		return (error);
 
 	if (clear) {
 		bzero(ktr_buf, sizeof(ktr_buf));
 		ktr_idx = 0;
 	}
 
 	return (error);
 }
 SYSCTL_PROC(_debug_ktr, OID_AUTO, clear, CTLTYPE_INT|CTLFLAG_RW, 0, 0,
     sysctl_debug_ktr_clear, "I", "Clear KTR Buffer");
 
 #ifdef KTR_VERBOSE
 int	ktr_verbose = KTR_VERBOSE;
 TUNABLE_INT("debug.ktr.verbose", &ktr_verbose);
 SYSCTL_INT(_debug_ktr, OID_AUTO, verbose, CTLFLAG_RW, &ktr_verbose, 0, "");
 #endif
 
 #ifdef KTR_ALQ
 struct alq *ktr_alq;
 char	ktr_alq_file[MAXPATHLEN] = "/tmp/ktr.out";
 int	ktr_alq_cnt = 0;
 int	ktr_alq_depth = KTR_ENTRIES;
 int	ktr_alq_enabled = 0;
 int	ktr_alq_failed = 0;
 int	ktr_alq_max = 0;
 
 SYSCTL_INT(_debug_ktr, OID_AUTO, alq_max, CTLFLAG_RW, &ktr_alq_max, 0,
     "Maximum number of entries to write");
 SYSCTL_INT(_debug_ktr, OID_AUTO, alq_cnt, CTLFLAG_RD, &ktr_alq_cnt, 0,
     "Current number of written entries");
 SYSCTL_INT(_debug_ktr, OID_AUTO, alq_failed, CTLFLAG_RD, &ktr_alq_failed, 0,
     "Number of times we overran the buffer");
 SYSCTL_INT(_debug_ktr, OID_AUTO, alq_depth, CTLFLAG_RW, &ktr_alq_depth, 0,
     "Number of items in the write buffer");
 SYSCTL_STRING(_debug_ktr, OID_AUTO, alq_file, CTLFLAG_RW, ktr_alq_file,
     sizeof(ktr_alq_file), "KTR logging file");
 
 static int
 sysctl_debug_ktr_alq_enable(SYSCTL_HANDLER_ARGS)
 {
 	int error;
 	int enable;
 
 	enable = ktr_alq_enabled;
 
 	error = sysctl_handle_int(oidp, &enable, 0, req);
 	if (error || !req->newptr)
 		return (error);
 
 	if (enable) {
 		if (ktr_alq_enabled)
 			return (0);
 		error = alq_open(&ktr_alq, (const char *)ktr_alq_file,
 		    req->td->td_ucred, ALQ_DEFAULT_CMODE,
 		    sizeof(struct ktr_entry), ktr_alq_depth);
 		if (error == 0) {
 			ktr_alq_cnt = 0;
 			ktr_alq_failed = 0;
 			ktr_alq_enabled = 1;
 		}
 	} else {
 		if (ktr_alq_enabled == 0)
 			return (0);
 		ktr_alq_enabled = 0;
 		alq_close(ktr_alq);
 		ktr_alq = NULL;
 	}
 
 	return (error);
 }
 SYSCTL_PROC(_debug_ktr, OID_AUTO, alq_enable,
     CTLTYPE_INT|CTLFLAG_RW, 0, 0, sysctl_debug_ktr_alq_enable,
     "I", "Enable KTR logging");
 #endif
 
 void
 ktr_tracepoint(u_int mask, const char *file, int line, const char *format,
     u_long arg1, u_long arg2, u_long arg3, u_long arg4, u_long arg5,
     u_long arg6)
 {
 	struct ktr_entry *entry;
 #ifdef KTR_ALQ
 	struct ale *ale = NULL;
 #endif
 	int newindex, saveindex;
 #if defined(KTR_VERBOSE) || defined(KTR_ALQ)
 	struct thread *td;
 #endif
 	int cpu;
 
 	if (panicstr)
 		return;
 	if ((ktr_mask & mask) == 0)
 		return;
 	cpu = KTR_CPU;
-	if (((1 << cpu) & ktr_cpumask) == 0)
+	if (!CPU_ISSET(cpu, &ktr_cpumask))
 		return;
 #if defined(KTR_VERBOSE) || defined(KTR_ALQ)
 	td = curthread;
 	if (td->td_pflags & TDP_INKTR)
 		return;
 	td->td_pflags |= TDP_INKTR;
 #endif
 #ifdef KTR_ALQ
 	if (ktr_alq_enabled) {
 		if (td->td_critnest == 0 &&
 		    (td->td_flags & TDF_IDLETD) == 0 &&
 		    td != ald_thread) {
 			if (ktr_alq_max && ktr_alq_cnt > ktr_alq_max)
 				goto done;
 			if ((ale = alq_get(ktr_alq, ALQ_NOWAIT)) == NULL) {
 				ktr_alq_failed++;
 				goto done;
 			}
 			ktr_alq_cnt++;
 			entry = (struct ktr_entry *)ale->ae_data;
 		} else {
 			goto done;
 		}
 	} else
 #endif
 	{
 		do {
 			saveindex = ktr_idx;
 			newindex = (saveindex + 1) & (KTR_ENTRIES - 1);
 		} while (atomic_cmpset_rel_int(&ktr_idx, saveindex, newindex) == 0);
 		entry = &ktr_buf[saveindex];
 	}
 	entry->ktr_timestamp = KTR_TIME;
 	entry->ktr_cpu = cpu;
 	entry->ktr_thread = curthread;
 	if (file != NULL)
 		while (strncmp(file, "../", 3) == 0)
 			file += 3;
 	entry->ktr_file = file;
 	entry->ktr_line = line;
 #ifdef KTR_VERBOSE
 	if (ktr_verbose) {
 #ifdef SMP
 		printf("cpu%d ", cpu);
 #endif
 		if (ktr_verbose > 1) {
 			printf("%s.%d\t", entry->ktr_file,
 			    entry->ktr_line);
 		}
 		printf(format, arg1, arg2, arg3, arg4, arg5, arg6);
 		printf("\n");
 	}
 #endif
 	entry->ktr_desc = format;
 	entry->ktr_parms[0] = arg1;
 	entry->ktr_parms[1] = arg2;
 	entry->ktr_parms[2] = arg3;
 	entry->ktr_parms[3] = arg4;
 	entry->ktr_parms[4] = arg5;
 	entry->ktr_parms[5] = arg6;
 #ifdef KTR_ALQ
 	if (ktr_alq_enabled && ale)
 		alq_post(ktr_alq, ale);
 done:
 #endif
 #if defined(KTR_VERBOSE) || defined(KTR_ALQ)
 	td->td_pflags &= ~TDP_INKTR;
 #endif
 }
 
 #ifdef DDB
 
 struct tstate {
 	int	cur;
 	int	first;
 };
 static	struct tstate tstate;
 static	int db_ktr_verbose;
 static	int db_mach_vtrace(void);
 
 DB_SHOW_COMMAND(ktr, db_ktr_all)
 {
 	
 	tstate.cur = (ktr_idx - 1) & (KTR_ENTRIES - 1);
 	tstate.first = -1;
 	db_ktr_verbose = 0;
 	db_ktr_verbose |= (index(modif, 'v') != NULL) ? 2 : 0;
 	db_ktr_verbose |= (index(modif, 'V') != NULL) ? 1 : 0; /* just timestamp please */
 	if (index(modif, 'a') != NULL) {
 		db_disable_pager();
 		while (cncheckc() != -1)
 			if (db_mach_vtrace() == 0)
 				break;
 	} else {
 		while (!db_pager_quit)
 			if (db_mach_vtrace() == 0)
 				break;
 	}
 }
 
 static int
 db_mach_vtrace(void)
 {
 	struct ktr_entry	*kp;
 
 	if (tstate.cur == tstate.first) {
 		db_printf("--- End of trace buffer ---\n");
 		return (0);
 	}
 	kp = &ktr_buf[tstate.cur];
 
 	/* Skip over unused entries. */
 	if (kp->ktr_desc == NULL) {
 		db_printf("--- End of trace buffer ---\n");
 		return (0);
 	}
 	db_printf("%d (%p", tstate.cur, kp->ktr_thread);
 #ifdef SMP
 	db_printf(":cpu%d", kp->ktr_cpu);
 #endif
 	db_printf(")");
 	if (db_ktr_verbose >= 1) {
 		db_printf(" %10.10lld", (long long)kp->ktr_timestamp);
 	}
 	if (db_ktr_verbose >= 2) {
 		db_printf(" %s.%d", kp->ktr_file, kp->ktr_line);
 	}
 	db_printf(": ");
 	db_printf(kp->ktr_desc, kp->ktr_parms[0], kp->ktr_parms[1],
 	    kp->ktr_parms[2], kp->ktr_parms[3], kp->ktr_parms[4],
 	    kp->ktr_parms[5]);
 	db_printf("\n");
 
 	if (tstate.first == -1)
 		tstate.first = tstate.cur;
 
 	if (--tstate.cur < 0)
 		tstate.cur = KTR_ENTRIES - 1;
 
 	return (1);
 }
 
 #endif	/* DDB */
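
Note on the change from `((1 << cpu) & ktr_cpumask)` to `CPU_ISSET(cpu, &ktr_cpumask)` in the hunk above: a single-word shift can only address as many CPUs as the word has bits, whereas a set stored across several words can address more. The sketch below is a userland illustration only, not the kernel's cpuset code; `demo_cpuset_t`, `DEMO_MAXCPU`, and the `demo_cpu_*` helpers are hypothetical names invented for this note, and the choice of 128 CPUs is an assumption.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins; the real cpuset_t and CPU_* macros are kernel-provided. */
#define	DEMO_MAXCPU	128
#define	DEMO_WORD_BITS	(sizeof(unsigned long) * CHAR_BIT)
#define	DEMO_NWORDS	((DEMO_MAXCPU + DEMO_WORD_BITS - 1) / DEMO_WORD_BITS)

typedef struct {
	unsigned long bits[DEMO_NWORDS];
} demo_cpuset_t;

/* Clear every CPU in the set. */
static void
demo_cpu_zero(demo_cpuset_t *set)
{

	memset(set, 0, sizeof(*set));
}

/* Add one CPU: pick the word first, then the bit inside that word. */
static void
demo_cpu_set(int cpu, demo_cpuset_t *set)
{

	assert(cpu >= 0 && cpu < DEMO_MAXCPU);
	set->bits[cpu / DEMO_WORD_BITS] |= 1UL << (cpu % DEMO_WORD_BITS);
}

/* Membership test, playing the role CPU_ISSET plays in the diff. */
static bool
demo_cpu_isset(int cpu, const demo_cpuset_t *set)
{

	assert(cpu >= 0 && cpu < DEMO_MAXCPU);
	return (((set->bits[cpu / DEMO_WORD_BITS] >>
	    (cpu % DEMO_WORD_BITS)) & 1UL) != 0);
}
```

With this layout, CPU 70 and CPU 6 land in different words (or at least different bits), so setting one never makes the other appear set, which a plain `1 << cpu` on a 32-bit or 64-bit integer cannot guarantee.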
Index: head/sys/kern/kern_pmc.c
===================================================================
--- head/sys/kern/kern_pmc.c	(revision 222812)
+++ head/sys/kern/kern_pmc.c	(revision 222813)
@@ -1,184 +1,184 @@
 /*-
  * Copyright (c) 2003-2008 Joseph Koshy
  * Copyright (c) 2007 The FreeBSD Foundation
  * All rights reserved.
  *
  * Portions of this software were developed by A. Joseph Koshy under
  * sponsorship from the FreeBSD Foundation and Google, Inc.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
  * 1. Redistributions of source code must retain the above copyright
  *    notice, this list of conditions and the following disclaimer.
  * 2. Redistributions in binary form must reproduce the above copyright
  *    notice, this list of conditions and the following disclaimer in the
  *    documentation and/or other materials provided with the distribution.
  *
  * THIS SOFTWARE IS PROVIDED BY THE AUTHORS AND CONTRIBUTORS ``AS IS'' AND
  * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
  * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
  * ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHORS OR CONTRIBUTORS BE LIABLE
  * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
  * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
  * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
  * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
  * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
  * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
  * SUCH DAMAGE.
  */
 
 #include <sys/cdefs.h>
 __FBSDID("$FreeBSD$");
 
 #include "opt_hwpmc_hooks.h"
 
 #include <sys/types.h>
 #include <sys/pmc.h>
 #include <sys/pmckern.h>
 #include <sys/smp.h>
 #include <sys/sysctl.h>
 
 #ifdef	HWPMC_HOOKS
 FEATURE(hwpmc_hooks, "Kernel support for HW PMC");
 #define	PMC_KERNEL_VERSION	PMC_VERSION
 #else
 #define	PMC_KERNEL_VERSION	0
 #endif
 
 const int pmc_kernel_version = PMC_KERNEL_VERSION;
 
 /* Hook variable. */
 int (*pmc_hook)(struct thread *td, int function, void *arg) = NULL;
 
 /* Interrupt handler */
 int (*pmc_intr)(int cpu, struct trapframe *tf) = NULL;
 
 /* Bitmask of CPUs requiring servicing at hardclock time */
-volatile cpumask_t pmc_cpumask;
+volatile cpuset_t pmc_cpumask;
 
 /*
  * A global count of SS mode PMCs.  When non-zero, this means that
  * we have processes that are sampling the system as a whole.
  */
 volatile int pmc_ss_count;
 
 /*
  * Since PMC(4) may not be loaded in the current kernel, the
  * convention followed is that a non-NULL value of 'pmc_hook' implies
  * the presence of this kernel module.
  *
  * This requires us to protect 'pmc_hook' with a
  * shared (sx) lock -- thus making the process of calling into PMC(4)
  * somewhat more expensive than a simple 'if' check and indirect call.
  */
 struct sx pmc_sx;
 
 static void
 pmc_init_sx(void)
 {
 	sx_init_flags(&pmc_sx, "pmc-sx", SX_NOWITNESS);
 }
 
 SYSINIT(pmcsx, SI_SUB_LOCK, SI_ORDER_MIDDLE, pmc_init_sx, NULL);
 
 /*
  * Helper functions.
  */
 
 /*
  * A note on the CPU numbering scheme used by the hwpmc(4) driver.
  *
  * CPUs are denoted using numbers in the range 0..[pmc_cpu_max()-1].
  * CPUs could be numbered "sparsely" in this range; the predicate
  * `pmc_cpu_is_present()' is used to test whether a given CPU is
  * physically present.
  *
  * Further, a CPU that is physically present may be administratively
  * disabled or otherwise unavailable for use by hwpmc(4).  The
  * `pmc_cpu_is_active()' predicate tests for CPU usability.  An
  * "active" CPU participates in thread scheduling and can field
  * interrupts raised by PMC hardware.
  *
  * On systems with hyperthreaded CPUs, multiple logical CPUs may share
  * PMC hardware resources.  For such processors one logical CPU is
  * denoted as the primary owner of the in-CPU PMC resources. The
  * pmc_cpu_is_primary() predicate is used to distinguish this primary
  * CPU from the others.
  */
 
 int
 pmc_cpu_is_active(int cpu)
 {
 #ifdef	SMP
 	return (pmc_cpu_is_present(cpu) &&
-	    (hlt_cpus_mask & (1 << cpu)) == 0);
+	    !CPU_ISSET(cpu, &hlt_cpus_mask));
 #else
 	return (1);
 #endif
 }
 
 /* Deprecated. */
 int
 pmc_cpu_is_disabled(int cpu)
 {
 	return (!pmc_cpu_is_active(cpu));
 }
 
 int
 pmc_cpu_is_present(int cpu)
 {
 #ifdef	SMP
 	return (!CPU_ABSENT(cpu));
 #else
 	return (1);
 #endif
 }
 
 int
 pmc_cpu_is_primary(int cpu)
 {
 #ifdef	SMP
-	return ((logical_cpus_mask & (1 << cpu)) == 0);
+	return (!CPU_ISSET(cpu, &logical_cpus_mask));
 #else
 	return (1);
 #endif
 }
 
 
 /*
  * Return the maximum CPU number supported by the system.  The return
  * value is used for scaling internal data structures and for runtime
  * checks.
  */
 unsigned int
 pmc_cpu_max(void)
 {
 #ifdef	SMP
 	return (mp_maxid+1);
 #else
 	return (1);
 #endif
 }
 
 #ifdef	INVARIANTS
 
 /*
  * Return the count of CPUs in the `active' state in the system.
  */
 int
 pmc_cpu_max_active(void)
 {
 #ifdef	SMP
 	/*
 	 * When support for CPU hot-plugging is added to the kernel,
 	 * this function would change to return the current number
 	 * of "active" CPUs.
 	 */
 	return (mp_ncpus);
 #else
 	return (1);
 #endif
 }
 
 #endif
Index: head/sys/kern/kern_rmlock.c
===================================================================
--- head/sys/kern/kern_rmlock.c	(revision 222812)
+++ head/sys/kern/kern_rmlock.c	(revision 222813)
@@ -1,588 +1,589 @@
 /*-
  * Copyright (c) 2007 Stephan Uphoff <ups@FreeBSD.org>
  * All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
  * 1. Redistributions of source code must retain the above copyright
  *    notice, this list of conditions and the following disclaimer.
  * 2. Redistributions in binary form must reproduce the above copyright
  *    notice, this list of conditions and the following disclaimer in the
  *    documentation and/or other materials provided with the distribution.
  * 3. Neither the name of the author nor the names of any co-contributors
  *    may be used to endorse or promote products derived from this software
  *    without specific prior written permission.
  *
  * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
  * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
  * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
  * ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
  * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
  * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
  * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
  * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
  * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
  * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
  * SUCH DAMAGE.
  */
 
 /*
  * Machine independent bits of reader/writer lock implementation.
  */
 
 #include <sys/cdefs.h>
 __FBSDID("$FreeBSD$");
 
 #include "opt_ddb.h"
 #include "opt_kdtrace.h"
 
 #include <sys/param.h>
 #include <sys/systm.h>
 
 #include <sys/kernel.h>
 #include <sys/ktr.h>
 #include <sys/lock.h>
 #include <sys/mutex.h>
 #include <sys/proc.h>
 #include <sys/rmlock.h>
 #include <sys/sched.h>
 #include <sys/smp.h>
 #include <sys/turnstile.h>
 #include <sys/lock_profile.h>
 #include <machine/cpu.h>
 
 #ifdef DDB
 #include <ddb/ddb.h>
 #endif
 
 #define RMPF_ONQUEUE	1
 #define RMPF_SIGNAL	2
 
 /*
  * To support usage of rmlock in CVs and msleep yet another list for the
  * priority tracker would be needed.  Using this lock for cv and msleep also
  * does not seem very useful
  */
 
 static __inline void compiler_memory_barrier(void) {
 	__asm __volatile("":::"memory");
 }
 
 static void	assert_rm(struct lock_object *lock, int what);
 static void	lock_rm(struct lock_object *lock, int how);
 #ifdef KDTRACE_HOOKS
 static int	owner_rm(struct lock_object *lock, struct thread **owner);
 #endif
 static int	unlock_rm(struct lock_object *lock);
 
 struct lock_class lock_class_rm = {
 	.lc_name = "rm",
 	.lc_flags = LC_SLEEPLOCK | LC_RECURSABLE,
 	.lc_assert = assert_rm,
 #if 0
 #ifdef DDB
 	.lc_ddb_show = db_show_rwlock,
 #endif
 #endif
 	.lc_lock = lock_rm,
 	.lc_unlock = unlock_rm,
 #ifdef KDTRACE_HOOKS
 	.lc_owner = owner_rm,
 #endif
 };
 
 static void
 assert_rm(struct lock_object *lock, int what)
 {
 
 	panic("assert_rm called");
 }
 
 static void
 lock_rm(struct lock_object *lock, int how)
 {
 
 	panic("lock_rm called");
 }
 
 static int
 unlock_rm(struct lock_object *lock)
 {
 
 	panic("unlock_rm called");
 }
 
 #ifdef KDTRACE_HOOKS
 static int
 owner_rm(struct lock_object *lock, struct thread **owner)
 {
 
 	panic("owner_rm called");
 }
 #endif
 
 static struct mtx rm_spinlock;
 
 MTX_SYSINIT(rm_spinlock, &rm_spinlock, "rm_spinlock", MTX_SPIN);
 
 /*
  * Add or remove tracker from per-cpu list.
  *
  * The per-cpu list can be traversed at any time in forward direction from an
  * interrupt on the *local* cpu.
  */
 static void inline
 rm_tracker_add(struct pcpu *pc, struct rm_priotracker *tracker)
 {
 	struct rm_queue *next;
 
 	/* Initialize all tracker pointers */
 	tracker->rmp_cpuQueue.rmq_prev = &pc->pc_rm_queue;
 	next = pc->pc_rm_queue.rmq_next;
 	tracker->rmp_cpuQueue.rmq_next = next;
 
 	/* rmq_prev is not used during forward traversal. */
 	next->rmq_prev = &tracker->rmp_cpuQueue;
 
 	/* Update pointer to first element. */
 	pc->pc_rm_queue.rmq_next = &tracker->rmp_cpuQueue;
 }
 
 static void inline
 rm_tracker_remove(struct pcpu *pc, struct rm_priotracker *tracker)
 {
 	struct rm_queue *next, *prev;
 
 	next = tracker->rmp_cpuQueue.rmq_next;
 	prev = tracker->rmp_cpuQueue.rmq_prev;
 
 	/* Not used during forward traversal. */
 	next->rmq_prev = prev;
 
 	/* Remove from list. */
 	prev->rmq_next = next;
 }
 
 static void
 rm_cleanIPI(void *arg)
 {
 	struct pcpu *pc;
 	struct rmlock *rm = arg;
 	struct rm_priotracker *tracker;
 	struct rm_queue *queue;
 	pc = pcpu_find(curcpu);
 
 	for (queue = pc->pc_rm_queue.rmq_next; queue != &pc->pc_rm_queue;
 	    queue = queue->rmq_next) {
 		tracker = (struct rm_priotracker *)queue;
 		if (tracker->rmp_rmlock == rm && tracker->rmp_flags == 0) {
 			tracker->rmp_flags = RMPF_ONQUEUE;
 			mtx_lock_spin(&rm_spinlock);
 			LIST_INSERT_HEAD(&rm->rm_activeReaders, tracker,
 			    rmp_qentry);
 			mtx_unlock_spin(&rm_spinlock);
 		}
 	}
 }
 
 CTASSERT((RM_SLEEPABLE & LO_CLASSFLAGS) == RM_SLEEPABLE);
 
 void
 rm_init_flags(struct rmlock *rm, const char *name, int opts)
 {
 	int liflags;
 
 	liflags = 0;
 	if (!(opts & RM_NOWITNESS))
 		liflags |= LO_WITNESS;
 	if (opts & RM_RECURSE)
 		liflags |= LO_RECURSABLE;
 	rm->rm_writecpus = all_cpus;
 	LIST_INIT(&rm->rm_activeReaders);
 	if (opts & RM_SLEEPABLE) {
 		liflags |= RM_SLEEPABLE;
 		sx_init_flags(&rm->rm_lock_sx, "rmlock_sx", SX_RECURSE);
 	} else
 		mtx_init(&rm->rm_lock_mtx, name, "rmlock_mtx", MTX_NOWITNESS);
 	lock_init(&rm->lock_object, &lock_class_rm, name, NULL, liflags);
 }
 
 void
 rm_init(struct rmlock *rm, const char *name)
 {
 
 	rm_init_flags(rm, name, 0);
 }
 
 void
 rm_destroy(struct rmlock *rm)
 {
 
 	if (rm->lock_object.lo_flags & RM_SLEEPABLE)
 		sx_destroy(&rm->rm_lock_sx);
 	else
 		mtx_destroy(&rm->rm_lock_mtx);
 	lock_destroy(&rm->lock_object);
 }
 
 int
 rm_wowned(struct rmlock *rm)
 {
 
 	if (rm->lock_object.lo_flags & RM_SLEEPABLE)
 		return (sx_xlocked(&rm->rm_lock_sx));
 	else
 		return (mtx_owned(&rm->rm_lock_mtx));
 }
 
 void
 rm_sysinit(void *arg)
 {
 	struct rm_args *args = arg;
 
 	rm_init(args->ra_rm, args->ra_desc);
 }
 
 void
 rm_sysinit_flags(void *arg)
 {
 	struct rm_args_flags *args = arg;
 
 	rm_init_flags(args->ra_rm, args->ra_desc, args->ra_opts);
 }
 
 static int
 _rm_rlock_hard(struct rmlock *rm, struct rm_priotracker *tracker, int trylock)
 {
 	struct pcpu *pc;
 	struct rm_queue *queue;
 	struct rm_priotracker *atracker;
 
 	critical_enter();
 	pc = pcpu_find(curcpu);
 
 	/* Check if we just need to do a proper critical_exit. */
-	if (!(pc->pc_cpumask & rm->rm_writecpus)) {
+	if (!CPU_OVERLAP(&pc->pc_cpumask, &rm->rm_writecpus)) {
 		critical_exit();
 		return (1);
 	}
 
 	/* Remove our tracker from the per-cpu list. */
 	rm_tracker_remove(pc, tracker);
 
 	/* Check to see if the IPI granted us the lock after all. */
 	if (tracker->rmp_flags) {
 		/* Just add back tracker - we hold the lock. */
 		rm_tracker_add(pc, tracker);
 		critical_exit();
 		return (1);
 	}
 
 	/*
 	 * We allow readers to acquire a lock even if a writer is blocked if
 	 * the lock is recursive and the reader already holds the lock.
 	 */
 	if ((rm->lock_object.lo_flags & LO_RECURSABLE) != 0) {
 		/*
 		 * Just grant the lock if this thread already has a tracker
 		 * for this lock on the per-cpu queue.
 		 */
 		for (queue = pc->pc_rm_queue.rmq_next;
 		    queue != &pc->pc_rm_queue; queue = queue->rmq_next) {
 			atracker = (struct rm_priotracker *)queue;
 			if ((atracker->rmp_rmlock == rm) &&
 			    (atracker->rmp_thread == tracker->rmp_thread)) {
 				mtx_lock_spin(&rm_spinlock);
 				LIST_INSERT_HEAD(&rm->rm_activeReaders,
 				    tracker, rmp_qentry);
 				tracker->rmp_flags = RMPF_ONQUEUE;
 				mtx_unlock_spin(&rm_spinlock);
 				rm_tracker_add(pc, tracker);
 				critical_exit();
 				return (1);
 			}
 		}
 	}
 
 	sched_unpin();
 	critical_exit();
 
 	if (trylock) {
 		if (rm->lock_object.lo_flags & RM_SLEEPABLE) {
 			if (!sx_try_xlock(&rm->rm_lock_sx))
 				return (0);
 		} else {
 			if (!mtx_trylock(&rm->rm_lock_mtx))
 				return (0);
 		}
 	} else {
 		if (rm->lock_object.lo_flags & RM_SLEEPABLE)
 			sx_xlock(&rm->rm_lock_sx);
 		else
 			mtx_lock(&rm->rm_lock_mtx);
 	}
 
 	critical_enter();
 	pc = pcpu_find(curcpu);
-	rm->rm_writecpus &= ~pc->pc_cpumask;
+	CPU_NAND(&rm->rm_writecpus, &pc->pc_cpumask);
 	rm_tracker_add(pc, tracker);
 	sched_pin();
 	critical_exit();
 
 	if (rm->lock_object.lo_flags & RM_SLEEPABLE)
 		sx_xunlock(&rm->rm_lock_sx);
 	else
 		mtx_unlock(&rm->rm_lock_mtx);
 
 	return (1);
 }
 
 int
 _rm_rlock(struct rmlock *rm, struct rm_priotracker *tracker, int trylock)
 {
 	struct thread *td = curthread;
 	struct pcpu *pc;
 
 	tracker->rmp_flags  = 0;
 	tracker->rmp_thread = td;
 	tracker->rmp_rmlock = rm;
 
 	td->td_critnest++;	/* critical_enter(); */
 
 	compiler_memory_barrier();
 
 	pc = cpuid_to_pcpu[td->td_oncpu]; /* pcpu_find(td->td_oncpu); */
 
 	rm_tracker_add(pc, tracker);
 
 	sched_pin();
 
 	compiler_memory_barrier();
 
 	td->td_critnest--;
 
 	/*
 	 * Fast path to combine two common conditions into a single
 	 * conditional jump.
 	 */
-	if (0 == (td->td_owepreempt | (rm->rm_writecpus & pc->pc_cpumask)))
+	if (0 == (td->td_owepreempt |
+	    CPU_OVERLAP(&rm->rm_writecpus,  &pc->pc_cpumask)))
 		return (1);
 
 	/* We do not have a read token and need to acquire one. */
 	return _rm_rlock_hard(rm, tracker, trylock);
 }
 
 static void
 _rm_unlock_hard(struct thread *td,struct rm_priotracker *tracker)
 {
 
 	if (td->td_owepreempt) {
 		td->td_critnest++;
 		critical_exit();
 	}
 
 	if (!tracker->rmp_flags)
 		return;
 
 	mtx_lock_spin(&rm_spinlock);
 	LIST_REMOVE(tracker, rmp_qentry);
 
 	if (tracker->rmp_flags & RMPF_SIGNAL) {
 		struct rmlock *rm;
 		struct turnstile *ts;
 
 		rm = tracker->rmp_rmlock;
 
 		turnstile_chain_lock(&rm->lock_object);
 		mtx_unlock_spin(&rm_spinlock);
 
 		ts = turnstile_lookup(&rm->lock_object);
 
 		turnstile_signal(ts, TS_EXCLUSIVE_QUEUE);
 		turnstile_unpend(ts, TS_EXCLUSIVE_LOCK);
 		turnstile_chain_unlock(&rm->lock_object);
 	} else
 		mtx_unlock_spin(&rm_spinlock);
 }
 
 void
 _rm_runlock(struct rmlock *rm, struct rm_priotracker *tracker)
 {
 	struct pcpu *pc;
 	struct thread *td = tracker->rmp_thread;
 
 	td->td_critnest++;	/* critical_enter(); */
 	pc = cpuid_to_pcpu[td->td_oncpu]; /* pcpu_find(td->td_oncpu); */
 	rm_tracker_remove(pc, tracker);
 	td->td_critnest--;
 	sched_unpin();
 
 	if (0 == (td->td_owepreempt | tracker->rmp_flags))
 		return;
 
 	_rm_unlock_hard(td, tracker);
 }
 
 void
 _rm_wlock(struct rmlock *rm)
 {
 	struct rm_priotracker *prio;
 	struct turnstile *ts;
-	cpumask_t readcpus;
+	cpuset_t readcpus;
 
 	if (rm->lock_object.lo_flags & RM_SLEEPABLE)
 		sx_xlock(&rm->rm_lock_sx);
 	else
 		mtx_lock(&rm->rm_lock_mtx);
 
-	if (rm->rm_writecpus != all_cpus) {
+	if (CPU_CMP(&rm->rm_writecpus, &all_cpus)) {
 		/* Get all read tokens back */
-
-		readcpus = all_cpus & (all_cpus & ~rm->rm_writecpus);
+		readcpus = all_cpus;
+		CPU_NAND(&readcpus, &rm->rm_writecpus);
 		rm->rm_writecpus = all_cpus;
 
 		/*
 		 * Assumes rm->rm_writecpus update is visible on other CPUs
 		 * before rm_cleanIPI is called.
 		 */
 #ifdef SMP
 		smp_rendezvous_cpus(readcpus,
 		    smp_no_rendevous_barrier,
 		    rm_cleanIPI,
 		    smp_no_rendevous_barrier,
 		    rm);
 
 #else
 		rm_cleanIPI(rm);
 #endif
 
 		mtx_lock_spin(&rm_spinlock);
 		while ((prio = LIST_FIRST(&rm->rm_activeReaders)) != NULL) {
 			ts = turnstile_trywait(&rm->lock_object);
 			prio->rmp_flags = RMPF_ONQUEUE | RMPF_SIGNAL;
 			mtx_unlock_spin(&rm_spinlock);
 			turnstile_wait(ts, prio->rmp_thread,
 			    TS_EXCLUSIVE_QUEUE);
 			mtx_lock_spin(&rm_spinlock);
 		}
 		mtx_unlock_spin(&rm_spinlock);
 	}
 }
 
 void
 _rm_wunlock(struct rmlock *rm)
 {
 
 	if (rm->lock_object.lo_flags & RM_SLEEPABLE)
 		sx_xunlock(&rm->rm_lock_sx);
 	else
 		mtx_unlock(&rm->rm_lock_mtx);
 }
 
 #ifdef LOCK_DEBUG
 
 void _rm_wlock_debug(struct rmlock *rm, const char *file, int line)
 {
 
 	WITNESS_CHECKORDER(&rm->lock_object, LOP_NEWORDER | LOP_EXCLUSIVE,
 	    file, line, NULL);
 
 	_rm_wlock(rm);
 
 	LOCK_LOG_LOCK("RMWLOCK", &rm->lock_object, 0, 0, file, line);
 
 	if (rm->lock_object.lo_flags & RM_SLEEPABLE)
 		WITNESS_LOCK(&rm->rm_lock_sx.lock_object, LOP_EXCLUSIVE,
 		    file, line);	
 	else
 		WITNESS_LOCK(&rm->lock_object, LOP_EXCLUSIVE, file, line);
 
 	curthread->td_locks++;
 
 }
 
 void
 _rm_wunlock_debug(struct rmlock *rm, const char *file, int line)
 {
 
 	curthread->td_locks--;
 	if (rm->lock_object.lo_flags & RM_SLEEPABLE)
 		WITNESS_UNLOCK(&rm->rm_lock_sx.lock_object, LOP_EXCLUSIVE,
 		    file, line);
 	else
 		WITNESS_UNLOCK(&rm->lock_object, LOP_EXCLUSIVE, file, line);
 	LOCK_LOG_LOCK("RMWUNLOCK", &rm->lock_object, 0, 0, file, line);
 	_rm_wunlock(rm);
 }
 
 int
 _rm_rlock_debug(struct rmlock *rm, struct rm_priotracker *tracker,
     int trylock, const char *file, int line)
 {
 	if (!trylock && (rm->lock_object.lo_flags & RM_SLEEPABLE))
 		WITNESS_CHECKORDER(&rm->rm_lock_sx.lock_object, LOP_NEWORDER,
 		    file, line, NULL);
 	WITNESS_CHECKORDER(&rm->lock_object, LOP_NEWORDER, file, line, NULL);
 
 	if (_rm_rlock(rm, tracker, trylock)) {
 		LOCK_LOG_LOCK("RMRLOCK", &rm->lock_object, 0, 0, file, line);
 
 		WITNESS_LOCK(&rm->lock_object, 0, file, line);
 
 		curthread->td_locks++;
 
 		return (1);
 	}
 
 	return (0);
 }
 
 void
 _rm_runlock_debug(struct rmlock *rm, struct rm_priotracker *tracker,
     const char *file, int line)
 {
 
 	curthread->td_locks--;
 	WITNESS_UNLOCK(&rm->lock_object, 0, file, line);
 	LOCK_LOG_LOCK("RMRUNLOCK", &rm->lock_object, 0, 0, file, line);
 	_rm_runlock(rm, tracker);
 }
 
 #else
 
 /*
  * Just strip out file and line arguments if no lock debugging is enabled in
  * the kernel - we are called from a kernel module.
  */
 void
 _rm_wlock_debug(struct rmlock *rm, const char *file, int line)
 {
 
 	_rm_wlock(rm);
 }
 
 void
 _rm_wunlock_debug(struct rmlock *rm, const char *file, int line)
 {
 
 	_rm_wunlock(rm);
 }
 
 int
 _rm_rlock_debug(struct rmlock *rm, struct rm_priotracker *tracker,
     int trylock, const char *file, int line)
 {
 
 	return _rm_rlock(rm, tracker, trylock);
 }
 
 void
 _rm_runlock_debug(struct rmlock *rm, struct rm_priotracker *tracker,
     const char *file, int line)
 {
 
 	_rm_runlock(rm, tracker);
 }
 
 #endif
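
Note on the kern_rmlock.c hunks above: the integer expressions `a & b`, `a &= ~b`, and `a != b` are replaced by `CPU_OVERLAP`, `CPU_NAND`, and `CPU_CMP`. The sketch below models those three operations in userland, as they are used in this diff, to show that `readcpus = all_cpus; CPU_NAND(&readcpus, &rm->rm_writecpus);` computes the same set as the old `all_cpus & ~rm->rm_writecpus`. It is an illustration under assumptions, not the kernel's implementation: `demo_cpuset_t`, `DEMO_MAXCPU`, and every `demo_*` function are hypothetical names.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for the kernel's cpuset_t and CPU_* macros. */
#define	DEMO_MAXCPU	128
#define	DEMO_WORD_BITS	(sizeof(unsigned long) * CHAR_BIT)
#define	DEMO_NWORDS	((DEMO_MAXCPU + DEMO_WORD_BITS - 1) / DEMO_WORD_BITS)

typedef struct {
	unsigned long bits[DEMO_NWORDS];
} demo_cpuset_t;

/* Add one CPU to the set. */
static void
demo_cpu_set(int cpu, demo_cpuset_t *set)
{

	assert(cpu >= 0 && cpu < DEMO_MAXCPU);
	set->bits[cpu / DEMO_WORD_BITS] |= 1UL << (cpu % DEMO_WORD_BITS);
}

/* dst &= ~src, word by word (the way the diff uses CPU_NAND). */
static void
demo_cpu_nand(demo_cpuset_t *dst, const demo_cpuset_t *src)
{
	size_t i;

	for (i = 0; i < DEMO_NWORDS; i++)
		dst->bits[i] &= ~src->bits[i];
}

/* True when the two sets share at least one CPU (CPU_OVERLAP's role). */
static bool
demo_cpu_overlap(const demo_cpuset_t *a, const demo_cpuset_t *b)
{
	size_t i;

	for (i = 0; i < DEMO_NWORDS; i++)
		if ((a->bits[i] & b->bits[i]) != 0)
			return (true);
	return (false);
}

/* True when the sets differ, so it reads like the old `a != b` (CPU_CMP's role). */
static bool
demo_cpu_differ(const demo_cpuset_t *a, const demo_cpuset_t *b)
{

	return (memcmp(a->bits, b->bits, sizeof(a->bits)) != 0);
}

/* CPUs holding read tokens: all & ~writecpus, as computed in _rm_wlock(). */
static demo_cpuset_t
demo_read_token_holders(const demo_cpuset_t *all,
    const demo_cpuset_t *writecpus)
{
	demo_cpuset_t readcpus = *all;

	demo_cpu_nand(&readcpus, writecpus);
	return (readcpus);
}
```

Copying `all_cpus` into a local and then clearing bits keeps the operation expressible with in-place set primitives only, which is presumably why the diff splits the old one-line expression into two statements.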
Index: head/sys/kern/sched_4bsd.c
===================================================================
--- head/sys/kern/sched_4bsd.c	(revision 222812)
+++ head/sys/kern/sched_4bsd.c	(revision 222813)
@@ -1,1665 +1,1687 @@
 /*-
  * Copyright (c) 1982, 1986, 1990, 1991, 1993
  *	The Regents of the University of California.  All rights reserved.
  * (c) UNIX System Laboratories, Inc.
  * All or some portions of this file are derived from material licensed
  * to the University of California by American Telephone and Telegraph
  * Co. or Unix System Laboratories, Inc. and are reproduced herein with
  * the permission of UNIX System Laboratories, Inc.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
  * 1. Redistributions of source code must retain the above copyright
  *    notice, this list of conditions and the following disclaimer.
  * 2. Redistributions in binary form must reproduce the above copyright
  *    notice, this list of conditions and the following disclaimer in the
  *    documentation and/or other materials provided with the distribution.
  * 4. Neither the name of the University nor the names of its contributors
  *    may be used to endorse or promote products derived from this software
  *    without specific prior written permission.
  *
  * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
  * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
  * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
  * ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
  * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
  * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
  * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
  * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
  * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
  * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
  * SUCH DAMAGE.
  */
 
 #include <sys/cdefs.h>
 __FBSDID("$FreeBSD$");
 
 #include "opt_hwpmc_hooks.h"
 #include "opt_sched.h"
 #include "opt_kdtrace.h"
 
 #include <sys/param.h>
 #include <sys/systm.h>
 #include <sys/cpuset.h>
 #include <sys/kernel.h>
 #include <sys/ktr.h>
 #include <sys/lock.h>
 #include <sys/kthread.h>
 #include <sys/mutex.h>
 #include <sys/proc.h>
 #include <sys/resourcevar.h>
 #include <sys/sched.h>
 #include <sys/smp.h>
 #include <sys/sysctl.h>
 #include <sys/sx.h>
 #include <sys/turnstile.h>
 #include <sys/umtx.h>
 #include <machine/pcb.h>
 #include <machine/smp.h>
 
 #ifdef HWPMC_HOOKS
 #include <sys/pmckern.h>
 #endif
 
 #ifdef KDTRACE_HOOKS
 #include <sys/dtrace_bsd.h>
 int				dtrace_vtime_active;
 dtrace_vtime_switch_func_t	dtrace_vtime_switch_func;
 #endif
 
 /*
  * INVERSE_ESTCPU_WEIGHT is only suitable for statclock() frequencies in
  * the range 100-256 Hz (approximately).
  */
 #define	ESTCPULIM(e) \
     min((e), INVERSE_ESTCPU_WEIGHT * (NICE_WEIGHT * (PRIO_MAX - PRIO_MIN) - \
     RQ_PPQ) + INVERSE_ESTCPU_WEIGHT - 1)
 #ifdef SMP
 #define	INVERSE_ESTCPU_WEIGHT	(8 * smp_cpus)
 #else
 #define	INVERSE_ESTCPU_WEIGHT	8	/* 1 / (priorities per estcpu level). */
 #endif
 #define	NICE_WEIGHT		1	/* Priorities per nice level. */
 
 #define	TS_NAME_LEN (MAXCOMLEN + sizeof(" td ") + sizeof(__XSTRING(UINT_MAX)))
 
 /*
  * The schedulable entity that runs a context.
  * This is  an extension to the thread structure and is tailored to
  * the requirements of this scheduler
  */
 struct td_sched {
 	fixpt_t		ts_pctcpu;	/* (j) %cpu during p_swtime. */
 	int		ts_cpticks;	/* (j) Ticks of cpu time. */
 	int		ts_slptime;	/* (j) Seconds !RUNNING. */
 	int		ts_flags;
 	struct runq	*ts_runq;	/* runq the thread is currently on */
 #ifdef KTR
 	char		ts_name[TS_NAME_LEN];
 #endif
 };
 
 /* flags kept in td_flags */
 #define TDF_DIDRUN	TDF_SCHED0	/* thread actually ran. */
 #define TDF_BOUND	TDF_SCHED1	/* Bound to one CPU. */
 
 /* flags kept in ts_flags */
 #define	TSF_AFFINITY	0x0001		/* Has a non-"full" CPU set. */
 
 #define SKE_RUNQ_PCPU(ts)						\
     ((ts)->ts_runq != 0 && (ts)->ts_runq != &runq)
 
 #define	THREAD_CAN_SCHED(td, cpu)	\
     CPU_ISSET((cpu), &(td)->td_cpuset->cs_mask)
 
 static struct td_sched td_sched0;
 struct mtx sched_lock;
 
 static int	sched_tdcnt;	/* Total runnable threads in the system. */
 static int	sched_quantum;	/* Roundrobin scheduling quantum in ticks. */
 #define	SCHED_QUANTUM	(hz / 10)	/* Default sched quantum */
 
 static void	setup_runqs(void);
 static void	schedcpu(void);
 static void	schedcpu_thread(void);
 static void	sched_priority(struct thread *td, u_char prio);
 static void	sched_setup(void *dummy);
 static void	maybe_resched(struct thread *td);
 static void	updatepri(struct thread *td);
 static void	resetpriority(struct thread *td);
 static void	resetpriority_thread(struct thread *td);
 #ifdef SMP
 static int	sched_pickcpu(struct thread *td);
 static int	forward_wakeup(int cpunum);
 static void	kick_other_cpu(int pri, int cpuid);
 #endif
 
 static struct kproc_desc sched_kp = {
         "schedcpu",
         schedcpu_thread,
         NULL
 };
 SYSINIT(schedcpu, SI_SUB_RUN_SCHEDULER, SI_ORDER_FIRST, kproc_start,
     &sched_kp);
 SYSINIT(sched_setup, SI_SUB_RUN_QUEUE, SI_ORDER_FIRST, sched_setup, NULL);
 
 /*
  * Global run queue.
  */
 static struct runq runq;
 
 #ifdef SMP
 /*
  * Per-CPU run queues
  */
 static struct runq runq_pcpu[MAXCPU];
 long runq_length[MAXCPU];
 
-static cpumask_t idle_cpus_mask;
+static cpuset_t idle_cpus_mask;
 #endif
 
 struct pcpuidlestat {
 	u_int idlecalls;
 	u_int oldidlecalls;
 };
 static DPCPU_DEFINE(struct pcpuidlestat, idlestat);
 
 static void
 setup_runqs(void)
 {
 #ifdef SMP
 	int i;
 
 	for (i = 0; i < MAXCPU; ++i)
 		runq_init(&runq_pcpu[i]);
 #endif
 
 	runq_init(&runq);
 }
 
 static int
 sysctl_kern_quantum(SYSCTL_HANDLER_ARGS)
 {
 	int error, new_val;
 
 	new_val = sched_quantum * tick;
 	error = sysctl_handle_int(oidp, &new_val, 0, req);
         if (error != 0 || req->newptr == NULL)
 		return (error);
 	if (new_val < tick)
 		return (EINVAL);
 	sched_quantum = new_val / tick;
 	hogticks = 2 * sched_quantum;
 	return (0);
 }
 
 SYSCTL_NODE(_kern, OID_AUTO, sched, CTLFLAG_RD, 0, "Scheduler");
 
 SYSCTL_STRING(_kern_sched, OID_AUTO, name, CTLFLAG_RD, "4BSD", 0,
     "Scheduler name");
 
 SYSCTL_PROC(_kern_sched, OID_AUTO, quantum, CTLTYPE_INT | CTLFLAG_RW,
     0, sizeof sched_quantum, sysctl_kern_quantum, "I",
     "Roundrobin scheduling quantum in microseconds");
 
 #ifdef SMP
 /* Enable forwarding of wakeups to all other cpus */
 SYSCTL_NODE(_kern_sched, OID_AUTO, ipiwakeup, CTLFLAG_RD, NULL, "Kernel SMP");
 
 static int runq_fuzz = 1;
 SYSCTL_INT(_kern_sched, OID_AUTO, runq_fuzz, CTLFLAG_RW, &runq_fuzz, 0, "");
 
 static int forward_wakeup_enabled = 1;
 SYSCTL_INT(_kern_sched_ipiwakeup, OID_AUTO, enabled, CTLFLAG_RW,
 	   &forward_wakeup_enabled, 0,
 	   "Forwarding of wakeup to idle CPUs");
 
 static int forward_wakeups_requested = 0;
 SYSCTL_INT(_kern_sched_ipiwakeup, OID_AUTO, requested, CTLFLAG_RD,
 	   &forward_wakeups_requested, 0,
 	   "Requests for Forwarding of wakeup to idle CPUs");
 
 static int forward_wakeups_delivered = 0;
 SYSCTL_INT(_kern_sched_ipiwakeup, OID_AUTO, delivered, CTLFLAG_RD,
 	   &forward_wakeups_delivered, 0,
 	   "Completed Forwarding of wakeup to idle CPUs");
 
 static int forward_wakeup_use_mask = 1;
 SYSCTL_INT(_kern_sched_ipiwakeup, OID_AUTO, usemask, CTLFLAG_RW,
 	   &forward_wakeup_use_mask, 0,
 	   "Use the mask of idle cpus");
 
 static int forward_wakeup_use_loop = 0;
 SYSCTL_INT(_kern_sched_ipiwakeup, OID_AUTO, useloop, CTLFLAG_RW,
 	   &forward_wakeup_use_loop, 0,
 	   "Use a loop to find idle cpus");
 
 #endif
 #if 0
 static int sched_followon = 0;
 SYSCTL_INT(_kern_sched, OID_AUTO, followon, CTLFLAG_RW,
 	   &sched_followon, 0,
 	   "allow threads to share a quantum");
 #endif
 
 static __inline void
 sched_load_add(void)
 {
 
 	sched_tdcnt++;
 	KTR_COUNTER0(KTR_SCHED, "load", "global load", sched_tdcnt);
 }
 
 static __inline void
 sched_load_rem(void)
 {
 
 	sched_tdcnt--;
 	KTR_COUNTER0(KTR_SCHED, "load", "global load", sched_tdcnt);
 }
 /*
  * Arrange to reschedule if necessary, taking the priorities and
  * schedulers into account.
  */
 static void
 maybe_resched(struct thread *td)
 {
 
 	THREAD_LOCK_ASSERT(td, MA_OWNED);
 	if (td->td_priority < curthread->td_priority)
 		curthread->td_flags |= TDF_NEEDRESCHED;
 }
 
 /*
  * This function is called when a thread is about to be put on run queue
  * because it has been made runnable or its priority has been adjusted.  It
  * determines if the new thread should be immediately preempted to.  If so,
  * it switches to it and eventually returns true.  If not, it returns false
  * so that the caller may place the thread on an appropriate run queue.
  */
 int
 maybe_preempt(struct thread *td)
 {
 #ifdef PREEMPTION
 	struct thread *ctd;
 	int cpri, pri;
 
 	/*
 	 * The new thread should not preempt the current thread if any of the
 	 * following conditions are true:
 	 *
 	 *  - The kernel is in the throes of crashing (panicstr).
 	 *  - The current thread has a higher (numerically lower) or
 	 *    equivalent priority.  Note that this prevents curthread from
 	 *    trying to preempt to itself.
 	 *  - It is too early in the boot for context switches (cold is set).
 	 *  - The current thread has an inhibitor set or is in the process of
 	 *    exiting.  In this case, the current thread is about to switch
 	 *    out anyways, so there's no point in preempting.  If we did,
 	 *    the current thread would not be properly resumed as well, so
 	 *    just avoid that whole landmine.
 	 *  - If the new thread's priority is not a realtime priority and
 	 *    the current thread's priority is not an idle priority and
 	 *    FULL_PREEMPTION is disabled.
 	 *
 	 * If all of these conditions are false, but the current thread is in
 	 * a nested critical section, then we have to defer the preemption
 	 * until we exit the critical section.  Otherwise, switch immediately
 	 * to the new thread.
 	 */
 	ctd = curthread;
 	THREAD_LOCK_ASSERT(td, MA_OWNED);
 	KASSERT((td->td_inhibitors == 0),
 			("maybe_preempt: trying to run inhibited thread"));
 	pri = td->td_priority;
 	cpri = ctd->td_priority;
 	if (panicstr != NULL || pri >= cpri || cold /* || dumping */ ||
 	    TD_IS_INHIBITED(ctd))
 		return (0);
 #ifndef FULL_PREEMPTION
 	if (pri > PRI_MAX_ITHD && cpri < PRI_MIN_IDLE)
 		return (0);
 #endif
 
 	if (ctd->td_critnest > 1) {
 		CTR1(KTR_PROC, "maybe_preempt: in critical section %d",
 		    ctd->td_critnest);
 		ctd->td_owepreempt = 1;
 		return (0);
 	}
 	/*
 	 * Thread is runnable but not yet put on system run queue.
 	 */
 	MPASS(ctd->td_lock == td->td_lock);
 	MPASS(TD_ON_RUNQ(td));
 	TD_SET_RUNNING(td);
 	CTR3(KTR_PROC, "preempting to thread %p (pid %d, %s)\n", td,
 	    td->td_proc->p_pid, td->td_name);
 	mi_switch(SW_INVOL | SW_PREEMPT | SWT_PREEMPT, td);
 	/*
 	 * td's lock pointer may have changed.  We have to return with it
 	 * locked.
 	 */
 	spinlock_enter();
 	thread_unlock(ctd);
 	thread_lock(td);
 	spinlock_exit();
 	return (1);
 #else
 	return (0);
 #endif
 }
 
 /*
  * Constants for digital decay and forget:
  *	90% of (td_estcpu) usage in 5 * loadav time
  *	95% of (ts_pctcpu) usage in 60 seconds (load insensitive)
  *          Note that, as ps(1) mentions, this can let percentages
  *          total over 100% (I've seen 137.9% for 3 processes).
  *
  * Note that schedclock() updates td_estcpu and p_cpticks asynchronously.
  *
  * We wish to decay away 90% of td_estcpu in (5 * loadavg) seconds.
  * That is, the system wants to compute a value of decay such
  * that the following for loop:
  * 	for (i = 0; i < (5 * loadavg); i++)
  * 		td_estcpu *= decay;
  * will compute
  * 	td_estcpu *= 0.1;
  * for all values of loadavg:
  *
  * Mathematically this loop can be expressed by saying:
  * 	decay ** (5 * loadavg) ~= .1
  *
  * The system computes decay as:
  * 	decay = (2 * loadavg) / (2 * loadavg + 1)
  *
  * We wish to prove that the system's computation of decay
  * will always fulfill the equation:
  * 	decay ** (5 * loadavg) ~= .1
  *
  * If we compute b as:
  * 	b = 2 * loadavg
  * then
  * 	decay = b / (b + 1)
  *
  * We now need to prove two things:
  *	1) Given factor ** (5 * loadavg) ~= .1, prove factor == b/(b+1)
  *	2) Given b/(b+1) ** power ~= .1, prove power == (5 * loadavg)
  *
  * Facts:
  *         For x close to zero, exp(x) =~ 1 + x, since
  *              exp(x) = 0! + x**1/1! + x**2/2! + ... .
  *              therefore exp(-1/b) =~ 1 - (1/b) = (b-1)/b.
  *         For x close to zero, ln(1+x) =~ x, since
  *              ln(1+x) = x - x**2/2 + x**3/3 - ...     -1 < x < 1
  *              therefore ln(b/(b+1)) = ln(1 - 1/(b+1)) =~ -1/(b+1).
  *         ln(.1) =~ -2.30
  *
  * Proof of (1):
  *    Solve (factor)**(power) =~ .1 given power (5*loadav):
  *	solving for factor,
  *      ln(factor) =~ (-2.30/5*loadav), or
  *      factor =~ exp(-1/((5/2.30)*loadav)) =~ exp(-1/(2*loadav)) =
  *          exp(-1/b) =~ (b-1)/b =~ b/(b+1).                    QED
  *
  * Proof of (2):
  *    Solve (factor)**(power) =~ .1 given factor == (b/(b+1)):
  *	solving for power,
  *      power*ln(b/(b+1)) =~ -2.30, or
  *      power =~ 2.3 * (b + 1) = 4.6*loadav + 2.3 =~ 5*loadav.  QED
  *
  * Actual power values for the implemented algorithm are as follows:
  *      loadav: 1       2       3       4
  *      power:  5.68    10.32   14.94   19.55
  */
 
 /* calculations for digital decay to forget 90% of usage in 5*loadav sec */
 #define	loadfactor(loadav)	(2 * (loadav))
 #define	decay_cpu(loadfac, cpu)	(((loadfac) * (cpu)) / ((loadfac) + FSCALE))
 
 /* decay 95% of `ts_pctcpu' in 60 seconds; see CCPU_SHIFT before changing */
 static fixpt_t	ccpu = 0.95122942450071400909 * FSCALE;	/* exp(-1/20) */
 SYSCTL_UINT(_kern, OID_AUTO, ccpu, CTLFLAG_RD, &ccpu, 0, "");
 
 /*
  * If `ccpu' is not equal to `exp(-1/20)' and you still want to use the
  * faster/more-accurate formula, you'll have to estimate CCPU_SHIFT below
  * and possibly adjust FSHIFT in "param.h" so that (FSHIFT >= CCPU_SHIFT).
  *
  * To estimate CCPU_SHIFT for exp(-1/20), the following formula was used:
  *	1 - exp(-1/20) ~= 0.0487 ~= 0.0488 == 1 (fixed pt, *11* bits).
  *
  * If you don't want to bother with the faster/more-accurate formula, you
  * can set CCPU_SHIFT to (FSHIFT + 1) which will use a slower/less-accurate
  * (more general) method of calculating the %age of CPU used by a process.
  */
 #define	CCPU_SHIFT	11
 
 /*
  * Recompute process priorities, every hz ticks.
  * MP-safe, called without the Giant mutex.
  */
 /* ARGSUSED */
 static void
 schedcpu(void)
 {
 	register fixpt_t loadfac = loadfactor(averunnable.ldavg[0]);
 	struct thread *td;
 	struct proc *p;
 	struct td_sched *ts;
 	int awake, realstathz;
 
 	realstathz = stathz ? stathz : hz;
 	sx_slock(&allproc_lock);
 	FOREACH_PROC_IN_SYSTEM(p) {
 		PROC_LOCK(p);
 		if (p->p_state == PRS_NEW) {
 			PROC_UNLOCK(p);
 			continue;
 		}
 		FOREACH_THREAD_IN_PROC(p, td) {
 			awake = 0;
 			thread_lock(td);
 			ts = td->td_sched;
 			/*
 			 * Increment sleep time (if sleeping).  We
 			 * ignore overflow, as above.
 			 */
 			/*
 			 * The td_sched slptimes are not touched in wakeup
 			 * because the thread may not HAVE everything in
 			 * memory? XXX I think this is out of date.
 			 */
 			if (TD_ON_RUNQ(td)) {
 				awake = 1;
 				td->td_flags &= ~TDF_DIDRUN;
 			} else if (TD_IS_RUNNING(td)) {
 				awake = 1;
 				/* Do not clear TDF_DIDRUN */
 			} else if (td->td_flags & TDF_DIDRUN) {
 				awake = 1;
 				td->td_flags &= ~TDF_DIDRUN;
 			}
 
 			/*
 			 * ts_pctcpu is only for ps and ttyinfo().
 			 */
 			ts->ts_pctcpu = (ts->ts_pctcpu * ccpu) >> FSHIFT;
 			/*
 			 * If the td_sched has been idle the entire second,
 			 * stop recalculating its priority until
 			 * it wakes up.
 			 */
 			if (ts->ts_cpticks != 0) {
 #if	(FSHIFT >= CCPU_SHIFT)
 				ts->ts_pctcpu += (realstathz == 100)
 				    ? ((fixpt_t) ts->ts_cpticks) <<
 				    (FSHIFT - CCPU_SHIFT) :
 				    100 * (((fixpt_t) ts->ts_cpticks)
 				    << (FSHIFT - CCPU_SHIFT)) / realstathz;
 #else
 				ts->ts_pctcpu += ((FSCALE - ccpu) *
 				    (ts->ts_cpticks *
 				    FSCALE / realstathz)) >> FSHIFT;
 #endif
 				ts->ts_cpticks = 0;
 			}
 			/*
 			 * If there are ANY running threads in this process,
 			 * then don't count it as sleeping.
 			 * XXX: this is broken.
 			 */
 			if (awake) {
 				if (ts->ts_slptime > 1) {
 					/*
 					 * In an ideal world, this should not
 					 * happen, because whoever woke us
 					 * up from the long sleep should have
 					 * unwound the slptime and reset our
 					 * priority before we run at the stale
 					 * priority.  Should KASSERT at some
 					 * point when all the cases are fixed.
 					 */
 					updatepri(td);
 				}
 				ts->ts_slptime = 0;
 			} else
 				ts->ts_slptime++;
 			if (ts->ts_slptime > 1) {
 				thread_unlock(td);
 				continue;
 			}
 			td->td_estcpu = decay_cpu(loadfac, td->td_estcpu);
 			resetpriority(td);
 			resetpriority_thread(td);
 			thread_unlock(td);
 		}
 		PROC_UNLOCK(p);
 	}
 	sx_sunlock(&allproc_lock);
 }
 
 /*
  * Main loop for a kthread that executes schedcpu once a second.
  */
 static void
 schedcpu_thread(void)
 {
 
 	for (;;) {
 		schedcpu();
 		pause("-", hz);
 	}
 }
 
 /*
  * Recalculate the priority of a process after it has slept for a while.
  * For all load averages >= 1 and max td_estcpu of 255, sleeping for at
  * least six times the loadfactor will decay td_estcpu to zero.
  */
 static void
 updatepri(struct thread *td)
 {
 	struct td_sched *ts;
 	fixpt_t loadfac;
 	unsigned int newcpu;
 
 	ts = td->td_sched;
 	loadfac = loadfactor(averunnable.ldavg[0]);
 	if (ts->ts_slptime > 5 * loadfac)
 		td->td_estcpu = 0;
 	else {
 		newcpu = td->td_estcpu;
 		ts->ts_slptime--;	/* was incremented in schedcpu() */
 		while (newcpu && --ts->ts_slptime)
 			newcpu = decay_cpu(loadfac, newcpu);
 		td->td_estcpu = newcpu;
 	}
 }
 
 /*
  * Compute the priority of a process when running in user mode.
  * Arrange to reschedule if the resulting priority is better
  * than that of the current process.
  */
 static void
 resetpriority(struct thread *td)
 {
 	register unsigned int newpriority;
 
 	if (td->td_pri_class == PRI_TIMESHARE) {
 		newpriority = PUSER + td->td_estcpu / INVERSE_ESTCPU_WEIGHT +
 		    NICE_WEIGHT * (td->td_proc->p_nice - PRIO_MIN);
 		newpriority = min(max(newpriority, PRI_MIN_TIMESHARE),
 		    PRI_MAX_TIMESHARE);
 		sched_user_prio(td, newpriority);
 	}
 }
 
 /*
  * Update the thread's priority when the associated process's user
  * priority changes.
  */
 static void
 resetpriority_thread(struct thread *td)
 {
 
 	/* Only change threads with a time sharing user priority. */
 	if (td->td_priority < PRI_MIN_TIMESHARE ||
 	    td->td_priority > PRI_MAX_TIMESHARE)
 		return;
 
 	/* XXX the whole needresched thing is broken, but not silly. */
 	maybe_resched(td);
 
 	sched_prio(td, td->td_user_pri);
 }
 
 /* ARGSUSED */
 static void
 sched_setup(void *dummy)
 {
 	setup_runqs();
 
 	if (sched_quantum == 0)
 		sched_quantum = SCHED_QUANTUM;
 	hogticks = 2 * sched_quantum;
 
 	/* Account for thread0. */
 	sched_load_add();
 }
 
 /* External interfaces start here */
 
 /*
  * Very early in the boot some setup of scheduler-specific
  * parts of proc0 and of some scheduler resources needs to be done.
  * Called from:
  *  proc0_init()
  */
 void
 schedinit(void)
 {
 	/*
 	 * Set up the scheduler specific parts of proc0.
 	 */
 	proc0.p_sched = NULL; /* XXX */
 	thread0.td_sched = &td_sched0;
 	thread0.td_lock = &sched_lock;
 	mtx_init(&sched_lock, "sched lock", NULL, MTX_SPIN | MTX_RECURSE);
 }
 
 int
 sched_runnable(void)
 {
 #ifdef SMP
 	return runq_check(&runq) + runq_check(&runq_pcpu[PCPU_GET(cpuid)]);
 #else
 	return runq_check(&runq);
 #endif
 }
 
 int
 sched_rr_interval(void)
 {
 	if (sched_quantum == 0)
 		sched_quantum = SCHED_QUANTUM;
 	return (sched_quantum);
 }
 
 /*
  * We adjust the priority of the current process.  The priority of
  * a process gets worse as it accumulates CPU time.  The cpu usage
  * estimator (td_estcpu) is increased here.  resetpriority() will
  * compute a different priority each time td_estcpu increases by
  * INVERSE_ESTCPU_WEIGHT
  * (until MAXPRI is reached).  The cpu usage estimator ramps up
  * quite quickly when the process is running (linearly), and decays
  * away exponentially, at a rate which is proportionally slower when
  * the system is busy.  The basic principle is that the system will
  * 90% forget that the process used a lot of CPU time in 5 * loadav
  * seconds.  This causes the system to favor processes which haven't
  * run much recently, and to round-robin among other processes.
  */
 void
 sched_clock(struct thread *td)
 {
 	struct pcpuidlestat *stat;
 	struct td_sched *ts;
 
 	THREAD_LOCK_ASSERT(td, MA_OWNED);
 	ts = td->td_sched;
 
 	ts->ts_cpticks++;
 	td->td_estcpu = ESTCPULIM(td->td_estcpu + 1);
 	if ((td->td_estcpu % INVERSE_ESTCPU_WEIGHT) == 0) {
 		resetpriority(td);
 		resetpriority_thread(td);
 	}
 
 	/*
 	 * Force a context switch if the current thread has used up a full
 	 * quantum (default quantum is 100ms).
 	 */
 	if (!TD_IS_IDLETHREAD(td) &&
 	    ticks - PCPU_GET(switchticks) >= sched_quantum)
 		td->td_flags |= TDF_NEEDRESCHED;
 
 	stat = DPCPU_PTR(idlestat);
 	stat->oldidlecalls = stat->idlecalls;
 	stat->idlecalls = 0;
 }
 
 /*
  * Charge child's scheduling CPU usage to parent.
  */
 void
 sched_exit(struct proc *p, struct thread *td)
 {
 
 	KTR_STATE1(KTR_SCHED, "thread", sched_tdname(td), "proc exit",
 	    "prio:td", td->td_priority);
 
 	PROC_LOCK_ASSERT(p, MA_OWNED);
 	sched_exit_thread(FIRST_THREAD_IN_PROC(p), td);
 }
 
 void
 sched_exit_thread(struct thread *td, struct thread *child)
 {
 
 	KTR_STATE1(KTR_SCHED, "thread", sched_tdname(child), "exit",
 	    "prio:td", child->td_priority);
 	thread_lock(td);
 	td->td_estcpu = ESTCPULIM(td->td_estcpu + child->td_estcpu);
 	thread_unlock(td);
 	thread_lock(child);
 	if ((child->td_flags & TDF_NOLOAD) == 0)
 		sched_load_rem();
 	thread_unlock(child);
 }
 
 void
 sched_fork(struct thread *td, struct thread *childtd)
 {
 	sched_fork_thread(td, childtd);
 }
 
 void
 sched_fork_thread(struct thread *td, struct thread *childtd)
 {
 	struct td_sched *ts;
 
 	childtd->td_estcpu = td->td_estcpu;
 	childtd->td_lock = &sched_lock;
 	childtd->td_cpuset = cpuset_ref(td->td_cpuset);
 	childtd->td_priority = childtd->td_base_pri;
 	ts = childtd->td_sched;
 	bzero(ts, sizeof(*ts));
 	ts->ts_flags |= (td->td_sched->ts_flags & TSF_AFFINITY);
 }
 
 void
 sched_nice(struct proc *p, int nice)
 {
 	struct thread *td;
 
 	PROC_LOCK_ASSERT(p, MA_OWNED);
 	p->p_nice = nice;
 	FOREACH_THREAD_IN_PROC(p, td) {
 		thread_lock(td);
 		resetpriority(td);
 		resetpriority_thread(td);
 		thread_unlock(td);
 	}
 }
 
 void
 sched_class(struct thread *td, int class)
 {
 	THREAD_LOCK_ASSERT(td, MA_OWNED);
 	td->td_pri_class = class;
 }
 
 /*
  * Adjust the priority of a thread.
  */
 static void
 sched_priority(struct thread *td, u_char prio)
 {
 
 
 	KTR_POINT3(KTR_SCHED, "thread", sched_tdname(td), "priority change",
 	    "prio:%d", td->td_priority, "new prio:%d", prio, KTR_ATTR_LINKED,
 	    sched_tdname(curthread));
 	if (td != curthread && prio > td->td_priority) {
 		KTR_POINT3(KTR_SCHED, "thread", sched_tdname(curthread),
 		    "lend prio", "prio:%d", td->td_priority, "new prio:%d",
 		    prio, KTR_ATTR_LINKED, sched_tdname(td));
 	}
 	THREAD_LOCK_ASSERT(td, MA_OWNED);
 	if (td->td_priority == prio)
 		return;
 	td->td_priority = prio;
 	if (TD_ON_RUNQ(td) && td->td_rqindex != (prio / RQ_PPQ)) {
 		sched_rem(td);
 		sched_add(td, SRQ_BORING);
 	}
 }
 
 /*
  * Update a thread's priority when it is lent another thread's
  * priority.
  */
 void
 sched_lend_prio(struct thread *td, u_char prio)
 {
 
 	td->td_flags |= TDF_BORROWING;
 	sched_priority(td, prio);
 }
 
 /*
  * Restore a thread's priority when priority propagation is
  * over.  The prio argument is the minimum priority the thread
  * needs to have to satisfy other possible priority lending
  * requests.  If the thread's regular priority is less
  * important than prio the thread will keep a priority boost
  * of prio.
  */
 void
 sched_unlend_prio(struct thread *td, u_char prio)
 {
 	u_char base_pri;
 
 	if (td->td_base_pri >= PRI_MIN_TIMESHARE &&
 	    td->td_base_pri <= PRI_MAX_TIMESHARE)
 		base_pri = td->td_user_pri;
 	else
 		base_pri = td->td_base_pri;
 	if (prio >= base_pri) {
 		td->td_flags &= ~TDF_BORROWING;
 		sched_prio(td, base_pri);
 	} else
 		sched_lend_prio(td, prio);
 }
 
 void
 sched_prio(struct thread *td, u_char prio)
 {
 	u_char oldprio;
 
 	/* First, update the base priority. */
 	td->td_base_pri = prio;
 
 	/*
 	 * If the thread is borrowing another thread's priority, don't ever
 	 * lower the priority.
 	 */
 	if (td->td_flags & TDF_BORROWING && td->td_priority < prio)
 		return;
 
 	/* Change the real priority. */
 	oldprio = td->td_priority;
 	sched_priority(td, prio);
 
 	/*
 	 * If the thread is on a turnstile, then let the turnstile update
 	 * its state.
 	 */
 	if (TD_ON_LOCK(td) && oldprio != prio)
 		turnstile_adjust(td, oldprio);
 }
 
 void
 sched_user_prio(struct thread *td, u_char prio)
 {
 
 	THREAD_LOCK_ASSERT(td, MA_OWNED);
 	td->td_base_user_pri = prio;
 	if (td->td_lend_user_pri <= prio)
 		return;
 	td->td_user_pri = prio;
 }
 
 void
 sched_lend_user_prio(struct thread *td, u_char prio)
 {
 
 	THREAD_LOCK_ASSERT(td, MA_OWNED);
 	td->td_lend_user_pri = prio;
 	td->td_user_pri = min(prio, td->td_base_user_pri);
 	if (td->td_priority > td->td_user_pri)
 		sched_prio(td, td->td_user_pri);
 	else if (td->td_priority != td->td_user_pri)
 		td->td_flags |= TDF_NEEDRESCHED;
 }
 
 void
 sched_sleep(struct thread *td, int pri)
 {
 
 	THREAD_LOCK_ASSERT(td, MA_OWNED);
 	td->td_slptick = ticks;
 	td->td_sched->ts_slptime = 0;
 	if (pri != 0 && PRI_BASE(td->td_pri_class) == PRI_TIMESHARE)
 		sched_prio(td, pri);
 	if (TD_IS_SUSPENDED(td) || pri >= PSOCK)
 		td->td_flags |= TDF_CANSWAP;
 }
 
 void
 sched_switch(struct thread *td, struct thread *newtd, int flags)
 {
 	struct mtx *tmtx;
 	struct td_sched *ts;
 	struct proc *p;
 
 	tmtx = NULL;
 	ts = td->td_sched;
 	p = td->td_proc;
 
 	THREAD_LOCK_ASSERT(td, MA_OWNED);
 
 	/* 
 	 * Switch to the sched lock to fix things up and pick
 	 * a new thread.
 	 * Block the td_lock in order to avoid breaking the critical path.
 	 */
 	if (td->td_lock != &sched_lock) {
 		mtx_lock_spin(&sched_lock);
 		tmtx = thread_lock_block(td);
 	}
 
 	if ((td->td_flags & TDF_NOLOAD) == 0)
 		sched_load_rem();
 
 	td->td_lastcpu = td->td_oncpu;
 	if (!(flags & SW_PREEMPT))
 		td->td_flags &= ~TDF_NEEDRESCHED;
 	td->td_owepreempt = 0;
 	td->td_oncpu = NOCPU;
 
 	/*
 	 * At the last moment, if this thread is still marked RUNNING,
 	 * then put it back on the run queue as it has not been suspended
 	 * or stopped or anything else similar.  We never put the idle
 	 * threads on the run queue, however.
 	 */
 	if (td->td_flags & TDF_IDLETD) {
 		TD_SET_CAN_RUN(td);
 #ifdef SMP
-		idle_cpus_mask &= ~PCPU_GET(cpumask);
+		/* Spinlock held here, assume no migration. */
+		CPU_NAND(&idle_cpus_mask, PCPU_PTR(cpumask));
 #endif
 	} else {
 		if (TD_IS_RUNNING(td)) {
 			/* Put us back on the run queue. */
 			sched_add(td, (flags & SW_PREEMPT) ?
 			    SRQ_OURSELF|SRQ_YIELDING|SRQ_PREEMPTED :
 			    SRQ_OURSELF|SRQ_YIELDING);
 		}
 	}
 	if (newtd) {
 		/*
 		 * The thread we are about to run needs to be counted
 		 * as if it had been added to the run queue and selected.
 		 * It came from:
 		 * * A preemption
 		 * * An upcall
 		 * * A followon
 		 */
 		KASSERT((newtd->td_inhibitors == 0),
 			("trying to run inhibited thread"));
 		newtd->td_flags |= TDF_DIDRUN;
 		TD_SET_RUNNING(newtd);
 		if ((newtd->td_flags & TDF_NOLOAD) == 0)
 			sched_load_add();
 	} else {
 		newtd = choosethread();
 		MPASS(newtd->td_lock == &sched_lock);
 	}
 
 	if (td != newtd) {
 #ifdef	HWPMC_HOOKS
 		if (PMC_PROC_IS_USING_PMCS(td->td_proc))
 			PMC_SWITCH_CONTEXT(td, PMC_FN_CSW_OUT);
 #endif
                 /* I feel sleepy */
 		lock_profile_release_lock(&sched_lock.lock_object);
 #ifdef KDTRACE_HOOKS
 		/*
 		 * If DTrace has set the active vtime enum to anything
 		 * other than INACTIVE (0), then it should have set the
 		 * function to call.
 		 */
 		if (dtrace_vtime_active)
 			(*dtrace_vtime_switch_func)(newtd);
 #endif
 
 		cpu_switch(td, newtd, tmtx != NULL ? tmtx : td->td_lock);
 		lock_profile_obtain_lock_success(&sched_lock.lock_object,
 		    0, 0, __FILE__, __LINE__);
 		/*
 		 * Where am I?  What year is it?
 		 * We are in the same thread that went to sleep above,
 		 * but any amount of time may have passed. All our context
 		 * will still be available as will local variables.
 		 * PCPU values however may have changed as we may have
 		 * changed CPU so don't trust cached values of them.
 		 * New threads will go to fork_exit() instead of here
 		 * so if you change things here you may need to change
 		 * things there too.
 		 *
 		 * If the thread above was exiting it will never wake
 		 * up again here, so either it has saved everything it
 		 * needed to, or the thread_wait() or wait() will
 		 * need to reap it.
 		 */
 #ifdef	HWPMC_HOOKS
 		if (PMC_PROC_IS_USING_PMCS(td->td_proc))
 			PMC_SWITCH_CONTEXT(td, PMC_FN_CSW_IN);
 #endif
 	}
 
 #ifdef SMP
 	if (td->td_flags & TDF_IDLETD)
-		idle_cpus_mask |= PCPU_GET(cpumask);
+		CPU_OR(&idle_cpus_mask, PCPU_PTR(cpumask));
 #endif
 	sched_lock.mtx_lock = (uintptr_t)td;
 	td->td_oncpu = PCPU_GET(cpuid);
 	MPASS(td->td_lock == &sched_lock);
 }
 
 void
 sched_wakeup(struct thread *td)
 {
 	struct td_sched *ts;
 
 	THREAD_LOCK_ASSERT(td, MA_OWNED);
 	ts = td->td_sched;
 	td->td_flags &= ~TDF_CANSWAP;
 	if (ts->ts_slptime > 1) {
 		updatepri(td);
 		resetpriority(td);
 	}
 	td->td_slptick = 0;
 	ts->ts_slptime = 0;
 	sched_add(td, SRQ_BORING);
 }
 
 #ifdef SMP
 static int
 forward_wakeup(int cpunum)
 {
 	struct pcpu *pc;
-	cpumask_t dontuse, id, map, map2, me;
+	cpuset_t dontuse, id, map, map2, me;
+	int iscpuset;
 
 	mtx_assert(&sched_lock, MA_OWNED);
 
 	CTR0(KTR_RUNQ, "forward_wakeup()");
 
 	if ((!forward_wakeup_enabled) ||
 	     (forward_wakeup_use_mask == 0 && forward_wakeup_use_loop == 0))
 		return (0);
 	if (!smp_started || cold || panicstr)
 		return (0);
 
 	forward_wakeups_requested++;
 
 	/*
 	 * Check the idle mask we received against what we calculated
 	 * before in the old version.
+	 *
+	 * Also note that sched_lock is held now, thus no migration is
+	 * expected.
 	 */
 	me = PCPU_GET(cpumask);
 
 	/* Don't bother if we should be doing it ourself. */
-	if ((me & idle_cpus_mask) && (cpunum == NOCPU || me == (1 << cpunum)))
+	if (CPU_OVERLAP(&me, &idle_cpus_mask) &&
+	    (cpunum == NOCPU || CPU_ISSET(cpunum, &me)))
 		return (0);
 
-	dontuse = me | stopped_cpus | hlt_cpus_mask;
-	map2 = 0;
+	dontuse = me;
+	CPU_OR(&dontuse, &stopped_cpus);
+	CPU_OR(&dontuse, &hlt_cpus_mask);
+	CPU_ZERO(&map2);
 	if (forward_wakeup_use_loop) {
 		STAILQ_FOREACH(pc, &cpuhead, pc_allcpu) {
 			id = pc->pc_cpumask;
-			if ((id & dontuse) == 0 &&
+			if (!CPU_OVERLAP(&id, &dontuse) &&
 			    pc->pc_curthread == pc->pc_idlethread) {
-				map2 |= id;
+				CPU_OR(&map2, &id);
 			}
 		}
 	}
 
 	if (forward_wakeup_use_mask) {
-		map = 0;
-		map = idle_cpus_mask & ~dontuse;
+		map = idle_cpus_mask;
+		CPU_NAND(&map, &dontuse);
 
 		/* If they are both on, compare and use loop if different. */
 		if (forward_wakeup_use_loop) {
-			if (map != map2) {
+			if (CPU_CMP(&map, &map2)) {
 				printf("map != map2, loop method preferred\n");
 				map = map2;
 			}
 		}
 	} else {
 		map = map2;
 	}
 
 	/* If we only allow a specific CPU, then mask off all the others. */
 	if (cpunum != NOCPU) {
 		KASSERT((cpunum <= mp_maxcpus),("forward_wakeup: bad cpunum."));
-		map &= (1 << cpunum);
+		iscpuset = CPU_ISSET(cpunum, &map);
+		if (iscpuset == 0)
+			CPU_ZERO(&map);
+		else
+			CPU_SETOF(cpunum, &map);
 	}
-	if (map) {
+	if (!CPU_EMPTY(&map)) {
 		forward_wakeups_delivered++;
 		STAILQ_FOREACH(pc, &cpuhead, pc_allcpu) {
 			id = pc->pc_cpumask;
-			if ((map & id) == 0)
+			if (!CPU_OVERLAP(&map, &id))
 				continue;
 			if (cpu_idle_wakeup(pc->pc_cpuid))
-				map &= ~id;
+				CPU_NAND(&map, &id);
 		}
-		if (map)
+		if (!CPU_EMPTY(&map))
 			ipi_selected(map, IPI_AST);
 		return (1);
 	}
 	if (cpunum == NOCPU)
 		printf("forward_wakeup: Idle processor not found\n");
 	return (0);
 }
 
 static void
 kick_other_cpu(int pri, int cpuid)
 {
 	struct pcpu *pcpu;
 	int cpri;
 
 	pcpu = pcpu_find(cpuid);
-	if (idle_cpus_mask & pcpu->pc_cpumask) {
+	if (CPU_OVERLAP(&idle_cpus_mask, &pcpu->pc_cpumask)) {
 		forward_wakeups_delivered++;
 		if (!cpu_idle_wakeup(cpuid))
 			ipi_cpu(cpuid, IPI_AST);
 		return;
 	}
 
 	cpri = pcpu->pc_curthread->td_priority;
 	if (pri >= cpri)
 		return;
 
 #if defined(IPI_PREEMPTION) && defined(PREEMPTION)
 #if !defined(FULL_PREEMPTION)
 	if (pri <= PRI_MAX_ITHD)
 #endif /* ! FULL_PREEMPTION */
 	{
 		ipi_cpu(cpuid, IPI_PREEMPT);
 		return;
 	}
 #endif /* defined(IPI_PREEMPTION) && defined(PREEMPTION) */
 
 	pcpu->pc_curthread->td_flags |= TDF_NEEDRESCHED;
 	ipi_cpu(cpuid, IPI_AST);
 	return;
 }
 #endif /* SMP */
 
 #ifdef SMP
 static int
 sched_pickcpu(struct thread *td)
 {
 	int best, cpu;
 
 	mtx_assert(&sched_lock, MA_OWNED);
 
 	if (THREAD_CAN_SCHED(td, td->td_lastcpu))
 		best = td->td_lastcpu;
 	else
 		best = NOCPU;
 	CPU_FOREACH(cpu) {
 		if (!THREAD_CAN_SCHED(td, cpu))
 			continue;
 	
 		if (best == NOCPU)
 			best = cpu;
 		else if (runq_length[cpu] < runq_length[best])
 			best = cpu;
 	}
 	KASSERT(best != NOCPU, ("no valid CPUs"));
 
 	return (best);
 }
 #endif
 
 void
 sched_add(struct thread *td, int flags)
 #ifdef SMP
 {
+	cpuset_t idle, me, tidlemsk;
 	struct td_sched *ts;
 	int forwarded = 0;
 	int cpu;
 	int single_cpu = 0;
 
 	ts = td->td_sched;
 	THREAD_LOCK_ASSERT(td, MA_OWNED);
 	KASSERT((td->td_inhibitors == 0),
 	    ("sched_add: trying to run inhibited thread"));
 	KASSERT((TD_CAN_RUN(td) || TD_IS_RUNNING(td)),
 	    ("sched_add: bad thread state"));
 	KASSERT(td->td_flags & TDF_INMEM,
 	    ("sched_add: thread swapped out"));
 
 	KTR_STATE2(KTR_SCHED, "thread", sched_tdname(td), "runq add",
 	    "prio:%d", td->td_priority, KTR_ATTR_LINKED,
 	    sched_tdname(curthread));
 	KTR_POINT1(KTR_SCHED, "thread", sched_tdname(curthread), "wokeup",
 	    KTR_ATTR_LINKED, sched_tdname(td));
 
 
 	/*
 	 * Now that the thread is moving to the run-queue, set the lock
 	 * to the scheduler's lock.
 	 */
 	if (td->td_lock != &sched_lock) {
 		mtx_lock_spin(&sched_lock);
 		thread_lock_set(td, &sched_lock);
 	}
 	TD_SET_RUNQ(td);
 
 	/*
 	 * If SMP is started and the thread is pinned or otherwise limited to
 	 * a specific set of CPUs, queue the thread to a per-CPU run queue.
 	 * Otherwise, queue the thread to the global run queue.
 	 *
 	 * If SMP has not yet been started we must use the global run queue
 	 * as per-CPU state may not be initialized yet and we may crash if we
 	 * try to access the per-CPU run queues.
 	 */
 	if (smp_started && (td->td_pinned != 0 || td->td_flags & TDF_BOUND ||
 	    ts->ts_flags & TSF_AFFINITY)) {
 		if (td->td_pinned != 0)
 			cpu = td->td_lastcpu;
 		else if (td->td_flags & TDF_BOUND) {
 			/* Find CPU from bound runq. */
 			KASSERT(SKE_RUNQ_PCPU(ts),
 			    ("sched_add: bound td_sched not on cpu runq"));
 			cpu = ts->ts_runq - &runq_pcpu[0];
 		} else
 			/* Find a valid CPU for our cpuset */
 			cpu = sched_pickcpu(td);
 		ts->ts_runq = &runq_pcpu[cpu];
 		single_cpu = 1;
 		CTR3(KTR_RUNQ,
 		    "sched_add: Put td_sched:%p(td:%p) on cpu%d runq", ts, td,
 		    cpu);
 	} else {
 		CTR2(KTR_RUNQ,
 		    "sched_add: adding td_sched:%p (td:%p) to gbl runq", ts,
 		    td);
 		cpu = NOCPU;
 		ts->ts_runq = &runq;
 	}
 
 	if (single_cpu && (cpu != PCPU_GET(cpuid))) {
 	        kick_other_cpu(td->td_priority, cpu);
 	} else {
 		if (!single_cpu) {
-			cpumask_t me = PCPU_GET(cpumask);
-			cpumask_t idle = idle_cpus_mask & me;
 
-			if (!idle && ((flags & SRQ_INTR) == 0) &&
-			    (idle_cpus_mask & ~(hlt_cpus_mask | me)))
+			/*
+			 * Thread spinlock is held here, assume no
+			 * migration is possible.
+			 */
+			me = PCPU_GET(cpumask);
+			idle = idle_cpus_mask;
+			tidlemsk = idle;
+			CPU_AND(&idle, &me);
+			CPU_OR(&me, &hlt_cpus_mask);
+			CPU_NAND(&tidlemsk, &me);
+
+			if (CPU_EMPTY(&idle) && ((flags & SRQ_INTR) == 0) &&
+			    !CPU_EMPTY(&tidlemsk))
 				forwarded = forward_wakeup(cpu);
 		}
 
 		if (!forwarded) {
 			if ((flags & SRQ_YIELDING) == 0 && maybe_preempt(td))
 				return;
 			else
 				maybe_resched(td);
 		}
 	}
 
 	if ((td->td_flags & TDF_NOLOAD) == 0)
 		sched_load_add();
 	runq_add(ts->ts_runq, td, flags);
 	if (cpu != NOCPU)
 		runq_length[cpu]++;
 }
 #else /* SMP */
 {
 	struct td_sched *ts;
 
 	ts = td->td_sched;
 	THREAD_LOCK_ASSERT(td, MA_OWNED);
 	KASSERT((td->td_inhibitors == 0),
 	    ("sched_add: trying to run inhibited thread"));
 	KASSERT((TD_CAN_RUN(td) || TD_IS_RUNNING(td)),
 	    ("sched_add: bad thread state"));
 	KASSERT(td->td_flags & TDF_INMEM,
 	    ("sched_add: thread swapped out"));
 	KTR_STATE2(KTR_SCHED, "thread", sched_tdname(td), "runq add",
 	    "prio:%d", td->td_priority, KTR_ATTR_LINKED,
 	    sched_tdname(curthread));
 	KTR_POINT1(KTR_SCHED, "thread", sched_tdname(curthread), "wokeup",
 	    KTR_ATTR_LINKED, sched_tdname(td));
 
 	/*
 	 * Now that the thread is moving to the run-queue, set the lock
 	 * to the scheduler's lock.
 	 */
 	if (td->td_lock != &sched_lock) {
 		mtx_lock_spin(&sched_lock);
 		thread_lock_set(td, &sched_lock);
 	}
 	TD_SET_RUNQ(td);
 	CTR2(KTR_RUNQ, "sched_add: adding td_sched:%p (td:%p) to runq", ts, td);
 	ts->ts_runq = &runq;
 
 	/*
 	 * If we are yielding (on the way out anyhow) or the thread
 	 * being saved is US, then don't try to be smart about preemption
 	 * or kicking off another CPU as it won't help and may hinder.
 	 * In the YIELDING case, we are about to run whoever is being
 	 * put in the queue anyhow, and in the OURSELF case, we are
 	 * putting ourself on the run queue which also only happens
 	 * when we are about to yield.
 	 */
 	if ((flags & SRQ_YIELDING) == 0) {
 		if (maybe_preempt(td))
 			return;
 	}
 	if ((td->td_flags & TDF_NOLOAD) == 0)
 		sched_load_add();
 	runq_add(ts->ts_runq, td, flags);
 	maybe_resched(td);
 }
 #endif /* SMP */
 
 void
 sched_rem(struct thread *td)
 {
 	struct td_sched *ts;
 
 	ts = td->td_sched;
 	KASSERT(td->td_flags & TDF_INMEM,
 	    ("sched_rem: thread swapped out"));
 	KASSERT(TD_ON_RUNQ(td),
 	    ("sched_rem: thread not on run queue"));
 	mtx_assert(&sched_lock, MA_OWNED);
 	KTR_STATE2(KTR_SCHED, "thread", sched_tdname(td), "runq rem",
 	    "prio:%d", td->td_priority, KTR_ATTR_LINKED,
 	    sched_tdname(curthread));
 
 	if ((td->td_flags & TDF_NOLOAD) == 0)
 		sched_load_rem();
 #ifdef SMP
 	if (ts->ts_runq != &runq)
 		runq_length[ts->ts_runq - runq_pcpu]--;
 #endif
 	runq_remove(ts->ts_runq, td);
 	TD_SET_CAN_RUN(td);
 }
 
 /*
  * Select threads to run.  Note that running threads still consume a
  * slot.
  */
 struct thread *
 sched_choose(void)
 {
 	struct thread *td;
 	struct runq *rq;
 
 	mtx_assert(&sched_lock,  MA_OWNED);
 #ifdef SMP
 	struct thread *tdcpu;
 
 	rq = &runq;
 	td = runq_choose_fuzz(&runq, runq_fuzz);
 	tdcpu = runq_choose(&runq_pcpu[PCPU_GET(cpuid)]);
 
 	if (td == NULL ||
 	    (tdcpu != NULL &&
 	     tdcpu->td_priority < td->td_priority)) {
 		CTR2(KTR_RUNQ, "choosing td %p from pcpu runq %d", tdcpu,
 		     PCPU_GET(cpuid));
 		td = tdcpu;
 		rq = &runq_pcpu[PCPU_GET(cpuid)];
 	} else {
 		CTR1(KTR_RUNQ, "choosing td_sched %p from main runq", td);
 	}
 
 #else
 	rq = &runq;
 	td = runq_choose(&runq);
 #endif
 
 	if (td) {
 #ifdef SMP
 		if (td == tdcpu)
 			runq_length[PCPU_GET(cpuid)]--;
 #endif
 		runq_remove(rq, td);
 		td->td_flags |= TDF_DIDRUN;
 
 		KASSERT(td->td_flags & TDF_INMEM,
 		    ("sched_choose: thread swapped out"));
 		return (td);
 	}
 	return (PCPU_GET(idlethread));
 }
 
 void
 sched_preempt(struct thread *td)
 {
 	thread_lock(td);
 	if (td->td_critnest > 1)
 		td->td_owepreempt = 1;
 	else
 		mi_switch(SW_INVOL | SW_PREEMPT | SWT_PREEMPT, NULL);
 	thread_unlock(td);
 }
 
 void
 sched_userret(struct thread *td)
 {
 	/*
 	 * XXX we cheat slightly on the locking here to avoid locking in
 	 * the usual case.  Setting td_priority here is essentially an
 	 * incomplete workaround for not setting it properly elsewhere.
 	 * Now that some interrupt handlers are threads, not setting it
 	 * properly elsewhere can clobber it in the window between setting
 	 * it here and returning to user mode, so don't waste time setting
 	 * it perfectly here.
 	 */
 	KASSERT((td->td_flags & TDF_BORROWING) == 0,
 	    ("thread with borrowed priority returning to userland"));
 	if (td->td_priority != td->td_user_pri) {
 		thread_lock(td);
 		td->td_priority = td->td_user_pri;
 		td->td_base_pri = td->td_user_pri;
 		thread_unlock(td);
 	}
 }
 
 void
 sched_bind(struct thread *td, int cpu)
 {
 	struct td_sched *ts;
 
 	THREAD_LOCK_ASSERT(td, MA_OWNED|MA_NOTRECURSED);
 	KASSERT(td == curthread, ("sched_bind: can only bind curthread"));
 
 	ts = td->td_sched;
 
 	td->td_flags |= TDF_BOUND;
 #ifdef SMP
 	ts->ts_runq = &runq_pcpu[cpu];
 	if (PCPU_GET(cpuid) == cpu)
 		return;
 
 	mi_switch(SW_VOL, NULL);
 #endif
 }
 
 void
 sched_unbind(struct thread* td)
 {
 	THREAD_LOCK_ASSERT(td, MA_OWNED);
 	KASSERT(td == curthread, ("sched_unbind: can only unbind curthread"));
 	td->td_flags &= ~TDF_BOUND;
 }
 
 int
 sched_is_bound(struct thread *td)
 {
 	THREAD_LOCK_ASSERT(td, MA_OWNED);
 	return (td->td_flags & TDF_BOUND);
 }
 
 void
 sched_relinquish(struct thread *td)
 {
 	thread_lock(td);
 	mi_switch(SW_VOL | SWT_RELINQUISH, NULL);
 	thread_unlock(td);
 }
 
 int
 sched_load(void)
 {
 	return (sched_tdcnt);
 }
 
 int
 sched_sizeof_proc(void)
 {
 	return (sizeof(struct proc));
 }
 
 int
 sched_sizeof_thread(void)
 {
 	return (sizeof(struct thread) + sizeof(struct td_sched));
 }
 
 fixpt_t
 sched_pctcpu(struct thread *td)
 {
 	struct td_sched *ts;
 
 	THREAD_LOCK_ASSERT(td, MA_OWNED);
 	ts = td->td_sched;
 	return (ts->ts_pctcpu);
 }
 
 void
 sched_tick(int cnt)
 {
 }
 
 /*
  * The actual idle process.
  */
 void
 sched_idletd(void *dummy)
 {
 	struct pcpuidlestat *stat;
 
 	stat = DPCPU_PTR(idlestat);
 	for (;;) {
 		mtx_assert(&Giant, MA_NOTOWNED);
 
 		while (sched_runnable() == 0) {
 			cpu_idle(stat->idlecalls + stat->oldidlecalls > 64);
 			stat->idlecalls++;
 		}
 
 		mtx_lock_spin(&sched_lock);
 		mi_switch(SW_VOL | SWT_IDLE, NULL);
 		mtx_unlock_spin(&sched_lock);
 	}
 }
 
 /*
  * A CPU is entering for the first time or a thread is exiting.
  */
 void
 sched_throw(struct thread *td)
 {
 	/*
 	 * Correct spinlock nesting.  The idle thread context that we are
 	 * borrowing was created so that it would start out with a single
 	 * spin lock (sched_lock) held in fork_trampoline().  Since we've
 	 * explicitly acquired locks in this function, the nesting count
 	 * is now 2 rather than 1.  Since we are nested, calling
 	 * spinlock_exit() will simply adjust the counts without allowing
 	 * spin lock using code to interrupt us.
 	 */
 	if (td == NULL) {
 		mtx_lock_spin(&sched_lock);
 		spinlock_exit();
 	} else {
 		lock_profile_release_lock(&sched_lock.lock_object);
 		MPASS(td->td_lock == &sched_lock);
 	}
 	mtx_assert(&sched_lock, MA_OWNED);
 	KASSERT(curthread->td_md.md_spinlock_count == 1, ("invalid count"));
 	PCPU_SET(switchtime, cpu_ticks());
 	PCPU_SET(switchticks, ticks);
 	cpu_throw(td, choosethread());	/* doesn't return */
 }
 
 void
 sched_fork_exit(struct thread *td)
 {
 
 	/*
 	 * Finish setting up thread glue so that it begins execution in a
 	 * non-nested critical section with sched_lock held but not recursed.
 	 */
 	td->td_oncpu = PCPU_GET(cpuid);
 	sched_lock.mtx_lock = (uintptr_t)td;
 	lock_profile_obtain_lock_success(&sched_lock.lock_object,
 	    0, 0, __FILE__, __LINE__);
 	THREAD_LOCK_ASSERT(td, MA_OWNED | MA_NOTRECURSED);
 }
 
 char *
 sched_tdname(struct thread *td)
 {
 #ifdef KTR
 	struct td_sched *ts;
 
 	ts = td->td_sched;
 	if (ts->ts_name[0] == '\0')
 		snprintf(ts->ts_name, sizeof(ts->ts_name),
 		    "%s tid %d", td->td_name, td->td_tid);
 	return (ts->ts_name);
 #else   
 	return (td->td_name);
 #endif
 }
 
 void
 sched_affinity(struct thread *td)
 {
 #ifdef SMP
 	struct td_sched *ts;
 	int cpu;
 
 	THREAD_LOCK_ASSERT(td, MA_OWNED);	
 
 	/*
 	 * Set the TSF_AFFINITY flag if there is at least one CPU this
 	 * thread can't run on.
 	 */
 	ts = td->td_sched;
 	ts->ts_flags &= ~TSF_AFFINITY;
 	CPU_FOREACH(cpu) {
 		if (!THREAD_CAN_SCHED(td, cpu)) {
 			ts->ts_flags |= TSF_AFFINITY;
 			break;
 		}
 	}
 
 	/*
 	 * If this thread can run on all CPUs, nothing else to do.
 	 */
 	if (!(ts->ts_flags & TSF_AFFINITY))
 		return;
 
 	/* Pinned threads and bound threads should be left alone. */
 	if (td->td_pinned != 0 || td->td_flags & TDF_BOUND)
 		return;
 
 	switch (td->td_state) {
 	case TDS_RUNQ:
 		/*
 		 * If we are on a per-CPU runqueue that is in the set,
 		 * then nothing needs to be done.
 		 */
 		if (ts->ts_runq != &runq &&
 		    THREAD_CAN_SCHED(td, ts->ts_runq - runq_pcpu))
 			return;
 
 		/* Put this thread on a valid per-CPU runqueue. */
 		sched_rem(td);
 		sched_add(td, SRQ_BORING);
 		break;
 	case TDS_RUNNING:
 		/*
 		 * See if our current CPU is in the set.  If not, force a
 		 * context switch.
 		 */
 		if (THREAD_CAN_SCHED(td, td->td_oncpu))
 			return;
 
 		td->td_flags |= TDF_NEEDRESCHED;
 		if (td != curthread)
 			ipi_cpu(cpu, IPI_AST);
 		break;
 	default:
 		break;
 	}
 #endif
 }
Index: head/sys/kern/sched_ule.c
===================================================================
--- head/sys/kern/sched_ule.c	(revision 222812)
+++ head/sys/kern/sched_ule.c	(revision 222813)
@@ -1,2762 +1,2763 @@
 /*-
  * Copyright (c) 2002-2007, Jeffrey Roberson <jeff@freebsd.org>
  * All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
  * 1. Redistributions of source code must retain the above copyright
  *    notice unmodified, this list of conditions, and the following
  *    disclaimer.
  * 2. Redistributions in binary form must reproduce the above copyright
  *    notice, this list of conditions and the following disclaimer in the
  *    documentation and/or other materials provided with the distribution.
  *
  * THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR
  * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
  * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
  * IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT,
  * INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT
  * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
  * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
  * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
  * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF
  * THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
  */
 
 /*
  * This file implements the ULE scheduler.  ULE supports independent CPU
  * run queues and fine grain locking.  It has superior interactive
  * performance under load even on uni-processor systems.
  *
  * etymology:
  *   ULE is the last three letters in schedule.  It owes its name to a
  * generic user created for a scheduling system by Paul Mikesell at
  * Isilon Systems and a general lack of creativity on the part of the author.
  */
 
 #include <sys/cdefs.h>
 __FBSDID("$FreeBSD$");
 
 #include "opt_hwpmc_hooks.h"
 #include "opt_kdtrace.h"
 #include "opt_sched.h"
 
 #include <sys/param.h>
 #include <sys/systm.h>
 #include <sys/kdb.h>
 #include <sys/kernel.h>
 #include <sys/ktr.h>
 #include <sys/lock.h>
 #include <sys/mutex.h>
 #include <sys/proc.h>
 #include <sys/resource.h>
 #include <sys/resourcevar.h>
 #include <sys/sched.h>
 #include <sys/smp.h>
 #include <sys/sx.h>
 #include <sys/sysctl.h>
 #include <sys/sysproto.h>
 #include <sys/turnstile.h>
 #include <sys/umtx.h>
 #include <sys/vmmeter.h>
 #include <sys/cpuset.h>
 #include <sys/sbuf.h>
 
 #ifdef HWPMC_HOOKS
 #include <sys/pmckern.h>
 #endif
 
 #ifdef KDTRACE_HOOKS
 #include <sys/dtrace_bsd.h>
 int				dtrace_vtime_active;
 dtrace_vtime_switch_func_t	dtrace_vtime_switch_func;
 #endif
 
 #include <machine/cpu.h>
 #include <machine/smp.h>
 
 #if defined(__sparc64__)
 #error "This architecture is not currently compatible with ULE"
 #endif
 
 #define	KTR_ULE	0
 
 #define	TS_NAME_LEN (MAXCOMLEN + sizeof(" td ") + sizeof(__XSTRING(UINT_MAX)))
 #define	TDQ_NAME_LEN	(sizeof("sched lock ") + sizeof(__XSTRING(MAXCPU)))
 #define	TDQ_LOADNAME_LEN	(PCPU_NAME_LEN + sizeof(" load"))
 
 /*
  * Thread scheduler specific section.  All fields are protected
  * by the thread lock.
  */
 struct td_sched {	
 	struct runq	*ts_runq;	/* Run-queue we're queued on. */
 	short		ts_flags;	/* TSF_* flags. */
 	u_char		ts_cpu;		/* CPU that we have affinity for. */
 	int		ts_rltick;	/* Real last tick, for affinity. */
 	int		ts_slice;	/* Ticks of slice remaining. */
 	u_int		ts_slptime;	/* Number of ticks we vol. slept */
 	u_int		ts_runtime;	/* Number of ticks we were running */
 	int		ts_ltick;	/* Last tick that we were running on */
 	int		ts_incrtick;	/* Last tick that we incremented on */
 	int		ts_ftick;	/* First tick that we were running on */
 	int		ts_ticks;	/* Tick count */
 #ifdef KTR
 	char		ts_name[TS_NAME_LEN];
 #endif
 };
 /* flags kept in ts_flags */
 #define	TSF_BOUND	0x0001		/* Thread can not migrate. */
 #define	TSF_XFERABLE	0x0002		/* Thread was added as transferable. */
 
 static struct td_sched td_sched0;
 
 #define	THREAD_CAN_MIGRATE(td)	((td)->td_pinned == 0)
 #define	THREAD_CAN_SCHED(td, cpu)	\
     CPU_ISSET((cpu), &(td)->td_cpuset->cs_mask)
 
 /*
  * Priority ranges used for interactive and non-interactive timeshare
  * threads.  The timeshare priorities are split up into four ranges.
  * The first range handles interactive threads.  The last three ranges
  * (NHALF, x, and NHALF) handle non-interactive threads with the outer
  * ranges supporting nice values.
  */
 #define	PRI_TIMESHARE_RANGE	(PRI_MAX_TIMESHARE - PRI_MIN_TIMESHARE + 1)
 #define	PRI_INTERACT_RANGE	((PRI_TIMESHARE_RANGE - SCHED_PRI_NRESV) / 2)
 
 #define	PRI_MIN_INTERACT	PRI_MIN_TIMESHARE
 #define	PRI_MAX_INTERACT	(PRI_MIN_TIMESHARE + PRI_INTERACT_RANGE - 1)
 #define	PRI_MIN_BATCH		(PRI_MIN_TIMESHARE + PRI_INTERACT_RANGE)
 #define	PRI_MAX_BATCH		PRI_MAX_TIMESHARE
 
 /*
  * Cpu percentage computation macros and defines.
  *
  * SCHED_TICK_SECS:	Number of seconds to average the cpu usage across.
  * SCHED_TICK_TARG:	Number of hz ticks to average the cpu usage across.
  * SCHED_TICK_MAX:	Maximum number of ticks before scaling back.
  * SCHED_TICK_SHIFT:	Shift factor to avoid rounding away results.
  * SCHED_TICK_HZ:	Compute the number of hz ticks for a given ticks count.
  * SCHED_TICK_TOTAL:	Gives the amount of time we've been recording ticks.
  */
 #define	SCHED_TICK_SECS		10
 #define	SCHED_TICK_TARG		(hz * SCHED_TICK_SECS)
 #define	SCHED_TICK_MAX		(SCHED_TICK_TARG + hz)
 #define	SCHED_TICK_SHIFT	10
 #define	SCHED_TICK_HZ(ts)	((ts)->ts_ticks >> SCHED_TICK_SHIFT)
 #define	SCHED_TICK_TOTAL(ts)	(max((ts)->ts_ltick - (ts)->ts_ftick, hz))
 
 /*
  * These macros determine priorities for non-interactive threads.  They are
  * assigned a priority based on their recent cpu utilization as expressed
  * by the ratio of ticks to the tick total.  NHALF priorities at the start
  * and end of the MIN to MAX timeshare range are only reachable with negative
  * or positive nice respectively.
  *
  * PRI_RANGE:	Priority range for utilization dependent priorities.
  * PRI_NRESV:	Number of nice values.
  * PRI_TICKS:	Compute a priority in PRI_RANGE from the ticks count and total.
  * PRI_NICE:	Determines the part of the priority inherited from nice.
  */
 #define	SCHED_PRI_NRESV		(PRIO_MAX - PRIO_MIN)
 #define	SCHED_PRI_NHALF		(SCHED_PRI_NRESV / 2)
 #define	SCHED_PRI_MIN		(PRI_MIN_BATCH + SCHED_PRI_NHALF)
 #define	SCHED_PRI_MAX		(PRI_MAX_BATCH - SCHED_PRI_NHALF)
 #define	SCHED_PRI_RANGE		(SCHED_PRI_MAX - SCHED_PRI_MIN + 1)
 #define	SCHED_PRI_TICKS(ts)						\
     (SCHED_TICK_HZ((ts)) /						\
     (roundup(SCHED_TICK_TOTAL((ts)), SCHED_PRI_RANGE) / SCHED_PRI_RANGE))
 #define	SCHED_PRI_NICE(nice)	(nice)
 
 /*
  * These determine the interactivity of a process.  Interactivity differs from
  * cpu utilization in that it expresses the voluntary time slept vs time ran
  * while cpu utilization includes all time not running.  This more accurately
  * models the intent of the thread.
  *
  * SLP_RUN_MAX:	Maximum amount of sleep time + run time we'll accumulate
  *		before throttling back.
  * SLP_RUN_FORK:	Maximum slp+run time to inherit at fork time.
  * INTERACT_MAX:	Maximum interactivity value.  Smaller is better.
  * INTERACT_THRESH:	Threshold for placement on the current runq.
  */
 #define	SCHED_SLP_RUN_MAX	((hz * 5) << SCHED_TICK_SHIFT)
 #define	SCHED_SLP_RUN_FORK	((hz / 2) << SCHED_TICK_SHIFT)
 #define	SCHED_INTERACT_MAX	(100)
 #define	SCHED_INTERACT_HALF	(SCHED_INTERACT_MAX / 2)
 #define	SCHED_INTERACT_THRESH	(30)
 
 /*
  * tickincr:		Converts a stathz tick into a hz domain scaled by
  *			the shift factor.  Without the shift the error rate
  *			due to rounding would be unacceptably high.
  * realstathz:		stathz is sometimes 0 and run off of hz.
  * sched_slice:		Runtime of each thread before rescheduling.
  * preempt_thresh:	Priority threshold for preemption and remote IPIs.
  */
 static int sched_interact = SCHED_INTERACT_THRESH;
 static int realstathz;
 static int tickincr;
 static int sched_slice = 1;
 #ifdef PREEMPTION
 #ifdef FULL_PREEMPTION
 static int preempt_thresh = PRI_MAX_IDLE;
 #else
 static int preempt_thresh = PRI_MIN_KERN;
 #endif
 #else 
 static int preempt_thresh = 0;
 #endif
 static int static_boost = PRI_MIN_BATCH;
 static int sched_idlespins = 10000;
 static int sched_idlespinthresh = 16;
 
 /*
  * tdq - per processor runqs and statistics.  All fields are protected by the
  * tdq_lock.  The load and lowpri may be accessed without to avoid excess
  * locking in sched_pickcpu();
  */
 struct tdq {
 	/* Ordered to improve efficiency of cpu_search() and switch(). */
 	struct mtx	tdq_lock;		/* run queue lock. */
 	struct cpu_group *tdq_cg;		/* Pointer to cpu topology. */
 	volatile int	tdq_load;		/* Aggregate load. */
 	volatile int	tdq_cpu_idle;		/* cpu_idle() is active. */
 	int		tdq_sysload;		/* For loadavg, !ITHD load. */
 	int		tdq_transferable;	/* Transferable thread count. */
 	short		tdq_switchcnt;		/* Switches this tick. */
 	short		tdq_oldswitchcnt;	/* Switches last tick. */
 	u_char		tdq_lowpri;		/* Lowest priority thread. */
 	u_char		tdq_ipipending;		/* IPI pending. */
 	u_char		tdq_idx;		/* Current insert index. */
 	u_char		tdq_ridx;		/* Current removal index. */
 	struct runq	tdq_realtime;		/* real-time run queue. */
 	struct runq	tdq_timeshare;		/* timeshare run queue. */
 	struct runq	tdq_idle;		/* Queue of IDLE threads. */
 	char		tdq_name[TDQ_NAME_LEN];
 #ifdef KTR
 	char		tdq_loadname[TDQ_LOADNAME_LEN];
 #endif
 } __aligned(64);
 
 /* Idle thread states and config. */
 #define	TDQ_RUNNING	1
 #define	TDQ_IDLE	2
 
 #ifdef SMP
 struct cpu_group *cpu_top;		/* CPU topology */
 
 #define	SCHED_AFFINITY_DEFAULT	(max(1, hz / 1000))
 #define	SCHED_AFFINITY(ts, t)	((ts)->ts_rltick > ticks - ((t) * affinity))
 
 /*
  * Run-time tunables.
  */
 static int rebalance = 1;
 static int balance_interval = 128;	/* Default set in sched_initticks(). */
 static int affinity;
 static int steal_htt = 1;
 static int steal_idle = 1;
 static int steal_thresh = 2;
 
 /*
  * One thread queue per processor.
  */
 static struct tdq	tdq_cpu[MAXCPU];
 static struct tdq	*balance_tdq;
 static int balance_ticks;
 
 #define	TDQ_SELF()	(&tdq_cpu[PCPU_GET(cpuid)])
 #define	TDQ_CPU(x)	(&tdq_cpu[(x)])
 #define	TDQ_ID(x)	((int)((x) - tdq_cpu))
 #else	/* !SMP */
 static struct tdq	tdq_cpu;
 
 #define	TDQ_ID(x)	(0)
 #define	TDQ_SELF()	(&tdq_cpu)
 #define	TDQ_CPU(x)	(&tdq_cpu)
 #endif
 
 #define	TDQ_LOCK_ASSERT(t, type)	mtx_assert(TDQ_LOCKPTR((t)), (type))
 #define	TDQ_LOCK(t)		mtx_lock_spin(TDQ_LOCKPTR((t)))
 #define	TDQ_LOCK_FLAGS(t, f)	mtx_lock_spin_flags(TDQ_LOCKPTR((t)), (f))
 #define	TDQ_UNLOCK(t)		mtx_unlock_spin(TDQ_LOCKPTR((t)))
 #define	TDQ_LOCKPTR(t)		(&(t)->tdq_lock)
 
 static void sched_priority(struct thread *);
 static void sched_thread_priority(struct thread *, u_char);
 static int sched_interact_score(struct thread *);
 static void sched_interact_update(struct thread *);
 static void sched_interact_fork(struct thread *);
 static void sched_pctcpu_update(struct td_sched *);
 
 /* Operations on per processor queues */
 static struct thread *tdq_choose(struct tdq *);
 static void tdq_setup(struct tdq *);
 static void tdq_load_add(struct tdq *, struct thread *);
 static void tdq_load_rem(struct tdq *, struct thread *);
 static __inline void tdq_runq_add(struct tdq *, struct thread *, int);
 static __inline void tdq_runq_rem(struct tdq *, struct thread *);
 static inline int sched_shouldpreempt(int, int, int);
 void tdq_print(int cpu);
 static void runq_print(struct runq *rq);
 static void tdq_add(struct tdq *, struct thread *, int);
 #ifdef SMP
 static int tdq_move(struct tdq *, struct tdq *);
 static int tdq_idled(struct tdq *);
 static void tdq_notify(struct tdq *, struct thread *);
 static struct thread *tdq_steal(struct tdq *, int);
 static struct thread *runq_steal(struct runq *, int);
 static int sched_pickcpu(struct thread *, int);
 static void sched_balance(void);
 static int sched_balance_pair(struct tdq *, struct tdq *);
 static inline struct tdq *sched_setcpu(struct thread *, int, int);
 static inline void thread_unblock_switch(struct thread *, struct mtx *);
 static struct mtx *sched_switch_migrate(struct tdq *, struct thread *, int);
 static int sysctl_kern_sched_topology_spec(SYSCTL_HANDLER_ARGS);
 static int sysctl_kern_sched_topology_spec_internal(struct sbuf *sb, 
     struct cpu_group *cg, int indent);
 #endif
 
 static void sched_setup(void *dummy);
 SYSINIT(sched_setup, SI_SUB_RUN_QUEUE, SI_ORDER_FIRST, sched_setup, NULL);
 
 static void sched_initticks(void *dummy);
 SYSINIT(sched_initticks, SI_SUB_CLOCKS, SI_ORDER_THIRD, sched_initticks,
     NULL);
 
 /*
  * Print the threads waiting on a run-queue.
  */
 static void
 runq_print(struct runq *rq)
 {
 	struct rqhead *rqh;
 	struct thread *td;
 	int pri;
 	int j;
 	int i;
 
 	for (i = 0; i < RQB_LEN; i++) {
 		printf("\t\trunq bits %d 0x%zx\n",
 		    i, rq->rq_status.rqb_bits[i]);
 		for (j = 0; j < RQB_BPW; j++)
 			if (rq->rq_status.rqb_bits[i] & (1ul << j)) {
 				pri = j + (i << RQB_L2BPW);
 				rqh = &rq->rq_queues[pri];
 				TAILQ_FOREACH(td, rqh, td_runq) {
 					printf("\t\t\ttd %p(%s) priority %d rqindex %d pri %d\n",
 					    td, td->td_name, td->td_priority,
 					    td->td_rqindex, pri);
 				}
 			}
 	}
 }
 
 /*
  * Print the status of a per-cpu thread queue.  Should be a ddb show cmd.
  */
 void
 tdq_print(int cpu)
 {
 	struct tdq *tdq;
 
 	tdq = TDQ_CPU(cpu);
 
 	printf("tdq %d:\n", TDQ_ID(tdq));
 	printf("\tlock            %p\n", TDQ_LOCKPTR(tdq));
 	printf("\tLock name:      %s\n", tdq->tdq_name);
 	printf("\tload:           %d\n", tdq->tdq_load);
 	printf("\tswitch cnt:     %d\n", tdq->tdq_switchcnt);
 	printf("\told switch cnt: %d\n", tdq->tdq_oldswitchcnt);
 	printf("\ttimeshare idx:  %d\n", tdq->tdq_idx);
 	printf("\ttimeshare ridx: %d\n", tdq->tdq_ridx);
 	printf("\tload transferable: %d\n", tdq->tdq_transferable);
 	printf("\tlowest priority:   %d\n", tdq->tdq_lowpri);
 	printf("\trealtime runq:\n");
 	runq_print(&tdq->tdq_realtime);
 	printf("\ttimeshare runq:\n");
 	runq_print(&tdq->tdq_timeshare);
 	printf("\tidle runq:\n");
 	runq_print(&tdq->tdq_idle);
 }
 
 static inline int
 sched_shouldpreempt(int pri, int cpri, int remote)
 {
 	/*
 	 * If the new priority is not better than the current priority there is
 	 * nothing to do.
 	 */
 	if (pri >= cpri)
 		return (0);
 	/*
 	 * Always preempt idle.
 	 */
 	if (cpri >= PRI_MIN_IDLE)
 		return (1);
 	/*
 	 * If preemption is disabled don't preempt others.
 	 */
 	if (preempt_thresh == 0)
 		return (0);
 	/*
 	 * Preempt if we exceed the threshold.
 	 */
 	if (pri <= preempt_thresh)
 		return (1);
 	/*
 	 * If we're interactive or better and there is non-interactive
 	 * or worse running preempt only remote processors.
 	 */
 	if (remote && pri <= PRI_MAX_INTERACT && cpri > PRI_MAX_INTERACT)
 		return (1);
 	return (0);
 }
 
 #define	TS_RQ_PPQ	(((PRI_MAX_BATCH - PRI_MIN_BATCH) + 1) / RQ_NQS)
 /*
  * Add a thread to the actual run-queue.  Keeps transferable counts up to
  * date with what is actually on the run-queue.  Selects the correct
  * queue position for timeshare threads.
  */
 static __inline void
 tdq_runq_add(struct tdq *tdq, struct thread *td, int flags)
 {
 	struct td_sched *ts;
 	u_char pri;
 
 	TDQ_LOCK_ASSERT(tdq, MA_OWNED);
 	THREAD_LOCK_ASSERT(td, MA_OWNED);
 
 	pri = td->td_priority;
 	ts = td->td_sched;
 	TD_SET_RUNQ(td);
 	if (THREAD_CAN_MIGRATE(td)) {
 		tdq->tdq_transferable++;
 		ts->ts_flags |= TSF_XFERABLE;
 	}
 	if (pri < PRI_MIN_BATCH) {
 		ts->ts_runq = &tdq->tdq_realtime;
 	} else if (pri <= PRI_MAX_BATCH) {
 		ts->ts_runq = &tdq->tdq_timeshare;
 		KASSERT(pri <= PRI_MAX_BATCH && pri >= PRI_MIN_BATCH,
 			("Invalid priority %d on timeshare runq", pri));
 		/*
 		 * This queue contains only priorities between MIN and MAX
 		 * realtime.  Use the whole queue to represent these values.
 		 */
 		if ((flags & (SRQ_BORROWING|SRQ_PREEMPTED)) == 0) {
 			pri = (pri - PRI_MIN_BATCH) / TS_RQ_PPQ;
 			pri = (pri + tdq->tdq_idx) % RQ_NQS;
 			/*
 			 * This effectively shortens the queue by one so we
 			 * can have a one slot difference between idx and
 			 * ridx while we wait for threads to drain.
 			 */
 			if (tdq->tdq_ridx != tdq->tdq_idx &&
 			    pri == tdq->tdq_ridx)
 				pri = (unsigned char)(pri - 1) % RQ_NQS;
 		} else
 			pri = tdq->tdq_ridx;
 		runq_add_pri(ts->ts_runq, td, pri, flags);
 		return;
 	} else
 		ts->ts_runq = &tdq->tdq_idle;
 	runq_add(ts->ts_runq, td, flags);
 }
 
 /* 
  * Remove a thread from a run-queue.  This typically happens when a thread
  * is selected to run.  Running threads are not on the queue and the
  * transferable count does not reflect them.
  */
 static __inline void
 tdq_runq_rem(struct tdq *tdq, struct thread *td)
 {
 	struct td_sched *ts;
 
 	ts = td->td_sched;
 	TDQ_LOCK_ASSERT(tdq, MA_OWNED);
 	KASSERT(ts->ts_runq != NULL,
 	    ("tdq_runq_remove: thread %p null ts_runq", td));
 	if (ts->ts_flags & TSF_XFERABLE) {
 		tdq->tdq_transferable--;
 		ts->ts_flags &= ~TSF_XFERABLE;
 	}
 	if (ts->ts_runq == &tdq->tdq_timeshare) {
 		if (tdq->tdq_idx != tdq->tdq_ridx)
 			runq_remove_idx(ts->ts_runq, td, &tdq->tdq_ridx);
 		else
 			runq_remove_idx(ts->ts_runq, td, NULL);
 	} else
 		runq_remove(ts->ts_runq, td);
 }
 
 /*
  * Load is maintained for all threads RUNNING and ON_RUNQ.  Add the load
  * for this thread to the referenced thread queue.
  */
 static void
 tdq_load_add(struct tdq *tdq, struct thread *td)
 {
 
 	TDQ_LOCK_ASSERT(tdq, MA_OWNED);
 	THREAD_LOCK_ASSERT(td, MA_OWNED);
 
 	tdq->tdq_load++;
 	if ((td->td_flags & TDF_NOLOAD) == 0)
 		tdq->tdq_sysload++;
 	KTR_COUNTER0(KTR_SCHED, "load", tdq->tdq_loadname, tdq->tdq_load);
 }
 
 /*
  * Remove the load from a thread that is transitioning to a sleep state or
  * exiting.
  */
 static void
 tdq_load_rem(struct tdq *tdq, struct thread *td)
 {
 
 	THREAD_LOCK_ASSERT(td, MA_OWNED);
 	TDQ_LOCK_ASSERT(tdq, MA_OWNED);
 	KASSERT(tdq->tdq_load != 0,
 	    ("tdq_load_rem: Removing with 0 load on queue %d", TDQ_ID(tdq)));
 
 	tdq->tdq_load--;
 	if ((td->td_flags & TDF_NOLOAD) == 0)
 		tdq->tdq_sysload--;
 	KTR_COUNTER0(KTR_SCHED, "load", tdq->tdq_loadname, tdq->tdq_load);
 }
 
 /*
  * Set lowpri to its exact value by searching the run-queue and
  * evaluating curthread.  curthread may be passed as an optimization.
  */
 static void
 tdq_setlowpri(struct tdq *tdq, struct thread *ctd)
 {
 	struct thread *td;
 
 	TDQ_LOCK_ASSERT(tdq, MA_OWNED);
 	if (ctd == NULL)
 		ctd = pcpu_find(TDQ_ID(tdq))->pc_curthread;
 	td = tdq_choose(tdq);
 	if (td == NULL || td->td_priority > ctd->td_priority)
 		tdq->tdq_lowpri = ctd->td_priority;
 	else
 		tdq->tdq_lowpri = td->td_priority;
 }
 
 #ifdef SMP
 struct cpu_search {
 	cpuset_t cs_mask;
 	u_int	cs_load;
 	u_int	cs_cpu;
 	int	cs_limit;	/* Min priority for low min load for high. */
 };
 
 #define	CPU_SEARCH_LOWEST	0x1
 #define	CPU_SEARCH_HIGHEST	0x2
 #define	CPU_SEARCH_BOTH		(CPU_SEARCH_LOWEST|CPU_SEARCH_HIGHEST)
 
 #define	CPUSET_FOREACH(cpu, mask)				\
 	for ((cpu) = 0; (cpu) <= mp_maxid; (cpu)++)		\
-		if ((mask) & 1 << (cpu))
+		if (CPU_ISSET((cpu), &(mask)))
 
 static __inline int cpu_search(struct cpu_group *cg, struct cpu_search *low,
     struct cpu_search *high, const int match);
 int cpu_search_lowest(struct cpu_group *cg, struct cpu_search *low);
 int cpu_search_highest(struct cpu_group *cg, struct cpu_search *high);
 int cpu_search_both(struct cpu_group *cg, struct cpu_search *low,
     struct cpu_search *high);
 
 /*
  * This routine compares according to the match argument and should be
  * reduced in actual instantiations via constant propagation and dead code
  * elimination.
  */ 
 static __inline int
 cpu_compare(int cpu, struct cpu_search *low, struct cpu_search *high,
     const int match)
 {
 	struct tdq *tdq;
 
 	tdq = TDQ_CPU(cpu);
 	if (match & CPU_SEARCH_LOWEST)
 		if (CPU_ISSET(cpu, &low->cs_mask) &&
 		    tdq->tdq_load < low->cs_load &&
 		    tdq->tdq_lowpri > low->cs_limit) {
 			low->cs_cpu = cpu;
 			low->cs_load = tdq->tdq_load;
 		}
 	if (match & CPU_SEARCH_HIGHEST)
 		if (CPU_ISSET(cpu, &high->cs_mask) &&
 		    tdq->tdq_load >= high->cs_limit && 
 		    tdq->tdq_load > high->cs_load &&
 		    tdq->tdq_transferable) {
 			high->cs_cpu = cpu;
 			high->cs_load = tdq->tdq_load;
 		}
 	return (tdq->tdq_load);
 }
 
 /*
  * Search the tree of cpu_groups for the lowest or highest loaded cpu
  * according to the match argument.  This routine actually compares the
  * load on all paths through the tree and finds the least loaded cpu on
  * the least loaded path, which may differ from the least loaded cpu in
  * the system.  This balances work among caches and busses.
  *
  * This inline is instantiated in three forms below using constants for the
  * match argument.  It is reduced to the minimum set for each case.  It is
  * also recursive to the depth of the tree.
  */
 static __inline int
 cpu_search(struct cpu_group *cg, struct cpu_search *low,
     struct cpu_search *high, const int match)
 {
 	int total;
 
 	total = 0;
 	if (cg->cg_children) {
 		struct cpu_search lgroup;
 		struct cpu_search hgroup;
 		struct cpu_group *child;
 		u_int lload;
 		int hload;
 		int load;
 		int i;
 
 		lload = -1;
 		hload = -1;
 		for (i = 0; i < cg->cg_children; i++) {
 			child = &cg->cg_child[i];
 			if (match & CPU_SEARCH_LOWEST) {
 				lgroup = *low;
 				lgroup.cs_load = -1;
 			}
 			if (match & CPU_SEARCH_HIGHEST) {
 				hgroup = *high;
 				lgroup.cs_load = 0;
 			}
 			switch (match) {
 			case CPU_SEARCH_LOWEST:
 				load = cpu_search_lowest(child, &lgroup);
 				break;
 			case CPU_SEARCH_HIGHEST:
 				load = cpu_search_highest(child, &hgroup);
 				break;
 			case CPU_SEARCH_BOTH:
 				load = cpu_search_both(child, &lgroup, &hgroup);
 				break;
 			}
 			total += load;
 			if (match & CPU_SEARCH_LOWEST)
 				if (load < lload || low->cs_cpu == -1) {
 					*low = lgroup;
 					lload = load;
 				}
 			if (match & CPU_SEARCH_HIGHEST) 
 				if (load > hload || high->cs_cpu == -1) {
 					hload = load;
 					*high = hgroup;
 				}
 		}
 	} else {
 		int cpu;
 
 		CPUSET_FOREACH(cpu, cg->cg_mask)
 			total += cpu_compare(cpu, low, high, match);
 	}
 	return (total);
 }
 
 /*
  * cpu_search instantiations must pass constants to maintain the inline
  * optimization.
  */
 int
 cpu_search_lowest(struct cpu_group *cg, struct cpu_search *low)
 {
 	return cpu_search(cg, low, NULL, CPU_SEARCH_LOWEST);
 }
 
 int
 cpu_search_highest(struct cpu_group *cg, struct cpu_search *high)
 {
 	return cpu_search(cg, NULL, high, CPU_SEARCH_HIGHEST);
 }
 
 int
 cpu_search_both(struct cpu_group *cg, struct cpu_search *low,
     struct cpu_search *high)
 {
 	return cpu_search(cg, low, high, CPU_SEARCH_BOTH);
 }
 
 /*
  * Find the cpu with the least load via the least loaded path that has a
  * lowpri greater than pri  pri.  A pri of -1 indicates any priority is
  * acceptable.
  */
 static inline int
 sched_lowest(struct cpu_group *cg, cpuset_t mask, int pri)
 {
 	struct cpu_search low;
 
 	low.cs_cpu = -1;
 	low.cs_load = -1;
 	low.cs_mask = mask;
 	low.cs_limit = pri;
 	cpu_search_lowest(cg, &low);
 	return low.cs_cpu;
 }
 
 /*
  * Find the cpu with the highest load via the highest loaded path.
  */
 static inline int
 sched_highest(struct cpu_group *cg, cpuset_t mask, int minload)
 {
 	struct cpu_search high;
 
 	high.cs_cpu = -1;
 	high.cs_load = 0;
 	high.cs_mask = mask;
 	high.cs_limit = minload;
 	cpu_search_highest(cg, &high);
 	return high.cs_cpu;
 }
 
 /*
  * Simultaneously find the highest and lowest loaded cpu reachable via
  * cg.
  */
 static inline void 
 sched_both(struct cpu_group *cg, cpuset_t mask, int *lowcpu, int *highcpu)
 {
 	struct cpu_search high;
 	struct cpu_search low;
 
 	low.cs_cpu = -1;
 	low.cs_limit = -1;
 	low.cs_load = -1;
 	low.cs_mask = mask;
 	high.cs_load = 0;
 	high.cs_cpu = -1;
 	high.cs_limit = -1;
 	high.cs_mask = mask;
 	cpu_search_both(cg, &low, &high);
 	*lowcpu = low.cs_cpu;
 	*highcpu = high.cs_cpu;
 	return;
 }
 
 static void
 sched_balance_group(struct cpu_group *cg)
 {
 	cpuset_t mask;
 	int high;
 	int low;
 	int i;
 
 	CPU_FILL(&mask);
 	for (;;) {
 		sched_both(cg, mask, &low, &high);
 		if (low == high || low == -1 || high == -1)
 			break;
 		if (sched_balance_pair(TDQ_CPU(high), TDQ_CPU(low)))
 			break;
 		/*
 		 * If we failed to move any threads determine which cpu
 		 * to kick out of the set and try again.
 	 	 */
 		if (TDQ_CPU(high)->tdq_transferable == 0)
 			CPU_CLR(high, &mask);
 		else
 			CPU_CLR(low, &mask);
 	}
 
 	for (i = 0; i < cg->cg_children; i++)
 		sched_balance_group(&cg->cg_child[i]);
 }
 
 static void
 sched_balance(void)
 {
 	struct tdq *tdq;
 
 	/*
 	 * Select a random time between .5 * balance_interval and
 	 * 1.5 * balance_interval.
 	 */
 	balance_ticks = max(balance_interval / 2, 1);
 	balance_ticks += random() % balance_interval;
 	if (smp_started == 0 || rebalance == 0)
 		return;
 	tdq = TDQ_SELF();
 	TDQ_UNLOCK(tdq);
 	sched_balance_group(cpu_top);
 	TDQ_LOCK(tdq);
 }
 
 /*
  * Lock two thread queues using their address to maintain lock order.
  */
 static void
 tdq_lock_pair(struct tdq *one, struct tdq *two)
 {
 	if (one < two) {
 		TDQ_LOCK(one);
 		TDQ_LOCK_FLAGS(two, MTX_DUPOK);
 	} else {
 		TDQ_LOCK(two);
 		TDQ_LOCK_FLAGS(one, MTX_DUPOK);
 	}
 }
 
 /*
  * Unlock two thread queues.  Order is not important here.
  */
 static void
 tdq_unlock_pair(struct tdq *one, struct tdq *two)
 {
 	TDQ_UNLOCK(one);
 	TDQ_UNLOCK(two);
 }
 
 /*
  * Transfer load between two imbalanced thread queues.
  */
 static int
 sched_balance_pair(struct tdq *high, struct tdq *low)
 {
 	int transferable;
 	int high_load;
 	int low_load;
 	int moved;
 	int move;
 	int diff;
 	int i;
 
 	tdq_lock_pair(high, low);
 	transferable = high->tdq_transferable;
 	high_load = high->tdq_load;
 	low_load = low->tdq_load;
 	moved = 0;
 	/*
 	 * Determine what the imbalance is and then adjust that to how many
 	 * threads we actually have to give up (transferable).
 	 */
 	if (transferable != 0) {
 		diff = high_load - low_load;
 		move = diff / 2;
 		if (diff & 0x1)
 			move++;
 		move = min(move, transferable);
 		for (i = 0; i < move; i++)
 			moved += tdq_move(high, low);
 		/*
 		 * IPI the target cpu to force it to reschedule with the new
 		 * workload.
 		 */
 		ipi_cpu(TDQ_ID(low), IPI_PREEMPT);
 	}
 	tdq_unlock_pair(high, low);
 	return (moved);
 }
 
 /*
  * Move a thread from one thread queue to another.
  */
 static int
 tdq_move(struct tdq *from, struct tdq *to)
 {
 	struct td_sched *ts;
 	struct thread *td;
 	struct tdq *tdq;
 	int cpu;
 
 	TDQ_LOCK_ASSERT(from, MA_OWNED);
 	TDQ_LOCK_ASSERT(to, MA_OWNED);
 
 	tdq = from;
 	cpu = TDQ_ID(to);
 	td = tdq_steal(tdq, cpu);
 	if (td == NULL)
 		return (0);
 	ts = td->td_sched;
 	/*
 	 * Although the run queue is locked the thread may be blocked.  Lock
 	 * it to clear this and acquire the run-queue lock.
 	 */
 	thread_lock(td);
 	/* Drop recursive lock on from acquired via thread_lock(). */
 	TDQ_UNLOCK(from);
 	sched_rem(td);
 	ts->ts_cpu = cpu;
 	td->td_lock = TDQ_LOCKPTR(to);
 	tdq_add(to, td, SRQ_YIELDING);
 	return (1);
 }
 
 /*
  * This tdq has idled.  Try to steal a thread from another cpu and switch
  * to it.
  */
 static int
 tdq_idled(struct tdq *tdq)
 {
 	struct cpu_group *cg;
 	struct tdq *steal;
 	cpuset_t mask;
 	int thresh;
 	int cpu;
 
 	if (smp_started == 0 || steal_idle == 0)
 		return (1);
 	CPU_FILL(&mask);
 	CPU_CLR(PCPU_GET(cpuid), &mask);
 	/* We don't want to be preempted while we're iterating. */
 	spinlock_enter();
 	for (cg = tdq->tdq_cg; cg != NULL; ) {
 		if ((cg->cg_flags & CG_FLAG_THREAD) == 0)
 			thresh = steal_thresh;
 		else
 			thresh = 1;
 		cpu = sched_highest(cg, mask, thresh);
 		if (cpu == -1) {
 			cg = cg->cg_parent;
 			continue;
 		}
 		steal = TDQ_CPU(cpu);
 		CPU_CLR(cpu, &mask);
 		tdq_lock_pair(tdq, steal);
 		if (steal->tdq_load < thresh || steal->tdq_transferable == 0) {
 			tdq_unlock_pair(tdq, steal);
 			continue;
 		}
 		/*
 		 * If a thread was added while interrupts were disabled don't
 		 * steal one here.  If we fail to acquire one due to affinity
 		 * restrictions loop again with this cpu removed from the
 		 * set.
 		 */
 		if (tdq->tdq_load == 0 && tdq_move(steal, tdq) == 0) {
 			tdq_unlock_pair(tdq, steal);
 			continue;
 		}
 		spinlock_exit();
 		TDQ_UNLOCK(steal);
 		mi_switch(SW_VOL | SWT_IDLE, NULL);
 		thread_unlock(curthread);
 
 		return (0);
 	}
 	spinlock_exit();
 	return (1);
 }
 
 /*
  * Notify a remote cpu of new work.  Sends an IPI if criteria are met.
  */
 static void
 tdq_notify(struct tdq *tdq, struct thread *td)
 {
 	struct thread *ctd;
 	int pri;
 	int cpu;
 
 	if (tdq->tdq_ipipending)
 		return;
 	cpu = td->td_sched->ts_cpu;
 	pri = td->td_priority;
 	ctd = pcpu_find(cpu)->pc_curthread;
 	if (!sched_shouldpreempt(pri, ctd->td_priority, 1))
 		return;
 	if (TD_IS_IDLETHREAD(ctd)) {
 		/*
 		 * If the MD code has an idle wakeup routine try that before
 		 * falling back to IPI.
 		 */
 		if (!tdq->tdq_cpu_idle || cpu_idle_wakeup(cpu))
 			return;
 	}
 	tdq->tdq_ipipending = 1;
 	ipi_cpu(cpu, IPI_PREEMPT);
 }
 
 /*
  * Steals load from a timeshare queue.  Honors the rotating queue head
  * index.
  */
 static struct thread *
 runq_steal_from(struct runq *rq, int cpu, u_char start)
 {
 	struct rqbits *rqb;
 	struct rqhead *rqh;
 	struct thread *td;
 	int first;
 	int bit;
 	int pri;
 	int i;
 
 	rqb = &rq->rq_status;
 	bit = start & (RQB_BPW - 1);
 	pri = 0;
 	first = 0;
 again:
 	for (i = RQB_WORD(start); i < RQB_LEN; bit = 0, i++) {
 		if (rqb->rqb_bits[i] == 0)
 			continue;
 		if (bit != 0) {
 			for (pri = bit; pri < RQB_BPW; pri++)
 				if (rqb->rqb_bits[i] & (1ul << pri))
 					break;
 			if (pri >= RQB_BPW)
 				continue;
 		} else
 			pri = RQB_FFS(rqb->rqb_bits[i]);
 		pri += (i << RQB_L2BPW);
 		rqh = &rq->rq_queues[pri];
 		TAILQ_FOREACH(td, rqh, td_runq) {
 			if (first && THREAD_CAN_MIGRATE(td) &&
 			    THREAD_CAN_SCHED(td, cpu))
 				return (td);
 			first = 1;
 		}
 	}
 	if (start != 0) {
 		start = 0;
 		goto again;
 	}
 
 	return (NULL);
 }
 
 /*
  * Steals load from a standard linear queue.
  */
 static struct thread *
 runq_steal(struct runq *rq, int cpu)
 {
 	struct rqhead *rqh;
 	struct rqbits *rqb;
 	struct thread *td;
 	int word;
 	int bit;
 
 	rqb = &rq->rq_status;
 	for (word = 0; word < RQB_LEN; word++) {
 		if (rqb->rqb_bits[word] == 0)
 			continue;
 		for (bit = 0; bit < RQB_BPW; bit++) {
 			if ((rqb->rqb_bits[word] & (1ul << bit)) == 0)
 				continue;
 			rqh = &rq->rq_queues[bit + (word << RQB_L2BPW)];
 			TAILQ_FOREACH(td, rqh, td_runq)
 				if (THREAD_CAN_MIGRATE(td) &&
 				    THREAD_CAN_SCHED(td, cpu))
 					return (td);
 		}
 	}
 	return (NULL);
 }
 
 /*
  * Attempt to steal a thread in priority order from a thread queue.
  */
 static struct thread *
 tdq_steal(struct tdq *tdq, int cpu)
 {
 	struct thread *td;
 
 	TDQ_LOCK_ASSERT(tdq, MA_OWNED);
 	if ((td = runq_steal(&tdq->tdq_realtime, cpu)) != NULL)
 		return (td);
 	if ((td = runq_steal_from(&tdq->tdq_timeshare,
 	    cpu, tdq->tdq_ridx)) != NULL)
 		return (td);
 	return (runq_steal(&tdq->tdq_idle, cpu));
 }
 
 /*
  * Sets the thread lock and ts_cpu to match the requested cpu.  Unlocks the
  * current lock and returns with the assigned queue locked.
  */
 static inline struct tdq *
 sched_setcpu(struct thread *td, int cpu, int flags)
 {
 
 	struct tdq *tdq;
 
 	THREAD_LOCK_ASSERT(td, MA_OWNED);
 	tdq = TDQ_CPU(cpu);
 	td->td_sched->ts_cpu = cpu;
 	/*
 	 * If the lock matches just return the queue.
 	 */
 	if (td->td_lock == TDQ_LOCKPTR(tdq))
 		return (tdq);
 #ifdef notyet
 	/*
 	 * If the thread isn't running its lockptr is a
 	 * turnstile or a sleepqueue.  We can just lock_set without
 	 * blocking.
 	 */
 	if (TD_CAN_RUN(td)) {
 		TDQ_LOCK(tdq);
 		thread_lock_set(td, TDQ_LOCKPTR(tdq));
 		return (tdq);
 	}
 #endif
 	/*
 	 * The hard case is migration: we need to block the thread first to
 	 * prevent lock order reversals with other cpus' locks.
 	 */
 	spinlock_enter();
 	thread_lock_block(td);
 	TDQ_LOCK(tdq);
 	thread_lock_unblock(td, TDQ_LOCKPTR(tdq));
 	spinlock_exit();
 	return (tdq);
 }
 
 SCHED_STAT_DEFINE(pickcpu_intrbind, "Soft interrupt binding");
 SCHED_STAT_DEFINE(pickcpu_idle_affinity, "Picked idle cpu based on affinity");
 SCHED_STAT_DEFINE(pickcpu_affinity, "Picked cpu based on affinity");
 SCHED_STAT_DEFINE(pickcpu_lowest, "Selected lowest load");
 SCHED_STAT_DEFINE(pickcpu_local, "Migrated to current cpu");
 SCHED_STAT_DEFINE(pickcpu_migration, "Selection may have caused migration");
 
 static int
 sched_pickcpu(struct thread *td, int flags)
 {
 	struct cpu_group *cg;
 	struct td_sched *ts;
 	struct tdq *tdq;
 	cpuset_t mask;
 	int self;
 	int pri;
 	int cpu;
 
 	self = PCPU_GET(cpuid);
 	ts = td->td_sched;
 	if (smp_started == 0)
 		return (self);
 	/*
 	 * Don't migrate a running thread from sched_switch().
 	 */
 	if ((flags & SRQ_OURSELF) || !THREAD_CAN_MIGRATE(td))
 		return (ts->ts_cpu);
 	/*
 	 * Prefer to run interrupt threads on the processors that generate
 	 * the interrupt.
 	 */
 	if (td->td_priority <= PRI_MAX_ITHD && THREAD_CAN_SCHED(td, self) &&
 	    curthread->td_intr_nesting_level && ts->ts_cpu != self) {
 		SCHED_STAT_INC(pickcpu_intrbind);
 		ts->ts_cpu = self;
 	}
 	/*
 	 * If the thread can run on the last cpu and the affinity has not
 	 * expired or it is idle run it there.
 	 */
 	pri = td->td_priority;
 	tdq = TDQ_CPU(ts->ts_cpu);
 	if (THREAD_CAN_SCHED(td, ts->ts_cpu)) {
 		if (tdq->tdq_lowpri > PRI_MIN_IDLE) {
 			SCHED_STAT_INC(pickcpu_idle_affinity);
 			return (ts->ts_cpu);
 		}
 		if (SCHED_AFFINITY(ts, CG_SHARE_L2) && tdq->tdq_lowpri > pri) {
 			SCHED_STAT_INC(pickcpu_affinity);
 			return (ts->ts_cpu);
 		}
 	}
 	/*
 	 * Search for the highest level in the tree that still has affinity.
 	 */
 	cg = NULL;
 	for (cg = tdq->tdq_cg; cg != NULL; cg = cg->cg_parent)
 		if (SCHED_AFFINITY(ts, cg->cg_level))
 			break;
 	cpu = -1;
 	mask = td->td_cpuset->cs_mask;
 	if (cg)
 		cpu = sched_lowest(cg, mask, pri);
 	if (cpu == -1)
 		cpu = sched_lowest(cpu_top, mask, -1);
 	/*
 	 * Compare the lowest loaded cpu to current cpu.
 	 */
 	if (THREAD_CAN_SCHED(td, self) && TDQ_CPU(self)->tdq_lowpri > pri &&
 	    TDQ_CPU(cpu)->tdq_lowpri < PRI_MIN_IDLE) {
 		SCHED_STAT_INC(pickcpu_local);
 		cpu = self;
 	} else
 		SCHED_STAT_INC(pickcpu_lowest);
 	if (cpu != ts->ts_cpu)
 		SCHED_STAT_INC(pickcpu_migration);
 	KASSERT(cpu != -1, ("sched_pickcpu: Failed to find a cpu."));
 	return (cpu);
 }
 #endif
 
 /*
  * Pick the highest priority task we have and return it.
  */
 static struct thread *
 tdq_choose(struct tdq *tdq)
 {
 	struct thread *td;
 
 	TDQ_LOCK_ASSERT(tdq, MA_OWNED);
 	td = runq_choose(&tdq->tdq_realtime);
 	if (td != NULL)
 		return (td);
 	td = runq_choose_from(&tdq->tdq_timeshare, tdq->tdq_ridx);
 	if (td != NULL) {
 		KASSERT(td->td_priority >= PRI_MIN_BATCH,
 		    ("tdq_choose: Invalid priority on timeshare queue %d",
 		    td->td_priority));
 		return (td);
 	}
 	td = runq_choose(&tdq->tdq_idle);
 	if (td != NULL) {
 		KASSERT(td->td_priority >= PRI_MIN_IDLE,
 		    ("tdq_choose: Invalid priority on idle queue %d",
 		    td->td_priority));
 		return (td);
 	}
 
 	return (NULL);
 }
 
 /*
  * Initialize a thread queue.
  */
 static void
 tdq_setup(struct tdq *tdq)
 {
 
 	if (bootverbose)
 		printf("ULE: setup cpu %d\n", TDQ_ID(tdq));
 	runq_init(&tdq->tdq_realtime);
 	runq_init(&tdq->tdq_timeshare);
 	runq_init(&tdq->tdq_idle);
 	snprintf(tdq->tdq_name, sizeof(tdq->tdq_name),
 	    "sched lock %d", (int)TDQ_ID(tdq));
 	mtx_init(&tdq->tdq_lock, tdq->tdq_name, "sched lock",
 	    MTX_SPIN | MTX_RECURSE);
 #ifdef KTR
 	snprintf(tdq->tdq_loadname, sizeof(tdq->tdq_loadname),
 	    "CPU %d load", (int)TDQ_ID(tdq));
 #endif
 }
 
 #ifdef SMP
 static void
 sched_setup_smp(void)
 {
 	struct tdq *tdq;
 	int i;
 
 	cpu_top = smp_topo();
 	CPU_FOREACH(i) {
 		tdq = TDQ_CPU(i);
 		tdq_setup(tdq);
 		tdq->tdq_cg = smp_topo_find(cpu_top, i);
 		if (tdq->tdq_cg == NULL)
 			panic("Can't find cpu group for %d\n", i);
 	}
 	balance_tdq = TDQ_SELF();
 	sched_balance();
 }
 #endif
 
 /*
  * Setup the thread queues and initialize the topology based on MD
  * information.
  */
 static void
 sched_setup(void *dummy)
 {
 	struct tdq *tdq;
 
 	tdq = TDQ_SELF();
 #ifdef SMP
 	sched_setup_smp();
 #else
 	tdq_setup(tdq);
 #endif
 	/*
 	 * To avoid divide-by-zero, we set realstathz to a dummy value
 	 * in case sched_clock() is called before sched_initticks().
 	 */
 	realstathz = hz;
 	sched_slice = (realstathz/10);	/* ~100ms */
 	tickincr = 1 << SCHED_TICK_SHIFT;
 
 	/* Add thread0's load since it's running. */
 	TDQ_LOCK(tdq);
 	thread0.td_lock = TDQ_LOCKPTR(TDQ_SELF());
 	tdq_load_add(tdq, &thread0);
 	tdq->tdq_lowpri = thread0.td_priority;
 	TDQ_UNLOCK(tdq);
 }
 
 /*
  * This routine determines the tickincr after stathz and hz are setup.
  */
 /* ARGSUSED */
 static void
 sched_initticks(void *dummy)
 {
 	int incr;
 
 	realstathz = stathz ? stathz : hz;
 	sched_slice = (realstathz/10);	/* ~100ms */
 
 	/*
 	 * tickincr is shifted out by 10 to avoid rounding errors due to
 	 * hz not being evenly divisible by stathz on all platforms.
 	 */
 	incr = (hz << SCHED_TICK_SHIFT) / realstathz;
 	/*
 	 * This does not work for values of stathz that are more than
 	 * (1 << SCHED_TICK_SHIFT) * hz.  In practice this does not happen.
 	 */
 	if (incr == 0)
 		incr = 1;
 	tickincr = incr;
 #ifdef SMP
 	/*
 	 * Set the default balance interval now that we know
 	 * what realstathz is.
 	 */
 	balance_interval = realstathz;
 	/*
 	 * Set steal thresh to roughly log2(mp_ncpus) but no greater than 3.
 	 * This prevents excess thrashing on large machines and excess idle
 	 * on smaller machines.
 	 */
 	steal_thresh = min(fls(mp_ncpus) - 1, 3);
 	affinity = SCHED_AFFINITY_DEFAULT;
 #endif
 }
 
 
 /*
  * This is the core of the interactivity algorithm.  Determines a score based
  * on past behavior.  It is the ratio of sleep time to run time scaled to
  * a [0, 100] integer.  This is the voluntary sleep time of a process, which
  * differs from the cpu usage because it does not account for time spent
  * waiting on a run-queue.  Would be prettier if we had floating point.
  */
 static int
 sched_interact_score(struct thread *td)
 {
 	struct td_sched *ts;
 	int div;
 
 	ts = td->td_sched;
 	/*
 	 * The score is only needed if this is likely to be an interactive
 	 * task.  Don't go through the expense of computing it if there's
 	 * no chance.
 	 */
 	if (sched_interact <= SCHED_INTERACT_HALF &&
 	    ts->ts_runtime >= ts->ts_slptime)
 		return (SCHED_INTERACT_HALF);
 
 	if (ts->ts_runtime > ts->ts_slptime) {
 		div = max(1, ts->ts_runtime / SCHED_INTERACT_HALF);
 		return (SCHED_INTERACT_HALF +
 		    (SCHED_INTERACT_HALF - (ts->ts_slptime / div)));
 	}
 	if (ts->ts_slptime > ts->ts_runtime) {
 		div = max(1, ts->ts_slptime / SCHED_INTERACT_HALF);
 		return (ts->ts_runtime / div);
 	}
 	/* runtime == slptime */
 	if (ts->ts_runtime)
 		return (SCHED_INTERACT_HALF);
 
 	/*
 	 * This can happen if slptime and runtime are 0.
 	 */
 	return (0);
 
 }
 
 /*
  * Scale the scheduling priority according to the "interactivity" of this
  * process.
  */
 static void
 sched_priority(struct thread *td)
 {
 	int score;
 	int pri;
 
 	if (PRI_BASE(td->td_pri_class) != PRI_TIMESHARE)
 		return;
 	/*
 	 * If the score is interactive we place the thread in the realtime
 	 * queue with a priority that is less than kernel and interrupt
 	 * priorities.  These threads are not subject to nice restrictions.
 	 *
 	 * Scores greater than this are placed on the normal timeshare queue
 	 * where the priority is partially decided by the most recent cpu
 	 * utilization and the rest is decided by nice value.
 	 *
 	 * The nice value of the process has a linear effect on the calculated
 	 * score.  Negative nice values make it easier for a thread to be
 	 * considered interactive.
 	 */
 	score = imax(0, sched_interact_score(td) + td->td_proc->p_nice);
 	if (score < sched_interact) {
 		pri = PRI_MIN_INTERACT;
 		pri += ((PRI_MAX_INTERACT - PRI_MIN_INTERACT + 1) /
 		    sched_interact) * score;
 		KASSERT(pri >= PRI_MIN_INTERACT && pri <= PRI_MAX_INTERACT,
 		    ("sched_priority: invalid interactive priority %d score %d",
 		    pri, score));
 	} else {
 		pri = SCHED_PRI_MIN;
 		if (td->td_sched->ts_ticks)
 			pri += SCHED_PRI_TICKS(td->td_sched);
 		pri += SCHED_PRI_NICE(td->td_proc->p_nice);
 		KASSERT(pri >= PRI_MIN_BATCH && pri <= PRI_MAX_BATCH,
 		    ("sched_priority: invalid priority %d: nice %d, " 
 		    "ticks %d ftick %d ltick %d tick pri %d",
 		    pri, td->td_proc->p_nice, td->td_sched->ts_ticks,
 		    td->td_sched->ts_ftick, td->td_sched->ts_ltick,
 		    SCHED_PRI_TICKS(td->td_sched)));
 	}
 	sched_user_prio(td, pri);
 
 	return;
 }
 
 /*
  * This routine enforces a maximum limit on the amount of scheduling history
  * kept.  It is called after either the slptime or runtime is adjusted.  This
  * function is ugly due to integer math.
  */
 static void
 sched_interact_update(struct thread *td)
 {
 	struct td_sched *ts;
 	u_int sum;
 
 	ts = td->td_sched;
 	sum = ts->ts_runtime + ts->ts_slptime;
 	if (sum < SCHED_SLP_RUN_MAX)
 		return;
 	/*
 	 * This only happens from two places:
 	 * 1) We have added an unusual amount of run time from fork_exit.
 	 * 2) We have added an unusual amount of sleep time from sched_sleep().
 	 */
 	if (sum > SCHED_SLP_RUN_MAX * 2) {
 		if (ts->ts_runtime > ts->ts_slptime) {
 			ts->ts_runtime = SCHED_SLP_RUN_MAX;
 			ts->ts_slptime = 1;
 		} else {
 			ts->ts_slptime = SCHED_SLP_RUN_MAX;
 			ts->ts_runtime = 1;
 		}
 		return;
 	}
 	/*
 	 * If we have exceeded by more than 1/5th then the algorithm below
 	 * will not bring us back into range.  Dividing by two here brings
 	 * the sum back to at most SCHED_SLP_RUN_MAX.
 	 */
 	if (sum > (SCHED_SLP_RUN_MAX / 5) * 6) {
 		ts->ts_runtime /= 2;
 		ts->ts_slptime /= 2;
 		return;
 	}
 	ts->ts_runtime = (ts->ts_runtime / 5) * 4;
 	ts->ts_slptime = (ts->ts_slptime / 5) * 4;
 }
 
 /*
  * Scale back the interactivity history when a child thread is created.  The
  * history is inherited from the parent but the thread may behave totally
  * differently.  For example, a shell spawning a compiler process.  We want
  * to learn that the compiler is behaving badly very quickly.
  */
 static void
 sched_interact_fork(struct thread *td)
 {
 	int ratio;
 	int sum;
 
 	sum = td->td_sched->ts_runtime + td->td_sched->ts_slptime;
 	if (sum > SCHED_SLP_RUN_FORK) {
 		ratio = sum / SCHED_SLP_RUN_FORK;
 		td->td_sched->ts_runtime /= ratio;
 		td->td_sched->ts_slptime /= ratio;
 	}
 }
 
 /*
  * Called from proc0_init() to setup the scheduler fields.
  */
 void
 schedinit(void)
 {
 
 	/*
 	 * Set up the scheduler specific parts of proc0.
 	 */
 	proc0.p_sched = NULL; /* XXX */
 	thread0.td_sched = &td_sched0;
 	td_sched0.ts_ltick = ticks;
 	td_sched0.ts_ftick = ticks;
 	td_sched0.ts_slice = sched_slice;
 }
 
 /*
  * This is only somewhat accurate since given many processes of the same
  * priority they will switch when their slices run out, which will be
  * at most sched_slice stathz ticks.
  */
 int
 sched_rr_interval(void)
 {
 
 	/* Convert sched_slice to hz */
 	return (hz/(realstathz/sched_slice));
 }
 
 /*
  * Update the percent cpu tracking information when it is requested or
  * the total history exceeds the maximum.  We keep a sliding history of
  * tick counts that slowly decays.  This is less precise than the 4BSD
  * mechanism since it happens with less regular and frequent events.
  */
 static void
 sched_pctcpu_update(struct td_sched *ts)
 {
 
 	if (ts->ts_ticks == 0)
 		return;
 	if (ticks - (hz / 10) < ts->ts_ltick &&
 	    SCHED_TICK_TOTAL(ts) < SCHED_TICK_MAX)
 		return;
 	/*
 	 * Adjust counters and watermark for pctcpu calc.
 	 */
 	if (ts->ts_ltick > ticks - SCHED_TICK_TARG)
 		ts->ts_ticks = (ts->ts_ticks / (ticks - ts->ts_ftick)) *
 			    SCHED_TICK_TARG;
 	else
 		ts->ts_ticks = 0;
 	ts->ts_ltick = ticks;
 	ts->ts_ftick = ts->ts_ltick - SCHED_TICK_TARG;
 }
 
 /*
  * Adjust the priority of a thread.  Move it to the appropriate run-queue
  * if necessary.  This is the back-end for several priority related
  * functions.
  */
 static void
 sched_thread_priority(struct thread *td, u_char prio)
 {
 	struct td_sched *ts;
 	struct tdq *tdq;
 	int oldpri;
 
 	KTR_POINT3(KTR_SCHED, "thread", sched_tdname(td), "prio",
 	    "prio:%d", td->td_priority, "new prio:%d", prio,
 	    KTR_ATTR_LINKED, sched_tdname(curthread));
 	if (td != curthread && prio > td->td_priority) {
 		KTR_POINT3(KTR_SCHED, "thread", sched_tdname(curthread),
 		    "lend prio", "prio:%d", td->td_priority, "new prio:%d",
 		    prio, KTR_ATTR_LINKED, sched_tdname(td));
 	}
 	ts = td->td_sched;
 	THREAD_LOCK_ASSERT(td, MA_OWNED);
 	if (td->td_priority == prio)
 		return;
 	/*
 	 * If the priority has been elevated due to priority
 	 * propagation, we may have to move ourselves to a new
 	 * queue.  This could be optimized to not re-add in some
 	 * cases.
 	 */
 	if (TD_ON_RUNQ(td) && prio < td->td_priority) {
 		sched_rem(td);
 		td->td_priority = prio;
 		sched_add(td, SRQ_BORROWING);
 		return;
 	}
 	/*
 	 * If the thread is currently running we may have to adjust the lowpri
 	 * information so other cpus are aware of our current priority.
 	 */
 	if (TD_IS_RUNNING(td)) {
 		tdq = TDQ_CPU(ts->ts_cpu);
 		oldpri = td->td_priority;
 		td->td_priority = prio;
 		if (prio < tdq->tdq_lowpri)
 			tdq->tdq_lowpri = prio;
 		else if (tdq->tdq_lowpri == oldpri)
 			tdq_setlowpri(tdq, td);
 		return;
 	}
 	td->td_priority = prio;
 }
 
 /*
  * Update a thread's priority when it is lent another thread's
  * priority.
  */
 void
 sched_lend_prio(struct thread *td, u_char prio)
 {
 
 	td->td_flags |= TDF_BORROWING;
 	sched_thread_priority(td, prio);
 }
 
 /*
  * Restore a thread's priority when priority propagation is
  * over.  The prio argument is the minimum priority the thread
  * needs to have to satisfy other possible priority lending
  * requests.  If the thread's regular priority is less
  * important than prio, the thread will keep a priority boost
  * of prio.
  */
 void
 sched_unlend_prio(struct thread *td, u_char prio)
 {
 	u_char base_pri;
 
 	if (td->td_base_pri >= PRI_MIN_TIMESHARE &&
 	    td->td_base_pri <= PRI_MAX_TIMESHARE)
 		base_pri = td->td_user_pri;
 	else
 		base_pri = td->td_base_pri;
 	if (prio >= base_pri) {
 		td->td_flags &= ~TDF_BORROWING;
 		sched_thread_priority(td, base_pri);
 	} else
 		sched_lend_prio(td, prio);
 }
 
 /*
  * Standard entry for setting the priority to an absolute value.
  */
 void
 sched_prio(struct thread *td, u_char prio)
 {
 	u_char oldprio;
 
 	/* First, update the base priority. */
 	td->td_base_pri = prio;
 
 	/*
 	 * If the thread is borrowing another thread's priority, don't
 	 * ever lower the priority.
 	 */
 	if (td->td_flags & TDF_BORROWING && td->td_priority < prio)
 		return;
 
 	/* Change the real priority. */
 	oldprio = td->td_priority;
 	sched_thread_priority(td, prio);
 
 	/*
 	 * If the thread is on a turnstile, then let the turnstile update
 	 * its state.
 	 */
 	if (TD_ON_LOCK(td) && oldprio != prio)
 		turnstile_adjust(td, oldprio);
 }
 
 /*
  * Set the base user priority, does not affect current running priority.
  */
 void
 sched_user_prio(struct thread *td, u_char prio)
 {
 
 	td->td_base_user_pri = prio;
 	if (td->td_lend_user_pri <= prio)
 		return;
 	td->td_user_pri = prio;
 }
 
 void
 sched_lend_user_prio(struct thread *td, u_char prio)
 {
 
 	THREAD_LOCK_ASSERT(td, MA_OWNED);
 	td->td_lend_user_pri = prio;
 	td->td_user_pri = min(prio, td->td_base_user_pri);
 	if (td->td_priority > td->td_user_pri)
 		sched_prio(td, td->td_user_pri);
 	else if (td->td_priority != td->td_user_pri)
 		td->td_flags |= TDF_NEEDRESCHED;
 }
 
 /*
  * Handle migration from sched_switch().  This happens only for
  * cpu binding.
  */
 static struct mtx *
 sched_switch_migrate(struct tdq *tdq, struct thread *td, int flags)
 {
 	struct tdq *tdn;
 
 	tdn = TDQ_CPU(td->td_sched->ts_cpu);
 #ifdef SMP
 	tdq_load_rem(tdq, td);
 	/*
 	 * Do the lock dance required to avoid LOR.  We grab an extra
 	 * spinlock nesting to prevent preemption while we're
 	 * not holding either run-queue lock.
 	 */
 	spinlock_enter();
 	thread_lock_block(td);	/* This releases the lock on tdq. */
 
 	/*
 	 * Acquire both run-queue locks before placing the thread on the new
 	 * run-queue to avoid deadlocks created by placing a thread with a
 	 * blocked lock on the run-queue of a remote processor.  The deadlock
 	 * occurs when a third processor attempts to lock the two queues in
 	 * question while the target processor is spinning with its own
 	 * run-queue lock held while waiting for the blocked lock to clear.
 	 */
 	tdq_lock_pair(tdn, tdq);
 	tdq_add(tdn, td, flags);
 	tdq_notify(tdn, td);
 	TDQ_UNLOCK(tdn);
 	spinlock_exit();
 #endif
 	return (TDQ_LOCKPTR(tdn));
 }
 
 /*
  * Variant of thread_lock_unblock() that does not assume td_lock
  * is blocked.
  */
 static inline void
 thread_unblock_switch(struct thread *td, struct mtx *mtx)
 {
 	atomic_store_rel_ptr((volatile uintptr_t *)&td->td_lock,
 	    (uintptr_t)mtx);
 }
 
 /*
  * Switch threads.  This function has to handle threads coming in while
  * blocked for some reason, running, or idle.  It also must deal with
  * migrating a thread from one queue to another as running threads may
  * be assigned elsewhere via binding.
  */
 void
 sched_switch(struct thread *td, struct thread *newtd, int flags)
 {
 	struct tdq *tdq;
 	struct td_sched *ts;
 	struct mtx *mtx;
 	int srqflag;
 	int cpuid;
 
 	THREAD_LOCK_ASSERT(td, MA_OWNED);
 	KASSERT(newtd == NULL, ("sched_switch: Unsupported newtd argument"));
 
 	cpuid = PCPU_GET(cpuid);
 	tdq = TDQ_CPU(cpuid);
 	ts = td->td_sched;
 	mtx = td->td_lock;
 	ts->ts_rltick = ticks;
 	td->td_lastcpu = td->td_oncpu;
 	td->td_oncpu = NOCPU;
 	if (!(flags & SW_PREEMPT))
 		td->td_flags &= ~TDF_NEEDRESCHED;
 	td->td_owepreempt = 0;
 	tdq->tdq_switchcnt++;
 	/*
 	 * The lock pointer in an idle thread should never change.  Reset it
 	 * to CAN_RUN as well.
 	 */
 	if (TD_IS_IDLETHREAD(td)) {
 		MPASS(td->td_lock == TDQ_LOCKPTR(tdq));
 		TD_SET_CAN_RUN(td);
 	} else if (TD_IS_RUNNING(td)) {
 		MPASS(td->td_lock == TDQ_LOCKPTR(tdq));
 		srqflag = (flags & SW_PREEMPT) ?
 		    SRQ_OURSELF|SRQ_YIELDING|SRQ_PREEMPTED :
 		    SRQ_OURSELF|SRQ_YIELDING;
 #ifdef SMP
 		if (THREAD_CAN_MIGRATE(td) && !THREAD_CAN_SCHED(td, ts->ts_cpu))
 			ts->ts_cpu = sched_pickcpu(td, 0);
 #endif
 		if (ts->ts_cpu == cpuid)
 			tdq_runq_add(tdq, td, srqflag);
 		else {
 			KASSERT(THREAD_CAN_MIGRATE(td) ||
 			    (ts->ts_flags & TSF_BOUND) != 0,
 			    ("Thread %p shouldn't migrate", td));
 			mtx = sched_switch_migrate(tdq, td, srqflag);
 		}
 	} else {
 		/* This thread must be going to sleep. */
 		TDQ_LOCK(tdq);
 		mtx = thread_lock_block(td);
 		tdq_load_rem(tdq, td);
 	}
 	/*
 	 * We enter here with the thread blocked and assigned to the
 	 * appropriate cpu run-queue or sleep-queue and with the current
 	 * thread-queue locked.
 	 */
 	TDQ_LOCK_ASSERT(tdq, MA_OWNED | MA_NOTRECURSED);
 	newtd = choosethread();
 	/*
 	 * Call the MD code to switch contexts if necessary.
 	 */
 	if (td != newtd) {
 #ifdef	HWPMC_HOOKS
 		if (PMC_PROC_IS_USING_PMCS(td->td_proc))
 			PMC_SWITCH_CONTEXT(td, PMC_FN_CSW_OUT);
 #endif
 		lock_profile_release_lock(&TDQ_LOCKPTR(tdq)->lock_object);
 		TDQ_LOCKPTR(tdq)->mtx_lock = (uintptr_t)newtd;
 
 #ifdef KDTRACE_HOOKS
 		/*
 		 * If DTrace has set the active vtime enum to anything
 		 * other than INACTIVE (0), then it should have set the
 		 * function to call.
 		 */
 		if (dtrace_vtime_active)
 			(*dtrace_vtime_switch_func)(newtd);
 #endif
 
 		cpu_switch(td, newtd, mtx);
 		/*
 		 * We may return from cpu_switch on a different cpu.  However,
 		 * we always return with td_lock pointing to the current cpu's
 		 * run queue lock.
 		 */
 		cpuid = PCPU_GET(cpuid);
 		tdq = TDQ_CPU(cpuid);
 		lock_profile_obtain_lock_success(
 		    &TDQ_LOCKPTR(tdq)->lock_object, 0, 0, __FILE__, __LINE__);
 #ifdef	HWPMC_HOOKS
 		if (PMC_PROC_IS_USING_PMCS(td->td_proc))
 			PMC_SWITCH_CONTEXT(td, PMC_FN_CSW_IN);
 #endif
 	} else
 		thread_unblock_switch(td, mtx);
 	/*
 	 * Assert that all went well and return.
 	 */
 	TDQ_LOCK_ASSERT(tdq, MA_OWNED|MA_NOTRECURSED);
 	MPASS(td->td_lock == TDQ_LOCKPTR(tdq));
 	td->td_oncpu = cpuid;
 }
 
 /*
  * Adjust thread priorities as a result of a nice request.
  */
 void
 sched_nice(struct proc *p, int nice)
 {
 	struct thread *td;
 
 	PROC_LOCK_ASSERT(p, MA_OWNED);
 
 	p->p_nice = nice;
 	FOREACH_THREAD_IN_PROC(p, td) {
 		thread_lock(td);
 		sched_priority(td);
 		sched_prio(td, td->td_base_user_pri);
 		thread_unlock(td);
 	}
 }
 
 /*
  * Record the sleep time for the interactivity scorer.
  */
 void
 sched_sleep(struct thread *td, int prio)
 {
 
 	THREAD_LOCK_ASSERT(td, MA_OWNED);
 
 	td->td_slptick = ticks;
 	if (TD_IS_SUSPENDED(td) || prio >= PSOCK)
 		td->td_flags |= TDF_CANSWAP;
 	if (PRI_BASE(td->td_pri_class) != PRI_TIMESHARE)
 		return;
 	if (static_boost == 1 && prio)
 		sched_prio(td, prio);
 	else if (static_boost && td->td_priority > static_boost)
 		sched_prio(td, static_boost);
 }
 
 /*
  * Schedule a thread to resume execution and record how long it voluntarily
  * slept.  We also update the pctcpu, interactivity, and priority.
  */
 void
 sched_wakeup(struct thread *td)
 {
 	struct td_sched *ts;
 	int slptick;
 
 	THREAD_LOCK_ASSERT(td, MA_OWNED);
 	ts = td->td_sched;
 	td->td_flags &= ~TDF_CANSWAP;
 	/*
 	 * If we slept for more than a tick update our interactivity and
 	 * priority.
 	 */
 	slptick = td->td_slptick;
 	td->td_slptick = 0;
 	if (slptick && slptick != ticks) {
 		u_int hzticks;
 
 		hzticks = (ticks - slptick) << SCHED_TICK_SHIFT;
 		ts->ts_slptime += hzticks;
 		sched_interact_update(td);
 		sched_pctcpu_update(ts);
 	}
 	/* Reset the slice value after we sleep. */
 	ts->ts_slice = sched_slice;
 	sched_add(td, SRQ_BORING);
 }
 
 /*
  * Penalize the parent for creating a new child and initialize the child's
  * priority.
  */
 void
 sched_fork(struct thread *td, struct thread *child)
 {
 	THREAD_LOCK_ASSERT(td, MA_OWNED);
 	sched_fork_thread(td, child);
 	/*
 	 * Penalize the parent and child for forking.
 	 */
 	sched_interact_fork(child);
 	sched_priority(child);
 	td->td_sched->ts_runtime += tickincr;
 	sched_interact_update(td);
 	sched_priority(td);
 }
 
 /*
  * Fork a new thread, may be within the same process.
  */
 void
 sched_fork_thread(struct thread *td, struct thread *child)
 {
 	struct td_sched *ts;
 	struct td_sched *ts2;
 
 	THREAD_LOCK_ASSERT(td, MA_OWNED);
 	/*
 	 * Initialize child.
 	 */
 	ts = td->td_sched;
 	ts2 = child->td_sched;
 	child->td_lock = TDQ_LOCKPTR(TDQ_SELF());
 	child->td_cpuset = cpuset_ref(td->td_cpuset);
 	ts2->ts_cpu = ts->ts_cpu;
 	ts2->ts_flags = 0;
 	/*
 	 * Grab our parent's cpu estimation information.
 	 */
 	ts2->ts_ticks = ts->ts_ticks;
 	ts2->ts_ltick = ts->ts_ltick;
 	ts2->ts_incrtick = ts->ts_incrtick;
 	ts2->ts_ftick = ts->ts_ftick;
 	/*
 	 * Do not inherit any borrowed priority from the parent.
 	 */
 	child->td_priority = child->td_base_pri;
 	/*
 	 * And update interactivity score.
 	 */
 	ts2->ts_slptime = ts->ts_slptime;
 	ts2->ts_runtime = ts->ts_runtime;
 	ts2->ts_slice = 1;	/* Attempt to quickly learn interactivity. */
 #ifdef KTR
 	bzero(ts2->ts_name, sizeof(ts2->ts_name));
 #endif
 }
 
 /*
  * Adjust the priority class of a thread.
  */
 void
 sched_class(struct thread *td, int class)
 {
 
 	THREAD_LOCK_ASSERT(td, MA_OWNED);
 	if (td->td_pri_class == class)
 		return;
 	td->td_pri_class = class;
 }
 
 /*
  * Return some of the child's priority and interactivity to the parent.
  */
 void
 sched_exit(struct proc *p, struct thread *child)
 {
 	struct thread *td;
 
 	KTR_STATE1(KTR_SCHED, "thread", sched_tdname(child), "proc exit",
 	    "prio:%d", child->td_priority);
 	PROC_LOCK_ASSERT(p, MA_OWNED);
 	td = FIRST_THREAD_IN_PROC(p);
 	sched_exit_thread(td, child);
 }
 
 /*
  * Penalize another thread for the time spent on this one.  This helps to
  * worsen the priority and interactivity of processes which schedule batch
  * jobs such as make.  This has little effect on the make process itself but
  * causes new processes spawned by it to receive worse scores immediately.
  */
 void
 sched_exit_thread(struct thread *td, struct thread *child)
 {
 
 	KTR_STATE1(KTR_SCHED, "thread", sched_tdname(child), "thread exit",
 	    "prio:%d", child->td_priority);
 	/*
 	 * Give the child's runtime to the parent without returning the
 	 * sleep time as a penalty to the parent.  This causes shells that
 	 * launch expensive things to mark their children as expensive.
 	 */
 	thread_lock(td);
 	td->td_sched->ts_runtime += child->td_sched->ts_runtime;
 	sched_interact_update(td);
 	sched_priority(td);
 	thread_unlock(td);
 }
 
 void
 sched_preempt(struct thread *td)
 {
 	struct tdq *tdq;
 
 	thread_lock(td);
 	tdq = TDQ_SELF();
 	TDQ_LOCK_ASSERT(tdq, MA_OWNED);
 	tdq->tdq_ipipending = 0;
 	if (td->td_priority > tdq->tdq_lowpri) {
 		int flags;
 
 		flags = SW_INVOL | SW_PREEMPT;
 		if (td->td_critnest > 1)
 			td->td_owepreempt = 1;
 		else if (TD_IS_IDLETHREAD(td))
 			mi_switch(flags | SWT_REMOTEWAKEIDLE, NULL);
 		else
 			mi_switch(flags | SWT_REMOTEPREEMPT, NULL);
 	}
 	thread_unlock(td);
 }
 
 /*
  * Fix priorities on return to user-space.  Priorities may be elevated due
  * to static priorities in msleep() or similar.
  */
 void
 sched_userret(struct thread *td)
 {
 	/*
 	 * XXX we cheat slightly on the locking here to avoid locking in
 	 * the usual case.  Setting td_priority here is essentially an
 	 * incomplete workaround for not setting it properly elsewhere.
 	 * Now that some interrupt handlers are threads, not setting it
 	 * properly elsewhere can clobber it in the window between setting
 	 * it here and returning to user mode, so don't waste time setting
 	 * it perfectly here.
 	 */
 	KASSERT((td->td_flags & TDF_BORROWING) == 0,
 	    ("thread with borrowed priority returning to userland"));
 	if (td->td_priority != td->td_user_pri) {
 		thread_lock(td);
 		td->td_priority = td->td_user_pri;
 		td->td_base_pri = td->td_user_pri;
 		tdq_setlowpri(TDQ_SELF(), td);
 		thread_unlock(td);
 	}
 }
 
 /*
  * Handle a stathz tick.  This is really only relevant for timeshare
  * threads.
  */
 void
 sched_clock(struct thread *td)
 {
 	struct tdq *tdq;
 	struct td_sched *ts;
 
 	THREAD_LOCK_ASSERT(td, MA_OWNED);
 	tdq = TDQ_SELF();
 #ifdef SMP
 	/*
 	 * We run the long term load balancer infrequently on the first cpu.
 	 */
 	if (balance_tdq == tdq) {
 		if (balance_ticks && --balance_ticks == 0)
 			sched_balance();
 	}
 #endif
 	/*
 	 * Save the old switch count so we have a record of the last tick's
 	 * activity.  Initialize the new switch count based on our load.
 	 * If there is some activity seed it to reflect that.
 	 */
 	tdq->tdq_oldswitchcnt = tdq->tdq_switchcnt;
 	tdq->tdq_switchcnt = tdq->tdq_load;
 	/*
 	 * Advance the insert index once for each tick to ensure that all
 	 * threads get a chance to run.
 	 */
 	if (tdq->tdq_idx == tdq->tdq_ridx) {
 		tdq->tdq_idx = (tdq->tdq_idx + 1) % RQ_NQS;
 		if (TAILQ_EMPTY(&tdq->tdq_timeshare.rq_queues[tdq->tdq_ridx]))
 			tdq->tdq_ridx = tdq->tdq_idx;
 	}
 	ts = td->td_sched;
 	if (td->td_pri_class & PRI_FIFO_BIT)
 		return;
 	if (PRI_BASE(td->td_pri_class) == PRI_TIMESHARE) {
 		/*
 		 * We used a tick; charge it to the thread so
 		 * that we can compute our interactivity.
 		 */
 		td->td_sched->ts_runtime += tickincr;
 		sched_interact_update(td);
 		sched_priority(td);
 	}
 	/*
 	 * We used up one time slice.
 	 */
 	if (--ts->ts_slice > 0)
 		return;
 	/*
 	 * We're out of time, force a requeue at userret().
 	 */
 	ts->ts_slice = sched_slice;
 	td->td_flags |= TDF_NEEDRESCHED;
 }
 
 /*
  * Called once per hz tick.  Used for cpu utilization information.  This
  * is easier than trying to scale based on stathz.
  */
 void
 sched_tick(int cnt)
 {
 	struct td_sched *ts;
 
 	ts = curthread->td_sched;
 	/*
 	 * Ticks is updated asynchronously on a single cpu.  Check here to
 	 * avoid incrementing ts_ticks multiple times in a single tick.
 	 */
 	if (ts->ts_incrtick == ticks)
 		return;
 	/* Adjust ticks for pctcpu */
 	ts->ts_ticks += cnt << SCHED_TICK_SHIFT;
 	ts->ts_ltick = ticks;
 	ts->ts_incrtick = ticks;
 	/*
 	 * Update if we've exceeded our desired tick threshold by over one
 	 * second.
 	 */
 	if (ts->ts_ftick + SCHED_TICK_MAX < ts->ts_ltick)
 		sched_pctcpu_update(ts);
 }
 
 /*
  * Return whether the current CPU has runnable tasks.  Used for in-kernel
  * cooperative idle threads.
  */
 int
 sched_runnable(void)
 {
 	struct tdq *tdq;
 	int load;
 
 	load = 1;
 
 	tdq = TDQ_SELF();
 	if ((curthread->td_flags & TDF_IDLETD) != 0) {
 		if (tdq->tdq_load > 0)
 			goto out;
 	} else
 		if (tdq->tdq_load - 1 > 0)
 			goto out;
 	load = 0;
 out:
 	return (load);
 }
 
 /*
  * Choose the highest priority thread to run.  The thread is removed from
  * the run-queue while running however the load remains.  For SMP we set
  * the tdq in the global idle bitmask if it idles here.
  */
 struct thread *
 sched_choose(void)
 {
 	struct thread *td;
 	struct tdq *tdq;
 
 	tdq = TDQ_SELF();
 	TDQ_LOCK_ASSERT(tdq, MA_OWNED);
 	td = tdq_choose(tdq);
 	if (td) {
 		td->td_sched->ts_ltick = ticks;
 		tdq_runq_rem(tdq, td);
 		tdq->tdq_lowpri = td->td_priority;
 		return (td);
 	}
 	tdq->tdq_lowpri = PRI_MAX_IDLE;
 	return (PCPU_GET(idlethread));
 }
 
 /*
  * Set owepreempt if necessary.  Preemption never happens directly in ULE,
  * we always request it once we exit a critical section.
  */
 static inline void
 sched_setpreempt(struct thread *td)
 {
 	struct thread *ctd;
 	int cpri;
 	int pri;
 
 	THREAD_LOCK_ASSERT(curthread, MA_OWNED);
 
 	ctd = curthread;
 	pri = td->td_priority;
 	cpri = ctd->td_priority;
 	if (pri < cpri)
 		ctd->td_flags |= TDF_NEEDRESCHED;
 	if (panicstr != NULL || pri >= cpri || cold || TD_IS_INHIBITED(ctd))
 		return;
 	if (!sched_shouldpreempt(pri, cpri, 0))
 		return;
 	ctd->td_owepreempt = 1;
 }
 
 /*
  * Add a thread to a thread queue.  Select the appropriate runq and add the
  * thread to it.  This is the internal function called when the tdq is
  * predetermined.
  */
 void
 tdq_add(struct tdq *tdq, struct thread *td, int flags)
 {
 
 	TDQ_LOCK_ASSERT(tdq, MA_OWNED);
 	KASSERT((td->td_inhibitors == 0),
 	    ("sched_add: trying to run inhibited thread"));
 	KASSERT((TD_CAN_RUN(td) || TD_IS_RUNNING(td)),
 	    ("sched_add: bad thread state"));
 	KASSERT(td->td_flags & TDF_INMEM,
 	    ("sched_add: thread swapped out"));
 
 	if (td->td_priority < tdq->tdq_lowpri)
 		tdq->tdq_lowpri = td->td_priority;
 	tdq_runq_add(tdq, td, flags);
 	tdq_load_add(tdq, td);
 }
 
 /*
  * Select the target thread queue and add a thread to it.  Request
  * preemption or IPI a remote processor if required.
  */
 void
 sched_add(struct thread *td, int flags)
 {
 	struct tdq *tdq;
 #ifdef SMP
 	int cpu;
 #endif
 
 	KTR_STATE2(KTR_SCHED, "thread", sched_tdname(td), "runq add",
 	    "prio:%d", td->td_priority, KTR_ATTR_LINKED,
 	    sched_tdname(curthread));
 	KTR_POINT1(KTR_SCHED, "thread", sched_tdname(curthread), "wokeup",
 	    KTR_ATTR_LINKED, sched_tdname(td));
 	THREAD_LOCK_ASSERT(td, MA_OWNED);
 	/*
 	 * Recalculate the priority before we select the target cpu or
 	 * run-queue.
 	 */
 	if (PRI_BASE(td->td_pri_class) == PRI_TIMESHARE)
 		sched_priority(td);
 #ifdef SMP
 	/*
 	 * Pick the destination cpu and if it isn't ours transfer to the
 	 * target cpu.
 	 */
 	cpu = sched_pickcpu(td, flags);
 	tdq = sched_setcpu(td, cpu, flags);
 	tdq_add(tdq, td, flags);
 	if (cpu != PCPU_GET(cpuid)) {
 		tdq_notify(tdq, td);
 		return;
 	}
 #else
 	tdq = TDQ_SELF();
 	TDQ_LOCK(tdq);
 	/*
 	 * Now that the thread is moving to the run-queue, set the lock
 	 * to the scheduler's lock.
 	 */
 	thread_lock_set(td, TDQ_LOCKPTR(tdq));
 	tdq_add(tdq, td, flags);
 #endif
 	if (!(flags & SRQ_YIELDING))
 		sched_setpreempt(td);
 }
 
 /*
  * Remove a thread from a run-queue without running it.  This is used
  * when we're stealing a thread from a remote queue.  Otherwise all threads
  * exit by calling sched_exit_thread() and sched_throw() themselves.
  */
 void
 sched_rem(struct thread *td)
 {
 	struct tdq *tdq;
 
 	KTR_STATE1(KTR_SCHED, "thread", sched_tdname(td), "runq rem",
 	    "prio:%d", td->td_priority);
 	tdq = TDQ_CPU(td->td_sched->ts_cpu);
 	TDQ_LOCK_ASSERT(tdq, MA_OWNED);
 	MPASS(td->td_lock == TDQ_LOCKPTR(tdq));
 	KASSERT(TD_ON_RUNQ(td),
 	    ("sched_rem: thread not on run queue"));
 	tdq_runq_rem(tdq, td);
 	tdq_load_rem(tdq, td);
 	TD_SET_CAN_RUN(td);
 	if (td->td_priority == tdq->tdq_lowpri)
 		tdq_setlowpri(tdq, NULL);
 }
 
 /*
  * Fetch cpu utilization information.  Updates on demand.
  */
 fixpt_t
 sched_pctcpu(struct thread *td)
 {
 	fixpt_t pctcpu;
 	struct td_sched *ts;
 
 	pctcpu = 0;
 	ts = td->td_sched;
 	if (ts == NULL)
 		return (0);
 
 	THREAD_LOCK_ASSERT(td, MA_OWNED);
 	if (ts->ts_ticks) {
 		int rtick;
 
 		sched_pctcpu_update(ts);
 		/* How many rtick per second ? */
 		rtick = min(SCHED_TICK_HZ(ts) / SCHED_TICK_SECS, hz);
 		pctcpu = (FSCALE * ((FSCALE * rtick)/hz)) >> FSHIFT;
 	}
 
 	return (pctcpu);
 }
 
 /*
  * Enforce affinity settings for a thread.  Called after adjustments to
  * cpumask.
  */
 void
 sched_affinity(struct thread *td)
 {
 #ifdef SMP
 	struct td_sched *ts;
 
 	THREAD_LOCK_ASSERT(td, MA_OWNED);
 	ts = td->td_sched;
 	if (THREAD_CAN_SCHED(td, ts->ts_cpu))
 		return;
 	if (TD_ON_RUNQ(td)) {
 		sched_rem(td);
 		sched_add(td, SRQ_BORING);
 		return;
 	}
 	if (!TD_IS_RUNNING(td))
 		return;
 	/*
 	 * Force a switch before returning to userspace.  If the
 	 * target thread is not running locally send an ipi to force
 	 * the issue.
 	 */
 	td->td_flags |= TDF_NEEDRESCHED;
 	if (td != curthread)
 		ipi_cpu(ts->ts_cpu, IPI_PREEMPT);
 #endif
 }
 
 /*
  * Bind a thread to a target cpu.
  */
 void
 sched_bind(struct thread *td, int cpu)
 {
 	struct td_sched *ts;
 
 	THREAD_LOCK_ASSERT(td, MA_OWNED|MA_NOTRECURSED);
 	KASSERT(td == curthread, ("sched_bind: can only bind curthread"));
 	ts = td->td_sched;
 	if (ts->ts_flags & TSF_BOUND)
 		sched_unbind(td);
 	KASSERT(THREAD_CAN_MIGRATE(td), ("%p must be migratable", td));
 	ts->ts_flags |= TSF_BOUND;
 	sched_pin();
 	if (PCPU_GET(cpuid) == cpu)
 		return;
 	ts->ts_cpu = cpu;
 	/* When we return from mi_switch we'll be on the correct cpu. */
 	mi_switch(SW_VOL, NULL);
 }
 
 /*
  * Release a bound thread.
  */
 void
 sched_unbind(struct thread *td)
 {
 	struct td_sched *ts;
 
 	THREAD_LOCK_ASSERT(td, MA_OWNED);
 	KASSERT(td == curthread, ("sched_unbind: can only unbind curthread"));
 	ts = td->td_sched;
 	if ((ts->ts_flags & TSF_BOUND) == 0)
 		return;
 	ts->ts_flags &= ~TSF_BOUND;
 	sched_unpin();
 }
 
 int
 sched_is_bound(struct thread *td)
 {
 	THREAD_LOCK_ASSERT(td, MA_OWNED);
 	return (td->td_sched->ts_flags & TSF_BOUND);
 }
 
 /*
  * Basic yield call.
  */
 void
 sched_relinquish(struct thread *td)
 {
 	thread_lock(td);
 	mi_switch(SW_VOL | SWT_RELINQUISH, NULL);
 	thread_unlock(td);
 }
 
 /*
  * Return the total system load.
  */
 int
 sched_load(void)
 {
 #ifdef SMP
 	int total;
 	int i;
 
 	total = 0;
 	CPU_FOREACH(i)
 		total += TDQ_CPU(i)->tdq_sysload;
 	return (total);
 #else
 	return (TDQ_SELF()->tdq_sysload);
 #endif
 }
 
 int
 sched_sizeof_proc(void)
 {
 	return (sizeof(struct proc));
 }
 
 int
 sched_sizeof_thread(void)
 {
 	return (sizeof(struct thread) + sizeof(struct td_sched));
 }
 
 #ifdef SMP
 #define	TDQ_IDLESPIN(tdq)						\
     ((tdq)->tdq_cg != NULL && ((tdq)->tdq_cg->cg_flags & CG_FLAG_THREAD) == 0)
 #else
 #define	TDQ_IDLESPIN(tdq)	1
 #endif
 
 /*
  * The actual idle process.
  */
 void
 sched_idletd(void *dummy)
 {
 	struct thread *td;
 	struct tdq *tdq;
 	int switchcnt;
 	int i;
 
 	mtx_assert(&Giant, MA_NOTOWNED);
 	td = curthread;
 	tdq = TDQ_SELF();
 	for (;;) {
 #ifdef SMP
 		if (tdq_idled(tdq) == 0)
 			continue;
 #endif
 		switchcnt = tdq->tdq_switchcnt + tdq->tdq_oldswitchcnt;
 		/*
 		 * If we're switching very frequently, spin while checking
 		 * for load rather than entering a low power state that 
 		 * may require an IPI.  However, don't do any busy
 		 * loops while on SMT machines as this simply steals
 		 * cycles from cores doing useful work.
 		 */
 		if (TDQ_IDLESPIN(tdq) && switchcnt > sched_idlespinthresh) {
 			for (i = 0; i < sched_idlespins; i++) {
 				if (tdq->tdq_load)
 					break;
 				cpu_spinwait();
 			}
 		}
 		switchcnt = tdq->tdq_switchcnt + tdq->tdq_oldswitchcnt;
 		if (tdq->tdq_load == 0) {
 			tdq->tdq_cpu_idle = 1;
 			if (tdq->tdq_load == 0) {
 				cpu_idle(switchcnt > sched_idlespinthresh * 4);
 				tdq->tdq_switchcnt++;
 			}
 			tdq->tdq_cpu_idle = 0;
 		}
 		if (tdq->tdq_load) {
 			thread_lock(td);
 			mi_switch(SW_VOL | SWT_IDLE, NULL);
 			thread_unlock(td);
 		}
 	}
 }
 
 /*
  * A CPU is entering for the first time or a thread is exiting.
  */
 void
 sched_throw(struct thread *td)
 {
 	struct thread *newtd;
 	struct tdq *tdq;
 
 	tdq = TDQ_SELF();
 	if (td == NULL) {
 		/* Correct spinlock nesting and acquire the correct lock. */
 		TDQ_LOCK(tdq);
 		spinlock_exit();
 	} else {
 		MPASS(td->td_lock == TDQ_LOCKPTR(tdq));
 		tdq_load_rem(tdq, td);
 		lock_profile_release_lock(&TDQ_LOCKPTR(tdq)->lock_object);
 	}
 	KASSERT(curthread->td_md.md_spinlock_count == 1, ("invalid count"));
 	newtd = choosethread();
 	TDQ_LOCKPTR(tdq)->mtx_lock = (uintptr_t)newtd;
 	PCPU_SET(switchtime, cpu_ticks());
 	PCPU_SET(switchticks, ticks);
 	cpu_throw(td, newtd);		/* doesn't return */
 }
 
 /*
  * This is called from fork_exit().  Just acquire the correct locks and
  * let fork do the rest of the work.
  */
 void
 sched_fork_exit(struct thread *td)
 {
 	struct td_sched *ts;
 	struct tdq *tdq;
 	int cpuid;
 
 	/*
 	 * Finish setting up thread glue so that it begins execution in a
 	 * non-nested critical section with the scheduler lock held.
 	 */
 	cpuid = PCPU_GET(cpuid);
 	tdq = TDQ_CPU(cpuid);
 	ts = td->td_sched;
 	if (TD_IS_IDLETHREAD(td))
 		td->td_lock = TDQ_LOCKPTR(tdq);
 	MPASS(td->td_lock == TDQ_LOCKPTR(tdq));
 	td->td_oncpu = cpuid;
 	TDQ_LOCK_ASSERT(tdq, MA_OWNED | MA_NOTRECURSED);
 	lock_profile_obtain_lock_success(
 	    &TDQ_LOCKPTR(tdq)->lock_object, 0, 0, __FILE__, __LINE__);
 }
 
 /*
  * Create on first use to catch odd startup conditions.
  */
 char *
 sched_tdname(struct thread *td)
 {
 #ifdef KTR
 	struct td_sched *ts;
 
 	ts = td->td_sched;
 	if (ts->ts_name[0] == '\0')
 		snprintf(ts->ts_name, sizeof(ts->ts_name),
 		    "%s tid %d", td->td_name, td->td_tid);
 	return (ts->ts_name);
 #else
 	return (td->td_name);
 #endif
 }
 
 #ifdef SMP
 
 /*
  * Build the CPU topology dump string. Is recursively called to collect
  * the topology tree.
  */
 static int
 sysctl_kern_sched_topology_spec_internal(struct sbuf *sb, struct cpu_group *cg,
     int indent)
 {
+	char cpusetbuf[CPUSETBUFSIZ];
 	int i, first;
 
 	sbuf_printf(sb, "%*s<group level=\"%d\" cache-level=\"%d\">\n", indent,
 	    "", 1 + indent / 2, cg->cg_level);
-	sbuf_printf(sb, "%*s <cpu count=\"%d\" mask=\"0x%x\">", indent, "",
-	    cg->cg_count, cg->cg_mask);
+	sbuf_printf(sb, "%*s <cpu count=\"%d\" mask=\"%s\">", indent, "",
+	    cg->cg_count, cpusetobj_strprint(cpusetbuf, &cg->cg_mask));
 	first = TRUE;
 	for (i = 0; i < MAXCPU; i++) {
-		if ((cg->cg_mask & (1 << i)) != 0) {
+		if (CPU_ISSET(i, &cg->cg_mask)) {
 			if (!first)
 				sbuf_printf(sb, ", ");
 			else
 				first = FALSE;
 			sbuf_printf(sb, "%d", i);
 		}
 	}
 	sbuf_printf(sb, "</cpu>\n");
 
 	if (cg->cg_flags != 0) {
 		sbuf_printf(sb, "%*s <flags>", indent, "");
 		if ((cg->cg_flags & CG_FLAG_HTT) != 0)
 			sbuf_printf(sb, "<flag name=\"HTT\">HTT group</flag>");
 		if ((cg->cg_flags & CG_FLAG_THREAD) != 0)
 			sbuf_printf(sb, "<flag name=\"THREAD\">THREAD group</flag>");
 		if ((cg->cg_flags & CG_FLAG_SMT) != 0)
 			sbuf_printf(sb, "<flag name=\"SMT\">SMT group</flag>");
 		sbuf_printf(sb, "</flags>\n");
 	}
 
 	if (cg->cg_children > 0) {
 		sbuf_printf(sb, "%*s <children>\n", indent, "");
 		for (i = 0; i < cg->cg_children; i++)
 			sysctl_kern_sched_topology_spec_internal(sb, 
 			    &cg->cg_child[i], indent+2);
 		sbuf_printf(sb, "%*s </children>\n", indent, "");
 	}
 	sbuf_printf(sb, "%*s</group>\n", indent, "");
 	return (0);
 }
 
 /*
  * Sysctl handler for retrieving topology dump. It's a wrapper for
  * the recursive sysctl_kern_sched_topology_spec_internal().
  */
 static int
 sysctl_kern_sched_topology_spec(SYSCTL_HANDLER_ARGS)
 {
 	struct sbuf *topo;
 	int err;
 
 	KASSERT(cpu_top != NULL, ("cpu_top isn't initialized"));
 
 	topo = sbuf_new(NULL, NULL, 500, SBUF_AUTOEXTEND);
 	if (topo == NULL)
 		return (ENOMEM);
 
 	sbuf_printf(topo, "<groups>\n");
 	err = sysctl_kern_sched_topology_spec_internal(topo, cpu_top, 1);
 	sbuf_printf(topo, "</groups>\n");
 
 	if (err == 0) {
 		sbuf_finish(topo);
 		err = SYSCTL_OUT(req, sbuf_data(topo), sbuf_len(topo));
 	}
 	sbuf_delete(topo);
 	return (err);
 }
 
 #endif
 
 SYSCTL_NODE(_kern, OID_AUTO, sched, CTLFLAG_RW, 0, "Scheduler");
 SYSCTL_STRING(_kern_sched, OID_AUTO, name, CTLFLAG_RD, "ULE", 0,
     "Scheduler name");
 SYSCTL_INT(_kern_sched, OID_AUTO, slice, CTLFLAG_RW, &sched_slice, 0,
     "Slice size for timeshare threads");
 SYSCTL_INT(_kern_sched, OID_AUTO, interact, CTLFLAG_RW, &sched_interact, 0,
      "Interactivity score threshold");
 SYSCTL_INT(_kern_sched, OID_AUTO, preempt_thresh, CTLFLAG_RW, &preempt_thresh,
      0,"Min priority for preemption, lower priorities have greater precedence");
 SYSCTL_INT(_kern_sched, OID_AUTO, static_boost, CTLFLAG_RW, &static_boost,
      0,"Controls whether static kernel priorities are assigned to sleeping threads.");
 SYSCTL_INT(_kern_sched, OID_AUTO, idlespins, CTLFLAG_RW, &sched_idlespins,
      0,"Number of times idle will spin waiting for new work.");
 SYSCTL_INT(_kern_sched, OID_AUTO, idlespinthresh, CTLFLAG_RW, &sched_idlespinthresh,
      0,"Threshold before we will permit idle spinning.");
 #ifdef SMP
 SYSCTL_INT(_kern_sched, OID_AUTO, affinity, CTLFLAG_RW, &affinity, 0,
     "Number of hz ticks to keep thread affinity for");
 SYSCTL_INT(_kern_sched, OID_AUTO, balance, CTLFLAG_RW, &rebalance, 0,
     "Enables the long-term load balancer");
 SYSCTL_INT(_kern_sched, OID_AUTO, balance_interval, CTLFLAG_RW,
     &balance_interval, 0,
     "Average frequency in stathz ticks to run the long-term balancer");
 SYSCTL_INT(_kern_sched, OID_AUTO, steal_htt, CTLFLAG_RW, &steal_htt, 0,
     "Steals work from another hyper-threaded core on idle");
 SYSCTL_INT(_kern_sched, OID_AUTO, steal_idle, CTLFLAG_RW, &steal_idle, 0,
     "Attempts to steal work from other cores before idling");
 SYSCTL_INT(_kern_sched, OID_AUTO, steal_thresh, CTLFLAG_RW, &steal_thresh, 0,
     "Minimum load on remote cpu before we'll steal");
 
 /* Retrieve SMP topology */
 SYSCTL_PROC(_kern_sched, OID_AUTO, topology_spec, CTLTYPE_STRING |
     CTLFLAG_RD, NULL, 0, sysctl_kern_sched_topology_spec, "A", 
     "XML dump of detected CPU topology");
 
 #endif
 
 /* ps compat.  All cpu percentages from ULE are weighted. */
 static int ccpu = 0;
 SYSCTL_INT(_kern, OID_AUTO, ccpu, CTLFLAG_RD, &ccpu, 0, "");
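The topology dump hunk above replaces `cg->cg_mask & (1 << i)` with `CPU_ISSET(i, &cg->cg_mask)`. A plausible motivation is that a shift on a single integer cannot index CPUs beyond the width of that integer, while a multi-word set can. The following is a minimal userland sketch of that idea, not the kernel's implementation: `demo_cpuset_t`, `DEMO_MAXCPU`, `demo_cpu_set()`, `demo_cpu_isset()` and `demo_cpu_list()` are hypothetical names invented for illustration, and the word layout is an assumption.

```c
#include <assert.h>
#include <limits.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-ins for MAXCPU and cpuset_t; not the kernel's types. */
#define DEMO_MAXCPU	128
#define DEMO_WORD_BITS	(sizeof(unsigned long) * CHAR_BIT)
#define DEMO_NWORDS	((DEMO_MAXCPU + DEMO_WORD_BITS - 1) / DEMO_WORD_BITS)

typedef struct {
	unsigned long bits[DEMO_NWORDS];
} demo_cpuset_t;

/* Empty the set. */
static void
demo_cpu_zero(demo_cpuset_t *set)
{
	memset(set, 0, sizeof(*set));
}

/* Add one CPU to the set. */
static void
demo_cpu_set(int cpu, demo_cpuset_t *set)
{
	assert(cpu >= 0 && cpu < DEMO_MAXCPU);
	set->bits[cpu / DEMO_WORD_BITS] |= 1UL << (cpu % DEMO_WORD_BITS);
}

/* Membership test; plays the role CPU_ISSET() plays in the diff. */
static int
demo_cpu_isset(int cpu, const demo_cpuset_t *set)
{
	assert(cpu >= 0 && cpu < DEMO_MAXCPU);
	return (((set->bits[cpu / DEMO_WORD_BITS] >>
	    (cpu % DEMO_WORD_BITS)) & 1UL) != 0);
}

/*
 * Write the members as "a, b, c", mirroring the loop in the topology
 * dump.  Returns the number of characters written, excluding the NUL;
 * output that does not fit is dropped at an entry boundary.
 */
static size_t
demo_cpu_list(const demo_cpuset_t *set, char *buf, size_t len)
{
	size_t off = 0;
	int i, n, first = 1;

	if (len > 0)
		buf[0] = '\0';
	for (i = 0; i < DEMO_MAXCPU; i++) {
		if (!demo_cpu_isset(i, set))
			continue;
		n = snprintf(buf + off, len - off, "%s%d",
		    first ? "" : ", ", i);
		if (n < 0 || (size_t)n >= len - off) {
			if (len > 0)
				buf[off] = '\0';	/* discard partial entry */
			break;
		}
		off += (size_t)n;
		first = 0;
	}
	return (off);
}
```

In this sketch a CPU number such as 70 is handled the same way as CPU 0, which a plain `1 << i` on a 32-bit `int` could not express.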
Index: head/sys/kern/subr_kdb.c
===================================================================
--- head/sys/kern/subr_kdb.c	(revision 222812)
+++ head/sys/kern/subr_kdb.c	(revision 222813)
@@ -1,552 +1,553 @@
 /*-
  * Copyright (c) 2004 The FreeBSD Project
  * All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
  *
  * 1. Redistributions of source code must retain the above copyright
  *    notice, this list of conditions and the following disclaimer.
  * 2. Redistributions in binary form must reproduce the above copyright
  *    notice, this list of conditions and the following disclaimer in the
  *    documentation and/or other materials provided with the distribution.
  *
  * THIS SOFTWARE IS PROVIDED BY THE AUTHORS ``AS IS'' AND ANY EXPRESS OR
  * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
  * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
  * IN NO EVENT SHALL THE AUTHORS BE LIABLE FOR ANY DIRECT, INDIRECT,
  * INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT
  * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
  * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
  * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
  * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF
  * THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
  */
 
 #include <sys/cdefs.h>
 __FBSDID("$FreeBSD$");
 
 #include "opt_kdb.h"
 #include "opt_stack.h"
 
 #include <sys/param.h>
 #include <sys/systm.h>
 #include <sys/kdb.h>
 #include <sys/kernel.h>
 #include <sys/malloc.h>
 #include <sys/pcpu.h>
 #include <sys/proc.h>
 #include <sys/sbuf.h>
 #include <sys/smp.h>
 #include <sys/stack.h>
 #include <sys/sysctl.h>
 
 #include <machine/kdb.h>
 #include <machine/pcb.h>
 
 #ifdef SMP
 #include <machine/smp.h>
 #endif
 
 int kdb_active = 0;
 static void *kdb_jmpbufp = NULL;
 struct kdb_dbbe *kdb_dbbe = NULL;
 static struct pcb kdb_pcb;
 struct pcb *kdb_thrctx = NULL;
 struct thread *kdb_thread = NULL;
 struct trapframe *kdb_frame = NULL;
 
 KDB_BACKEND(null, NULL, NULL, NULL);
 SET_DECLARE(kdb_dbbe_set, struct kdb_dbbe);
 
 static int kdb_sysctl_available(SYSCTL_HANDLER_ARGS);
 static int kdb_sysctl_current(SYSCTL_HANDLER_ARGS);
 static int kdb_sysctl_enter(SYSCTL_HANDLER_ARGS);
 static int kdb_sysctl_panic(SYSCTL_HANDLER_ARGS);
 static int kdb_sysctl_trap(SYSCTL_HANDLER_ARGS);
 static int kdb_sysctl_trap_code(SYSCTL_HANDLER_ARGS);
 
 SYSCTL_NODE(_debug, OID_AUTO, kdb, CTLFLAG_RW, NULL, "KDB nodes");
 
 SYSCTL_PROC(_debug_kdb, OID_AUTO, available, CTLTYPE_STRING | CTLFLAG_RD, NULL,
     0, kdb_sysctl_available, "A", "list of available KDB backends");
 
 SYSCTL_PROC(_debug_kdb, OID_AUTO, current, CTLTYPE_STRING | CTLFLAG_RW, NULL,
     0, kdb_sysctl_current, "A", "currently selected KDB backend");
 
 SYSCTL_PROC(_debug_kdb, OID_AUTO, enter, CTLTYPE_INT | CTLFLAG_RW, NULL, 0,
     kdb_sysctl_enter, "I", "set to enter the debugger");
 
 SYSCTL_PROC(_debug_kdb, OID_AUTO, panic, CTLTYPE_INT | CTLFLAG_RW, NULL, 0,
     kdb_sysctl_panic, "I", "set to panic the kernel");
 
 SYSCTL_PROC(_debug_kdb, OID_AUTO, trap, CTLTYPE_INT | CTLFLAG_RW, NULL, 0,
     kdb_sysctl_trap, "I", "set to cause a page fault via data access");
 
 SYSCTL_PROC(_debug_kdb, OID_AUTO, trap_code, CTLTYPE_INT | CTLFLAG_RW, NULL, 0,
     kdb_sysctl_trap_code, "I", "set to cause a page fault via code access");
 
 /*
  * Flag indicating whether or not to IPI the other CPUs to stop them on
  * entering the debugger.  Sometimes, this will result in a deadlock as
  * stop_cpus() waits for the other cpus to stop, so we allow it to be
  * disabled.  In order to maximize the chances of success, use a hard
  * stop for that.
  */
 #ifdef SMP
 static int kdb_stop_cpus = 1;
 SYSCTL_INT(_debug_kdb, OID_AUTO, stop_cpus, CTLFLAG_RW | CTLFLAG_TUN,
     &kdb_stop_cpus, 0, "stop other CPUs when entering the debugger");
 TUNABLE_INT("debug.kdb.stop_cpus", &kdb_stop_cpus);
 #endif
 
 /*
  * Flag to indicate to debuggers why the debugger was entered.
  */
 const char * volatile kdb_why = KDB_WHY_UNSET;
 
 static int
 kdb_sysctl_available(SYSCTL_HANDLER_ARGS)
 {
 	struct kdb_dbbe **iter;
 	struct sbuf sbuf;
 	int error;
 
 	sbuf_new_for_sysctl(&sbuf, NULL, 64, req);
 	SET_FOREACH(iter, kdb_dbbe_set) {
 		if ((*iter)->dbbe_active == 0)
 			sbuf_printf(&sbuf, "%s ", (*iter)->dbbe_name);
 	}
 	error = sbuf_finish(&sbuf);
 	sbuf_delete(&sbuf);
 	return (error);
 }
 
 static int
 kdb_sysctl_current(SYSCTL_HANDLER_ARGS)
 {
 	char buf[16];
 	int error;
 
 	if (kdb_dbbe != NULL)
 		strlcpy(buf, kdb_dbbe->dbbe_name, sizeof(buf));
 	else
 		*buf = '\0';
 	error = sysctl_handle_string(oidp, buf, sizeof(buf), req);
 	if (error != 0 || req->newptr == NULL)
 		return (error);
 	if (kdb_active)
 		return (EBUSY);
 	return (kdb_dbbe_select(buf));
 }
 
 static int
 kdb_sysctl_enter(SYSCTL_HANDLER_ARGS)
 {
 	int error, i;
 
 	error = sysctl_wire_old_buffer(req, sizeof(int));
 	if (error == 0) {
 		i = 0;
 		error = sysctl_handle_int(oidp, &i, 0, req);
 	}
 	if (error != 0 || req->newptr == NULL)
 		return (error);
 	if (kdb_active)
 		return (EBUSY);
 	kdb_enter(KDB_WHY_SYSCTL, "sysctl debug.kdb.enter");
 	return (0);
 }
 
 static int
 kdb_sysctl_panic(SYSCTL_HANDLER_ARGS)
 {
 	int error, i;
 
 	error = sysctl_wire_old_buffer(req, sizeof(int));
 	if (error == 0) {
 		i = 0;
 		error = sysctl_handle_int(oidp, &i, 0, req);
 	}
 	if (error != 0 || req->newptr == NULL)
 		return (error);
 	panic("kdb_sysctl_panic");
 	return (0);
 }
 
 static int
 kdb_sysctl_trap(SYSCTL_HANDLER_ARGS)
 {
 	int error, i;
 	int *addr = (int *)0x10;
 
 	error = sysctl_wire_old_buffer(req, sizeof(int));
 	if (error == 0) {
 		i = 0;
 		error = sysctl_handle_int(oidp, &i, 0, req);
 	}
 	if (error != 0 || req->newptr == NULL)
 		return (error);
 	return (*addr);
 }
 
 static int
 kdb_sysctl_trap_code(SYSCTL_HANDLER_ARGS)
 {
 	int error, i;
 	void (*fp)(u_int, u_int, u_int) = (void *)0xdeadc0de;
 
 	error = sysctl_wire_old_buffer(req, sizeof(int));
 	if (error == 0) {
 		i = 0;
 		error = sysctl_handle_int(oidp, &i, 0, req);
 	}
 	if (error != 0 || req->newptr == NULL)
 		return (error);
 	(*fp)(0x11111111, 0x22222222, 0x33333333);
 	return (0);
 }
 
 void
 kdb_panic(const char *msg)
 {
 	
 #ifdef SMP
 	stop_cpus_hard(PCPU_GET(other_cpus));
 #endif
 	printf("KDB: panic\n");
 	panic("%s", msg);
 }
 
 void
 kdb_reboot(void)
 {
 
 	printf("KDB: reboot requested\n");
 	shutdown_nice(0);
 }
 
 /*
  * Solaris implements a new BREAK which is initiated by a character sequence
  * CR ~ ^b which is similar to a familiar pattern used on Sun servers by the
  * Remote Console.
  *
  * Note that this function may be called from almost anywhere, with interrupts
  * disabled and with unknown locks held, so it must not access data other than
  * its arguments.  Its up to the caller to ensure that the state variable is
  * consistent.
  */
 
 #define	KEY_CR		13	/* CR '\r' */
 #define	KEY_TILDE	126	/* ~ */
 #define	KEY_CRTLB	2	/* ^B */
 #define	KEY_CRTLP	16	/* ^P */
 #define	KEY_CRTLR	18	/* ^R */
 
 int
 kdb_alt_break(int key, int *state)
 {
 	int brk;
 
 	brk = 0;
 	switch (*state) {
 	case 0:
 		if (key == KEY_CR)
 			*state = 1;
 		break;
 	case 1:
 		if (key == KEY_TILDE)
 			*state = 2;
 		break;
 	case 2:
 		if (key == KEY_CRTLB)
 			brk = KDB_REQ_DEBUGGER;
 		else if (key == KEY_CRTLP)
 			brk = KDB_REQ_PANIC;
 		else if (key == KEY_CRTLR)
 			brk = KDB_REQ_REBOOT;
 		*state = 0;
 	}
 	return (brk);
 }
 
 /*
  * Print a backtrace of the calling thread. The backtrace is generated by
  * the selected debugger, provided it supports backtraces. If no debugger
  * is selected or the current debugger does not support backtraces, this
  * function silently returns.
  */
 
 void
 kdb_backtrace(void)
 {
 
 	if (kdb_dbbe != NULL && kdb_dbbe->dbbe_trace != NULL) {
 		printf("KDB: stack backtrace:\n");
 		kdb_dbbe->dbbe_trace();
 	}
 #ifdef STACK
 	else {
 		struct stack st;
 
 		printf("KDB: stack backtrace:\n");
 		stack_save(&st);
 		stack_print_ddb(&st);
 	}
 #endif
 }
 
 /*
  * Set/change the current backend.
  */
 
 int
 kdb_dbbe_select(const char *name)
 {
 	struct kdb_dbbe *be, **iter;
 
 	SET_FOREACH(iter, kdb_dbbe_set) {
 		be = *iter;
 		if (be->dbbe_active == 0 && strcmp(be->dbbe_name, name) == 0) {
 			kdb_dbbe = be;
 			return (0);
 		}
 	}
 	return (EINVAL);
 }
 
 /*
  * Enter the currently selected debugger. If a message has been provided,
  * it is printed first. If the debugger does not support the enter method,
  * it is entered by using breakpoint(), which enters the debugger through
  * kdb_trap().  The 'why' argument will contain a more mechanically usable
  * string than 'msg', and is relied upon by DDB scripting to identify the
  * reason for entering the debugger so that the right script can be run.
  */
 void
 kdb_enter(const char *why, const char *msg)
 {
 
 	if (kdb_dbbe != NULL && kdb_active == 0) {
 		if (msg != NULL)
 			printf("KDB: enter: %s\n", msg);
 		kdb_why = why;
 		breakpoint();
 		kdb_why = KDB_WHY_UNSET;
 	}
 }
 
 /*
  * Initialize the kernel debugger interface.
  */
 
 void
 kdb_init(void)
 {
 	struct kdb_dbbe *be, **iter;
 	int cur_pri, pri;
 
 	kdb_active = 0;
 	kdb_dbbe = NULL;
 	cur_pri = -1;
 	SET_FOREACH(iter, kdb_dbbe_set) {
 		be = *iter;
 		pri = (be->dbbe_init != NULL) ? be->dbbe_init() : -1;
 		be->dbbe_active = (pri >= 0) ? 0 : -1;
 		if (pri > cur_pri) {
 			cur_pri = pri;
 			kdb_dbbe = be;
 		}
 	}
 	if (kdb_dbbe != NULL) {
 		printf("KDB: debugger backends:");
 		SET_FOREACH(iter, kdb_dbbe_set) {
 			be = *iter;
 			if (be->dbbe_active == 0)
 				printf(" %s", be->dbbe_name);
 		}
 		printf("\n");
 		printf("KDB: current backend: %s\n",
 		    kdb_dbbe->dbbe_name);
 	}
 }
 
 /*
  * Handle contexts.
  */
 
 void *
 kdb_jmpbuf(jmp_buf new)
 {
 	void *old;
 
 	old = kdb_jmpbufp;
 	kdb_jmpbufp = new;
 	return (old);
 }
 
 void
 kdb_reenter(void)
 {
 
 	if (!kdb_active || kdb_jmpbufp == NULL)
 		return;
 
 	longjmp(kdb_jmpbufp, 1);
 	/* NOTREACHED */
 }
 
 /*
  * Thread related support functions.
  */
 
 struct pcb *
 kdb_thr_ctx(struct thread *thr)
 {  
 #if defined(SMP) && defined(KDB_STOPPEDPCB)
 	struct pcpu *pc;
 #endif
  
 	if (thr == curthread) 
 		return (&kdb_pcb);
 
 #if defined(SMP) && defined(KDB_STOPPEDPCB)
 	STAILQ_FOREACH(pc, &cpuhead, pc_allcpu)  {
-		if (pc->pc_curthread == thr && (stopped_cpus & pc->pc_cpumask))
+		if (pc->pc_curthread == thr &&
+		    CPU_OVERLAP(&stopped_cpus, &pc->pc_cpumask))
 			return (KDB_STOPPEDPCB(pc));
 	}
 #endif
 	return (thr->td_pcb);
 }
 
 struct thread *
 kdb_thr_first(void)
 {
 	struct proc *p;
 	struct thread *thr;
 
 	p = LIST_FIRST(&allproc);
 	while (p != NULL) {
 		if (p->p_flag & P_INMEM) {
 			thr = FIRST_THREAD_IN_PROC(p);
 			if (thr != NULL)
 				return (thr);
 		}
 		p = LIST_NEXT(p, p_list);
 	}
 	return (NULL);
 }
 
 struct thread *
 kdb_thr_from_pid(pid_t pid)
 {
 	struct proc *p;
 
 	p = LIST_FIRST(&allproc);
 	while (p != NULL) {
 		if (p->p_flag & P_INMEM && p->p_pid == pid)
 			return (FIRST_THREAD_IN_PROC(p));
 		p = LIST_NEXT(p, p_list);
 	}
 	return (NULL);
 }
 
 struct thread *
 kdb_thr_lookup(lwpid_t tid)
 {
 	struct thread *thr;
 
 	thr = kdb_thr_first();
 	while (thr != NULL && thr->td_tid != tid)
 		thr = kdb_thr_next(thr);
 	return (thr);
 }
 
 struct thread *
 kdb_thr_next(struct thread *thr)
 {
 	struct proc *p;
 
 	p = thr->td_proc;
 	thr = TAILQ_NEXT(thr, td_plist);
 	do {
 		if (thr != NULL)
 			return (thr);
 		p = LIST_NEXT(p, p_list);
 		if (p != NULL && (p->p_flag & P_INMEM))
 			thr = FIRST_THREAD_IN_PROC(p);
 	} while (p != NULL);
 	return (NULL);
 }
 
 int
 kdb_thr_select(struct thread *thr)
 {
 	if (thr == NULL)
 		return (EINVAL);
 	kdb_thread = thr;
 	kdb_thrctx = kdb_thr_ctx(thr);
 	return (0);
 }
 
 /*
  * Enter the debugger due to a trap.
  */
 
 int
 kdb_trap(int type, int code, struct trapframe *tf)
 {
 	struct kdb_dbbe *be;
 	register_t intr;
 #ifdef SMP
 	int did_stop_cpus;
 #endif
 	int handled;
 
 	be = kdb_dbbe;
 	if (be == NULL || be->dbbe_trap == NULL)
 		return (0);
 
 	/* We reenter the debugger through kdb_reenter(). */
 	if (kdb_active)
 		return (0);
 
 	intr = intr_disable();
 
 #ifdef SMP
 	if ((did_stop_cpus = kdb_stop_cpus) != 0)
 		stop_cpus_hard(PCPU_GET(other_cpus));
 #endif
 
 	kdb_active++;
 
 	kdb_frame = tf;
 
 	/* Let MD code do its thing first... */
 	kdb_cpu_trap(type, code);
 
 	makectx(tf, &kdb_pcb);
 	kdb_thr_select(curthread);
 
 	for (;;) {
 		handled = be->dbbe_trap(type, code);
 		if (be == kdb_dbbe)
 			break;
 		be = kdb_dbbe;
 		if (be == NULL || be->dbbe_trap == NULL)
 			break;
 		printf("Switching to %s back-end\n", be->dbbe_name);
 	}
 
 	kdb_active--;
 
 #ifdef SMP
 	if (did_stop_cpus)
 		restart_cpus(stopped_cpus);
 #endif
 
 	intr_restore(intr);
 
 	return (handled);
 }
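In the kdb_thr_ctx() hunk above, the old test `stopped_cpus & pc->pc_cpumask` becomes `CPU_OVERLAP(&stopped_cpus, &pc->pc_cpumask)`, because two sets can no longer be intersected with a single `&` once the mask is not a plain integer. The sketch below models that check in userland under stated assumptions: `demo_cpuset_t`, `demo_cpu_setof()`, `demo_cpu_set()` and `demo_cpu_overlap()` are hypothetical helpers written for illustration and only approximate what `CPU_SETOF()` and `CPU_OVERLAP()` are used for in this change.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for MAXCPU and cpuset_t; not the kernel's types. */
#define DEMO_MAXCPU	128
#define DEMO_WORD_BITS	(sizeof(unsigned long) * CHAR_BIT)
#define DEMO_NWORDS	((DEMO_MAXCPU + DEMO_WORD_BITS - 1) / DEMO_WORD_BITS)

typedef struct {
	unsigned long bits[DEMO_NWORDS];
} demo_cpuset_t;

/* Clear the set, then add exactly one CPU (the role of CPU_SETOF()). */
static void
demo_cpu_setof(int cpu, demo_cpuset_t *set)
{
	assert(cpu >= 0 && cpu < DEMO_MAXCPU);
	memset(set, 0, sizeof(*set));
	set->bits[cpu / DEMO_WORD_BITS] = 1UL << (cpu % DEMO_WORD_BITS);
}

/* Add one CPU to an existing set without clearing it. */
static void
demo_cpu_set(int cpu, demo_cpuset_t *set)
{
	assert(cpu >= 0 && cpu < DEMO_MAXCPU);
	set->bits[cpu / DEMO_WORD_BITS] |= 1UL << (cpu % DEMO_WORD_BITS);
}

/* Nonzero if the sets share at least one CPU (the role of CPU_OVERLAP()). */
static int
demo_cpu_overlap(const demo_cpuset_t *a, const demo_cpuset_t *b)
{
	size_t i;

	for (i = 0; i < DEMO_NWORDS; i++)
		if ((a->bits[i] & b->bits[i]) != 0)
			return (1);
	return (0);
}
```

With a per-CPU mask built by `demo_cpu_setof()`, the overlap test answers "is this CPU in the stopped set?" without assuming the whole set fits in one machine word.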
Index: head/sys/kern/subr_pcpu.c
===================================================================
--- head/sys/kern/subr_pcpu.c	(revision 222812)
+++ head/sys/kern/subr_pcpu.c	(revision 222813)
@@ -1,398 +1,398 @@
 /*-
  * Copyright (c) 2001 Wind River Systems, Inc.
  * All rights reserved.
  * Written by: John Baldwin <jhb@FreeBSD.org>
  *
  * Copyright (c) 2009 Jeffrey Roberson <jeff@freebsd.org>
  * All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
  * 1. Redistributions of source code must retain the above copyright
  *    notice, this list of conditions and the following disclaimer.
  * 2. Redistributions in binary form must reproduce the above copyright
  *    notice, this list of conditions and the following disclaimer in the
  *    documentation and/or other materials provided with the distribution.
  * 4. Neither the name of the author nor the names of any co-contributors
  *    may be used to endorse or promote products derived from this software
  *    without specific prior written permission.
  *
  * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
  * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
  * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
  * ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
  * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
  * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
  * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
  * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
  * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
  * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
  * SUCH DAMAGE.
  */
 
 /*
  * This module provides MI support for per-cpu data.
  *
  * Each architecture determines the mapping of logical CPU IDs to physical
  * CPUs.  The requirements of this mapping are as follows:
  *  - Logical CPU IDs must reside in the range 0 ... MAXCPU - 1.
  *  - The mapping is not required to be dense.  That is, there may be
  *    gaps in the mappings.
  *  - The platform sets the value of MAXCPU in <machine/param.h>.
  *  - It is suggested, but not required, that in the non-SMP case, the
  *    platform define MAXCPU to be 1 and define the logical ID of the
  *    sole CPU as 0.
  */
 
 #include <sys/cdefs.h>
 __FBSDID("$FreeBSD$");
 
 #include "opt_ddb.h"
 
 #include <sys/param.h>
 #include <sys/systm.h>
 #include <sys/sysctl.h>
 #include <sys/lock.h>
 #include <sys/malloc.h>
 #include <sys/pcpu.h>
 #include <sys/proc.h>
 #include <sys/smp.h>
 #include <sys/sx.h>
 #include <ddb/ddb.h>
 
 MALLOC_DEFINE(M_PCPU, "Per-cpu", "Per-cpu resource accounting.");
 
 struct dpcpu_free {
 	uintptr_t	df_start;
 	int		df_len;
 	TAILQ_ENTRY(dpcpu_free) df_link;
 };
 
 static DPCPU_DEFINE(char, modspace[DPCPU_MODMIN]);
 static TAILQ_HEAD(, dpcpu_free) dpcpu_head = TAILQ_HEAD_INITIALIZER(dpcpu_head);
 static struct sx dpcpu_lock;
 uintptr_t dpcpu_off[MAXCPU];
 struct pcpu *cpuid_to_pcpu[MAXCPU];
 struct cpuhead cpuhead = STAILQ_HEAD_INITIALIZER(cpuhead);
 
 /*
  * Initialize the MI portions of a struct pcpu.
  */
 void
 pcpu_init(struct pcpu *pcpu, int cpuid, size_t size)
 {
 
 	bzero(pcpu, size);
 	KASSERT(cpuid >= 0 && cpuid < MAXCPU,
 	    ("pcpu_init: invalid cpuid %d", cpuid));
 	pcpu->pc_cpuid = cpuid;
-	pcpu->pc_cpumask = 1 << cpuid;
+	CPU_SETOF(cpuid, &pcpu->pc_cpumask);
 	cpuid_to_pcpu[cpuid] = pcpu;
 	STAILQ_INSERT_TAIL(&cpuhead, pcpu, pc_allcpu);
 	cpu_pcpu_init(pcpu, cpuid, size);
 	pcpu->pc_rm_queue.rmq_next = &pcpu->pc_rm_queue;
 	pcpu->pc_rm_queue.rmq_prev = &pcpu->pc_rm_queue;
 #ifdef KTR
 	snprintf(pcpu->pc_name, sizeof(pcpu->pc_name), "CPU %d", cpuid);
 #endif
 }
 
 void
 dpcpu_init(void *dpcpu, int cpuid)
 {
 	struct pcpu *pcpu;
 
 	pcpu = pcpu_find(cpuid);
 	pcpu->pc_dynamic = (uintptr_t)dpcpu - DPCPU_START;
 
 	/*
 	 * Initialize defaults from our linker section.
 	 */
 	memcpy(dpcpu, (void *)DPCPU_START, DPCPU_BYTES);
 
 	/*
 	 * Place it in the global pcpu offset array.
 	 */
 	dpcpu_off[cpuid] = pcpu->pc_dynamic;
 }
 
 static void
 dpcpu_startup(void *dummy __unused)
 {
 	struct dpcpu_free *df;
 
 	df = malloc(sizeof(*df), M_PCPU, M_WAITOK | M_ZERO);
 	df->df_start = (uintptr_t)&DPCPU_NAME(modspace);
 	df->df_len = DPCPU_MODMIN;
 	TAILQ_INSERT_HEAD(&dpcpu_head, df, df_link);
 	sx_init(&dpcpu_lock, "dpcpu alloc lock");
 }
 SYSINIT(dpcpu, SI_SUB_KLD, SI_ORDER_FIRST, dpcpu_startup, 0);
 
 /*
  * First-fit extent based allocator for allocating space in the per-cpu
  * region reserved for modules.  This is only intended for use by the
  * kernel linkers to place module linker sets.
  */
 void *
 dpcpu_alloc(int size)
 {
 	struct dpcpu_free *df;
 	void *s;
 
 	s = NULL;
 	size = roundup2(size, sizeof(void *));
 	sx_xlock(&dpcpu_lock);
 	TAILQ_FOREACH(df, &dpcpu_head, df_link) {
 		if (df->df_len < size)
 			continue;
 		if (df->df_len == size) {
 			s = (void *)df->df_start;
 			TAILQ_REMOVE(&dpcpu_head, df, df_link);
 			free(df, M_PCPU);
 			break;
 		}
 		s = (void *)df->df_start;
 		df->df_len -= size;
 		df->df_start = df->df_start + size;
 		break;
 	}
 	sx_xunlock(&dpcpu_lock);
 
 	return (s);
 }
 
 /*
  * Free dynamic per-cpu space at module unload time. 
  */
 void
 dpcpu_free(void *s, int size)
 {
 	struct dpcpu_free *df;
 	struct dpcpu_free *dn;
 	uintptr_t start;
 	uintptr_t end;
 
 	size = roundup2(size, sizeof(void *));
 	start = (uintptr_t)s;
 	end = start + size;
 	/*
 	 * Free a region of space and merge it with as many neighbors as
 	 * possible.  Keeping the list sorted simplifies this operation.
 	 */
 	sx_xlock(&dpcpu_lock);
 	TAILQ_FOREACH(df, &dpcpu_head, df_link) {
 		if (df->df_start > end)
 			break;
 		/*
 		 * If we expand at the end of an entry we may have to
 		 * merge it with the one following it as well.
 		 */
 		if (df->df_start + df->df_len == start) {
 			df->df_len += size;
 			dn = TAILQ_NEXT(df, df_link);
 			if (df->df_start + df->df_len == dn->df_start) {
 				df->df_len += dn->df_len;
 				TAILQ_REMOVE(&dpcpu_head, dn, df_link);
 				free(dn, M_PCPU);
 			}
 			sx_xunlock(&dpcpu_lock);
 			return;
 		}
 		if (df->df_start == end) {
 			df->df_start = start;
 			df->df_len += size;
 			sx_xunlock(&dpcpu_lock);
 			return;
 		}
 	}
 	dn = malloc(sizeof(*df), M_PCPU, M_WAITOK | M_ZERO);
 	dn->df_start = start;
 	dn->df_len = size;
 	if (df)
 		TAILQ_INSERT_BEFORE(df, dn, df_link);
 	else
 		TAILQ_INSERT_TAIL(&dpcpu_head, dn, df_link);
 	sx_xunlock(&dpcpu_lock);
 }
 
 /*
  * Initialize the per-cpu storage from an updated linker-set region.
  */
 void
 dpcpu_copy(void *s, int size)
 {
 #ifdef SMP
 	uintptr_t dpcpu;
 	int i;
 
 	for (i = 0; i < mp_ncpus; ++i) {
 		dpcpu = dpcpu_off[i];
 		if (dpcpu == 0)
 			continue;
 		memcpy((void *)(dpcpu + (uintptr_t)s), s, size);
 	}
 #else
 	memcpy((void *)(dpcpu_off[0] + (uintptr_t)s), s, size);
 #endif
 }
 
 /*
  * Destroy a struct pcpu.
  */
 void
 pcpu_destroy(struct pcpu *pcpu)
 {
 
 	STAILQ_REMOVE(&cpuhead, pcpu, pcpu, pc_allcpu);
 	cpuid_to_pcpu[pcpu->pc_cpuid] = NULL;
 	dpcpu_off[pcpu->pc_cpuid] = 0;
 }
 
 /*
  * Locate a struct pcpu by cpu id.
  */
 struct pcpu *
 pcpu_find(u_int cpuid)
 {
 
 	return (cpuid_to_pcpu[cpuid]);
 }
 
 int
 sysctl_dpcpu_quad(SYSCTL_HANDLER_ARGS)
 {
 	uintptr_t dpcpu;
 	int64_t count;
 	int i;
 
 	count = 0;
 	for (i = 0; i < mp_ncpus; ++i) {
 		dpcpu = dpcpu_off[i];
 		if (dpcpu == 0)
 			continue;
 		count += *(int64_t *)(dpcpu + (uintptr_t)arg1);
 	}
 	return (SYSCTL_OUT(req, &count, sizeof(count)));
 }
 
 int
 sysctl_dpcpu_long(SYSCTL_HANDLER_ARGS)
 {
 	uintptr_t dpcpu;
 	long count;
 	int i;
 
 	count = 0;
 	for (i = 0; i < mp_ncpus; ++i) {
 		dpcpu = dpcpu_off[i];
 		if (dpcpu == 0)
 			continue;
 		count += *(long *)(dpcpu + (uintptr_t)arg1);
 	}
 	return (SYSCTL_OUT(req, &count, sizeof(count)));
 }
 
 int
 sysctl_dpcpu_int(SYSCTL_HANDLER_ARGS)
 {
 	uintptr_t dpcpu;
 	int count;
 	int i;
 
 	count = 0;
 	for (i = 0; i < mp_ncpus; ++i) {
 		dpcpu = dpcpu_off[i];
 		if (dpcpu == 0)
 			continue;
 		count += *(int *)(dpcpu + (uintptr_t)arg1);
 	}
 	return (SYSCTL_OUT(req, &count, sizeof(count)));
 }
 
 #ifdef DDB
 DB_SHOW_COMMAND(dpcpu_off, db_show_dpcpu_off)
 {
 	int id;
 
 	CPU_FOREACH(id) {
 		db_printf("dpcpu_off[%2d] = 0x%jx (+ DPCPU_START = %p)\n",
 		    id, (uintmax_t)dpcpu_off[id],
 		    (void *)(uintptr_t)(dpcpu_off[id] + DPCPU_START));
 	}
 }
 
 static void
 show_pcpu(struct pcpu *pc)
 {
 	struct thread *td;
 
 	db_printf("cpuid        = %d\n", pc->pc_cpuid);
 	db_printf("dynamic pcpu = %p\n", (void *)pc->pc_dynamic);
 	db_printf("curthread    = ");
 	td = pc->pc_curthread;
 	if (td != NULL)
 		db_printf("%p: pid %d \"%s\"\n", td, td->td_proc->p_pid,
 		    td->td_name);
 	else
 		db_printf("none\n");
 	db_printf("curpcb       = %p\n", pc->pc_curpcb);
 	db_printf("fpcurthread  = ");
 	td = pc->pc_fpcurthread;
 	if (td != NULL)
 		db_printf("%p: pid %d \"%s\"\n", td, td->td_proc->p_pid,
 		    td->td_name);
 	else
 		db_printf("none\n");
 	db_printf("idlethread   = ");
 	td = pc->pc_idlethread;
 	if (td != NULL)
 		db_printf("%p: tid %d \"%s\"\n", td, td->td_tid, td->td_name);
 	else
 		db_printf("none\n");
 	db_show_mdpcpu(pc);
 
 #ifdef VIMAGE
 	db_printf("curvnet      = %p\n", pc->pc_curthread->td_vnet);
 #endif
 
 #ifdef WITNESS
 	db_printf("spin locks held:\n");
 	witness_list_locks(&pc->pc_spinlocks, db_printf);
 #endif
 }
 
 DB_SHOW_COMMAND(pcpu, db_show_pcpu)
 {
 	struct pcpu *pc;
 	int id;
 
 	if (have_addr)
 		id = ((addr >> 4) % 16) * 10 + (addr % 16);
 	else
 		id = PCPU_GET(cpuid);
 	pc = pcpu_find(id);
 	if (pc == NULL) {
 		db_printf("CPU %d not found\n", id);
 		return;
 	}
 	show_pcpu(pc);
 }
 
 DB_SHOW_ALL_COMMAND(pcpu, db_show_cpu_all)
 {
 	struct pcpu *pc;
 	int id;
 
 	db_printf("Current CPU: %d\n\n", PCPU_GET(cpuid));
 	for (id = 0; id <= mp_maxid; id++) {
 		pc = pcpu_find(id);
 		if (pc != NULL) {
 			show_pcpu(pc);
 			db_printf("\n");
 		}
 	}
 }
 DB_SHOW_ALIAS(allpcpu, db_show_cpu_all);
 #endif
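Note on the pc_cpumask change in pcpu_init() above: `1 << cpuid` is an `int` expression, so it can name only as many CPUs as an `int` has bits, while a set stored as an array of words can grow with MAXCPU. The following is a minimal userland sketch of that idea. Every `demo_` name and the value 128 are hypothetical; this is not the kernel's cpuset implementation, only an illustration of what a "set exactly this one CPU" operation such as CPU_SETOF has to do.

```c
#include <assert.h>
#include <limits.h>
#include <string.h>

/* Hypothetical names for illustration; not the kernel's cpuset API. */
#define	DEMO_MAXCPU	128
#define	DEMO_WORDBITS	(sizeof(unsigned long) * CHAR_BIT)
#define	DEMO_NWORDS	((DEMO_MAXCPU + DEMO_WORDBITS - 1) / DEMO_WORDBITS)

typedef struct {
	unsigned long	bits[DEMO_NWORDS];
} demo_cpuset_t;

/* Clear every bit in the set. */
static void
demo_cpu_zero(demo_cpuset_t *set)
{

	memset(set, 0, sizeof(*set));
}

/* Add one CPU to the set, leaving the other bits alone. */
static void
demo_cpu_set(int cpu, demo_cpuset_t *set)
{

	assert(cpu >= 0 && cpu < DEMO_MAXCPU);
	set->bits[cpu / DEMO_WORDBITS] |= 1UL << (cpu % DEMO_WORDBITS);
}

/* Make the set contain exactly one CPU: zero first, then set the bit. */
static void
demo_cpu_setof(int cpu, demo_cpuset_t *set)
{

	demo_cpu_zero(set);
	demo_cpu_set(cpu, set);
}

/* Return nonzero if the CPU is a member of the set. */
static int
demo_cpu_isset(int cpu, const demo_cpuset_t *set)
{

	assert(cpu >= 0 && cpu < DEMO_MAXCPU);
	return (((set->bits[cpu / DEMO_WORDBITS] >>
	    (cpu % DEMO_WORDBITS)) & 1UL) != 0);
}
```

With this layout, `demo_cpu_setof(70, &set)` is well defined, whereas `1 << 70` is not a valid shift on an `int`. The cost is that a set is no longer a scalar, so it is built and queried through helpers that take a pointer.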
Index: head/sys/kern/subr_smp.c
===================================================================
--- head/sys/kern/subr_smp.c	(revision 222812)
+++ head/sys/kern/subr_smp.c	(revision 222813)
@@ -1,709 +1,724 @@
 /*-
  * Copyright (c) 2001, John Baldwin <jhb@FreeBSD.org>.
  * All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
  * 1. Redistributions of source code must retain the above copyright
  *    notice, this list of conditions and the following disclaimer.
  * 2. Redistributions in binary form must reproduce the above copyright
  *    notice, this list of conditions and the following disclaimer in the
  *    documentation and/or other materials provided with the distribution.
  * 3. Neither the name of the author nor the names of any co-contributors
  *    may be used to endorse or promote products derived from this software
  *    without specific prior written permission.
  *
  * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
  * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
  * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
  * ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
  * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
  * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
  * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
  * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
  * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
  * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
  * SUCH DAMAGE.
  */
 
 /*
  * This module holds the global variables and machine independent functions
  * used for the kernel SMP support.
  */
 
 #include <sys/cdefs.h>
 __FBSDID("$FreeBSD$");
 
 #include <sys/param.h>
 #include <sys/systm.h>
 #include <sys/kernel.h>
 #include <sys/ktr.h>
 #include <sys/proc.h>
 #include <sys/bus.h>
 #include <sys/lock.h>
 #include <sys/mutex.h>
 #include <sys/pcpu.h>
 #include <sys/smp.h>
 #include <sys/sysctl.h>
 
 #include <machine/cpu.h>
 #include <machine/smp.h>
 
 #include "opt_sched.h"
 
 #ifdef SMP
-volatile cpumask_t stopped_cpus;
-volatile cpumask_t started_cpus;
-cpumask_t hlt_cpus_mask;
-cpumask_t logical_cpus_mask;
+volatile cpuset_t stopped_cpus;
+volatile cpuset_t started_cpus;
+cpuset_t hlt_cpus_mask;
+cpuset_t logical_cpus_mask;
 
 void (*cpustop_restartfunc)(void);
 #endif
 /* This is used in modules that need to work in both SMP and UP. */
-cpumask_t all_cpus;
+cpuset_t all_cpus;
 
 int mp_ncpus;
 /* export this for libkvm consumers. */
 int mp_maxcpus = MAXCPU;
 
 volatile int smp_started;
 u_int mp_maxid;
 
 SYSCTL_NODE(_kern, OID_AUTO, smp, CTLFLAG_RD, NULL, "Kernel SMP");
 
 SYSCTL_UINT(_kern_smp, OID_AUTO, maxid, CTLFLAG_RD, &mp_maxid, 0,
     "Max CPU ID.");
 
 SYSCTL_INT(_kern_smp, OID_AUTO, maxcpus, CTLFLAG_RD, &mp_maxcpus, 0,
     "Max number of CPUs that the system was compiled for.");
 
 int smp_active = 0;	/* are the APs allowed to run? */
 SYSCTL_INT(_kern_smp, OID_AUTO, active, CTLFLAG_RW, &smp_active, 0,
     "Number of Auxillary Processors (APs) that were successfully started");
 
 int smp_disabled = 0;	/* has smp been disabled? */
 SYSCTL_INT(_kern_smp, OID_AUTO, disabled, CTLFLAG_RDTUN, &smp_disabled, 0,
     "SMP has been disabled from the loader");
 TUNABLE_INT("kern.smp.disabled", &smp_disabled);
 
 int smp_cpus = 1;	/* how many cpu's running */
 SYSCTL_INT(_kern_smp, OID_AUTO, cpus, CTLFLAG_RD, &smp_cpus, 0,
     "Number of CPUs online");
 
 int smp_topology = 0;	/* Which topology we're using. */
 SYSCTL_INT(_kern_smp, OID_AUTO, topology, CTLFLAG_RD, &smp_topology, 0,
     "Topology override setting; 0 is default provided by hardware.");
 TUNABLE_INT("kern.smp.topology", &smp_topology);
 
 #ifdef SMP
 /* Enable forwarding of a signal to a process running on a different CPU */
 static int forward_signal_enabled = 1;
 SYSCTL_INT(_kern_smp, OID_AUTO, forward_signal_enabled, CTLFLAG_RW,
 	   &forward_signal_enabled, 0,
 	   "Forwarding of a signal to a process on a different CPU");
 
 /* Variables needed for SMP rendezvous. */
 static volatile int smp_rv_ncpus;
 static void (*volatile smp_rv_setup_func)(void *arg);
 static void (*volatile smp_rv_action_func)(void *arg);
 static void (*volatile smp_rv_teardown_func)(void *arg);
 static void *volatile smp_rv_func_arg;
 static volatile int smp_rv_waiters[3];
 static volatile int smp_rv_generation;
 
 /* 
  * Shared mutex to restrict busywaits between smp_rendezvous() and
  * smp(_targeted)_tlb_shootdown().  A deadlock occurs if both of these
  * functions trigger at once and cause multiple CPUs to busywait with
  * interrupts disabled. 
  */
 struct mtx smp_ipi_mtx;
 
 /*
  * Let the MD SMP code initialize mp_maxid very early if it can.
  */
 static void
 mp_setmaxid(void *dummy)
 {
 	cpu_mp_setmaxid();
 }
 SYSINIT(cpu_mp_setmaxid, SI_SUB_TUNABLES, SI_ORDER_FIRST, mp_setmaxid, NULL);
 
 /*
  * Call the MD SMP initialization code.
  */
 static void
 mp_start(void *dummy)
 {
 
 	mtx_init(&smp_ipi_mtx, "smp rendezvous", NULL, MTX_SPIN);
 
 	/* Probe for MP hardware. */
 	if (smp_disabled != 0 || cpu_mp_probe() == 0) {
 		mp_ncpus = 1;
 		all_cpus = PCPU_GET(cpumask);
 		return;
 	}
 
 	cpu_mp_start();
 	printf("FreeBSD/SMP: Multiprocessor System Detected: %d CPUs\n",
 	    mp_ncpus);
 	cpu_mp_announce();
 }
 SYSINIT(cpu_mp, SI_SUB_CPU, SI_ORDER_THIRD, mp_start, NULL);
 
 void
 forward_signal(struct thread *td)
 {
 	int id;
 
 	/*
 	 * signotify() has already set TDF_ASTPENDING and TDF_NEEDSIGCHECK on
 	 * this thread, so all we need to do is poke it if it is currently
 	 * executing so that it executes ast().
 	 */
 	THREAD_LOCK_ASSERT(td, MA_OWNED);
 	KASSERT(TD_IS_RUNNING(td),
 	    ("forward_signal: thread is not TDS_RUNNING"));
 
 	CTR1(KTR_SMP, "forward_signal(%p)", td->td_proc);
 
 	if (!smp_started || cold || panicstr)
 		return;
 	if (!forward_signal_enabled)
 		return;
 
 	/* No need to IPI ourself. */
 	if (td == curthread)
 		return;
 
 	id = td->td_oncpu;
 	if (id == NOCPU)
 		return;
 	ipi_cpu(id, IPI_AST);
 }
 
 /*
  * When called the executing CPU will send an IPI to all other CPUs
  *  requesting that they halt execution.
  *
  * Usually (but not necessarily) called with 'other_cpus' as its arg.
  *
  *  - Signals all CPUs in map to stop.
  *  - Waits for each to stop.
  *
  * Returns:
  *  -1: error
  *   0: NA
  *   1: ok
  *
  */
 static int
-generic_stop_cpus(cpumask_t map, u_int type)
+generic_stop_cpus(cpuset_t map, u_int type)
 {
+#ifdef KTR
+	char cpusetbuf[CPUSETBUFSIZ];
+#endif
 	static volatile u_int stopping_cpu = NOCPU;
 	int i;
 
 	KASSERT(
 #if defined(__amd64__)
 	    type == IPI_STOP || type == IPI_STOP_HARD || type == IPI_SUSPEND,
 #else
 	    type == IPI_STOP || type == IPI_STOP_HARD,
 #endif
 	    ("%s: invalid stop type", __func__));
 
 	if (!smp_started)
 		return (0);
 
-	CTR2(KTR_SMP, "stop_cpus(%x) with %u type", map, type);
+	CTR2(KTR_SMP, "stop_cpus(%s) with %u type",
+	    cpusetobj_strprint(cpusetbuf, &map), type);
 
 	if (stopping_cpu != PCPU_GET(cpuid))
 		while (atomic_cmpset_int(&stopping_cpu, NOCPU,
 		    PCPU_GET(cpuid)) == 0)
 			while (stopping_cpu != NOCPU)
 				cpu_spinwait(); /* spin */
 
 	/* send the stop IPI to all CPUs in map */
 	ipi_selected(map, type);
 
 	i = 0;
-	while ((stopped_cpus & map) != map) {
+	while (!CPU_SUBSET(&stopped_cpus, &map)) {
 		/* spin */
 		cpu_spinwait();
 		i++;
 #ifdef DIAGNOSTIC
 		if (i == 100000) {
 			printf("timeout stopping cpus\n");
 			break;
 		}
 #endif
 	}
 
 	stopping_cpu = NOCPU;
 	return (1);
 }
 
 int
-stop_cpus(cpumask_t map)
+stop_cpus(cpuset_t map)
 {
 
 	return (generic_stop_cpus(map, IPI_STOP));
 }
 
 int
-stop_cpus_hard(cpumask_t map)
+stop_cpus_hard(cpuset_t map)
 {
 
 	return (generic_stop_cpus(map, IPI_STOP_HARD));
 }
 
 #if defined(__amd64__)
 int
-suspend_cpus(cpumask_t map)
+suspend_cpus(cpuset_t map)
 {
 
 	return (generic_stop_cpus(map, IPI_SUSPEND));
 }
 #endif
 
 /*
  * Called by a CPU to restart stopped CPUs. 
  *
  * Usually (but not necessarily) called with 'stopped_cpus' as its arg.
  *
  *  - Signals all CPUs in map to restart.
  *  - Waits for each to restart.
  *
  * Returns:
  *  -1: error
  *   0: NA
  *   1: ok
  */
 int
-restart_cpus(cpumask_t map)
+restart_cpus(cpuset_t map)
 {
+#ifdef KTR
+	char cpusetbuf[CPUSETBUFSIZ];
+#endif
 
 	if (!smp_started)
 		return 0;
 
-	CTR1(KTR_SMP, "restart_cpus(%x)", map);
+	CTR1(KTR_SMP, "restart_cpus(%s)", cpusetobj_strprint(cpusetbuf, &map));
 
 	/* signal other cpus to restart */
-	atomic_store_rel_int(&started_cpus, map);
+	CPU_COPY_STORE_REL(&map, &started_cpus);
 
 	/* wait for each to clear its bit */
-	while ((stopped_cpus & map) != 0)
+	while (CPU_OVERLAP(&stopped_cpus, &map))
 		cpu_spinwait();
 
 	return 1;
 }
 
 /*
  * All-CPU rendezvous.  CPUs are signalled, all execute the setup function 
  * (if specified), rendezvous, execute the action function (if specified),
  * rendezvous again, execute the teardown function (if specified), and then
  * resume.
  *
  * Note that the supplied external functions _must_ be reentrant and aware
  * that they are running in parallel and in an unknown lock context.
  */
 void
 smp_rendezvous_action(void)
 {
 	struct thread *td;
 	void *local_func_arg;
 	void (*local_setup_func)(void*);
 	void (*local_action_func)(void*);
 	void (*local_teardown_func)(void*);
 	int generation;
 #ifdef INVARIANTS
 	int owepreempt;
 #endif
 
 	/* Ensure we have up-to-date values. */
 	atomic_add_acq_int(&smp_rv_waiters[0], 1);
 	while (smp_rv_waiters[0] < smp_rv_ncpus)
 		cpu_spinwait();
 
 	/* Fetch rendezvous parameters after acquire barrier. */
 	local_func_arg = smp_rv_func_arg;
 	local_setup_func = smp_rv_setup_func;
 	local_action_func = smp_rv_action_func;
 	local_teardown_func = smp_rv_teardown_func;
 	generation = smp_rv_generation;
 
 	/*
 	 * Use a nested critical section to prevent any preemptions
 	 * from occurring during a rendezvous action routine.
 	 * Specifically, if a rendezvous handler is invoked via an IPI
 	 * and the interrupted thread was in the critical_exit()
 	 * function after setting td_critnest to 0 but before
 	 * performing a deferred preemption, this routine can be
 	 * invoked with td_critnest set to 0 and td_owepreempt true.
 	 * In that case, a critical_exit() during the rendezvous
 	 * action would trigger a preemption which is not permitted in
 	 * a rendezvous action.  To fix this, wrap all of the
 	 * rendezvous action handlers in a critical section.  We
 	 * cannot use a regular critical section however as having
 	 * critical_exit() preempt from this routine would also be
 	 * problematic (the preemption must not occur before the IPI
 	 * has been acknowledged via an EOI).  Instead, we
 	 * intentionally ignore td_owepreempt when leaving the
 	 * critical section.  This should be harmless because we do
 	 * not permit rendezvous action routines to schedule threads,
 	 * and thus td_owepreempt should never transition from 0 to 1
 	 * during this routine.
 	 */
 	td = curthread;
 	td->td_critnest++;
 #ifdef INVARIANTS
 	owepreempt = td->td_owepreempt;
 #endif
 	
 	/*
 	 * If requested, run a setup function before the main action
 	 * function.  Ensure all CPUs have completed the setup
 	 * function before moving on to the action function.
 	 */
 	if (local_setup_func != smp_no_rendevous_barrier) {
 		if (smp_rv_setup_func != NULL)
 			smp_rv_setup_func(smp_rv_func_arg);
 		atomic_add_int(&smp_rv_waiters[1], 1);
 		while (smp_rv_waiters[1] < smp_rv_ncpus)
                 	cpu_spinwait();
 	}
 
 	if (local_action_func != NULL)
 		local_action_func(local_func_arg);
 
 	/*
 	 * Signal that the main action has been completed.  If a
 	 * full exit rendezvous is requested, then all CPUs will
 	 * wait here until all CPUs have finished the main action.
 	 *
 	 * Note that the write by the last CPU to finish the action
 	 * may become visible to different CPUs at different times.
 	 * As a result, the CPU that initiated the rendezvous may
 	 * exit the rendezvous and drop the lock allowing another
 	 * rendezvous to be initiated on the same CPU or a different
 	 * CPU.  In that case the exit sentinel may be cleared before
 	 * all CPUs have noticed causing those CPUs to hang forever.
 	 * Workaround this by using a generation count to notice when
 	 * this race occurs and to exit the rendezvous in that case.
 	 */
 	MPASS(generation == smp_rv_generation);
 	atomic_add_int(&smp_rv_waiters[2], 1);
 	if (local_teardown_func != smp_no_rendevous_barrier) {
 		while (smp_rv_waiters[2] < smp_rv_ncpus &&
 		    generation == smp_rv_generation)
 			cpu_spinwait();
 
 		if (local_teardown_func != NULL)
 			local_teardown_func(local_func_arg);
 	}
 
 	td->td_critnest--;
 	KASSERT(owepreempt == td->td_owepreempt,
 	    ("rendezvous action changed td_owepreempt"));
 }
 
 void
-smp_rendezvous_cpus(cpumask_t map,
+smp_rendezvous_cpus(cpuset_t map,
 	void (* setup_func)(void *), 
 	void (* action_func)(void *),
 	void (* teardown_func)(void *),
 	void *arg)
 {
-	int i, ncpus = 0;
+	int curcpumap, i, ncpus = 0;
 
 	if (!smp_started) {
 		if (setup_func != NULL)
 			setup_func(arg);
 		if (action_func != NULL)
 			action_func(arg);
 		if (teardown_func != NULL)
 			teardown_func(arg);
 		return;
 	}
 
 	CPU_FOREACH(i) {
-		if (((1 << i) & map) != 0)
+		if (CPU_ISSET(i, &map))
 			ncpus++;
 	}
 	if (ncpus == 0)
-		panic("ncpus is 0 with map=0x%x", map);
+		panic("ncpus is 0 with non-zero map");
 
 	mtx_lock_spin(&smp_ipi_mtx);
 
 	atomic_add_acq_int(&smp_rv_generation, 1);
 
 	/* Pass rendezvous parameters via global variables. */
 	smp_rv_ncpus = ncpus;
 	smp_rv_setup_func = setup_func;
 	smp_rv_action_func = action_func;
 	smp_rv_teardown_func = teardown_func;
 	smp_rv_func_arg = arg;
 	smp_rv_waiters[1] = 0;
 	smp_rv_waiters[2] = 0;
 	atomic_store_rel_int(&smp_rv_waiters[0], 0);
 
 	/*
 	 * Signal other processors, which will enter the IPI with
 	 * interrupts off.
 	 */
-	ipi_selected(map & ~(1 << curcpu), IPI_RENDEZVOUS);
+	curcpumap = CPU_ISSET(curcpu, &map);
+	CPU_CLR(curcpu, &map);
+	ipi_selected(map, IPI_RENDEZVOUS);
 
 	/* Check if the current CPU is in the map */
-	if ((map & (1 << curcpu)) != 0)
+	if (curcpumap != 0)
 		smp_rendezvous_action();
 
 	/*
 	 * If the caller did not request an exit barrier to be enforced
 	 * on each CPU, ensure that this CPU waits for all the other
 	 * CPUs to finish the rendezvous.
 	 */
 	if (teardown_func == smp_no_rendevous_barrier)
 		while (atomic_load_acq_int(&smp_rv_waiters[2]) < ncpus)
 			cpu_spinwait();
 
 	mtx_unlock_spin(&smp_ipi_mtx);
 }
 
 void
 smp_rendezvous(void (* setup_func)(void *), 
 	       void (* action_func)(void *),
 	       void (* teardown_func)(void *),
 	       void *arg)
 {
 	smp_rendezvous_cpus(all_cpus, setup_func, action_func, teardown_func, arg);
 }
 
 static struct cpu_group group[MAXCPU];
 
 struct cpu_group *
 smp_topo(void)
 {
+	char cpusetbuf[CPUSETBUFSIZ], cpusetbuf2[CPUSETBUFSIZ];
 	struct cpu_group *top;
 
 	/*
 	 * Check for a fake topology request for debugging purposes.
 	 */
 	switch (smp_topology) {
 	case 1:
 		/* Dual core with no sharing.  */
 		top = smp_topo_1level(CG_SHARE_NONE, 2, 0);
 		break;
 	case 2:
 		/* No topology, all cpus are equal. */
 		top = smp_topo_none();
 		break;
 	case 3:
 		/* Dual core with shared L2.  */
 		top = smp_topo_1level(CG_SHARE_L2, 2, 0);
 		break;
 	case 4:
 		/* quad core, shared l3 among each package, private l2.  */
 		top = smp_topo_1level(CG_SHARE_L3, 4, 0);
 		break;
 	case 5:
 		/* quad core,  2 dualcore parts on each package share l2.  */
 		top = smp_topo_2level(CG_SHARE_NONE, 2, CG_SHARE_L2, 2, 0);
 		break;
 	case 6:
 		/* Single-core 2xHTT */
 		top = smp_topo_1level(CG_SHARE_L1, 2, CG_FLAG_HTT);
 		break;
 	case 7:
 		/* quad core with a shared l3, 8 threads sharing L2.  */
 		top = smp_topo_2level(CG_SHARE_L3, 4, CG_SHARE_L2, 8,
 		    CG_FLAG_SMT);
 		break;
 	default:
 		/* Default, ask the system what it wants. */
 		top = cpu_topo();
 		break;
 	}
 	/*
 	 * Verify the returned topology.
 	 */
 	if (top->cg_count != mp_ncpus)
 		panic("Built bad topology at %p.  CPU count %d != %d",
 		    top, top->cg_count, mp_ncpus);
-	if (top->cg_mask != all_cpus)
-		panic("Built bad topology at %p.  CPU mask 0x%X != 0x%X",
-		    top, top->cg_mask, all_cpus);
+	if (CPU_CMP(&top->cg_mask, &all_cpus))
+		panic("Built bad topology at %p.  CPU mask (%s) != (%s)",
+		    top, cpusetobj_strprint(cpusetbuf, &top->cg_mask),
+		    cpusetobj_strprint(cpusetbuf2, &all_cpus));
 	return (top);
 }
 
 struct cpu_group *
 smp_topo_none(void)
 {
 	struct cpu_group *top;
 
 	top = &group[0];
 	top->cg_parent = NULL;
 	top->cg_child = NULL;
 	top->cg_mask = all_cpus;
 	top->cg_count = mp_ncpus;
 	top->cg_children = 0;
 	top->cg_level = CG_SHARE_NONE;
 	top->cg_flags = 0;
 	
 	return (top);
 }
 
 static int
 smp_topo_addleaf(struct cpu_group *parent, struct cpu_group *child, int share,
     int count, int flags, int start)
 {
-	cpumask_t mask;
+	char cpusetbuf[CPUSETBUFSIZ], cpusetbuf2[CPUSETBUFSIZ];
+	cpuset_t mask;
 	int i;
 
-	for (mask = 0, i = 0; i < count; i++, start++)
-		mask |= (1 << start);
+	CPU_ZERO(&mask);
+	for (i = 0; i < count; i++, start++)
+		CPU_SET(start, &mask);
 	child->cg_parent = parent;
 	child->cg_child = NULL;
 	child->cg_children = 0;
 	child->cg_level = share;
 	child->cg_count = count;
 	child->cg_flags = flags;
 	child->cg_mask = mask;
 	parent->cg_children++;
 	for (; parent != NULL; parent = parent->cg_parent) {
-		if ((parent->cg_mask & child->cg_mask) != 0)
-			panic("Duplicate children in %p.  mask 0x%X child 0x%X",
-			    parent, parent->cg_mask, child->cg_mask);
-		parent->cg_mask |= child->cg_mask;
+		if (CPU_OVERLAP(&parent->cg_mask, &child->cg_mask))
+			panic("Duplicate children in %p.  mask (%s) child (%s)",
+			    parent,
+			    cpusetobj_strprint(cpusetbuf, &parent->cg_mask),
+			    cpusetobj_strprint(cpusetbuf2, &child->cg_mask));
+		CPU_OR(&parent->cg_mask, &child->cg_mask);
 		parent->cg_count += child->cg_count;
 	}
 
 	return (start);
 }
 
 struct cpu_group *
 smp_topo_1level(int share, int count, int flags)
 {
 	struct cpu_group *child;
 	struct cpu_group *top;
 	int packages;
 	int cpu;
 	int i;
 
 	cpu = 0;
 	top = &group[0];
 	packages = mp_ncpus / count;
 	top->cg_child = child = &group[1];
 	top->cg_level = CG_SHARE_NONE;
 	for (i = 0; i < packages; i++, child++)
 		cpu = smp_topo_addleaf(top, child, share, count, flags, cpu);
 	return (top);
 }
 
 struct cpu_group *
 smp_topo_2level(int l2share, int l2count, int l1share, int l1count,
     int l1flags)
 {
 	struct cpu_group *top;
 	struct cpu_group *l1g;
 	struct cpu_group *l2g;
 	int cpu;
 	int i;
 	int j;
 
 	cpu = 0;
 	top = &group[0];
 	l2g = &group[1];
 	top->cg_child = l2g;
 	top->cg_level = CG_SHARE_NONE;
 	top->cg_children = mp_ncpus / (l2count * l1count);
 	l1g = l2g + top->cg_children;
 	for (i = 0; i < top->cg_children; i++, l2g++) {
 		l2g->cg_parent = top;
 		l2g->cg_child = l1g;
 		l2g->cg_level = l2share;
 		for (j = 0; j < l2count; j++, l1g++)
 			cpu = smp_topo_addleaf(l2g, l1g, l1share, l1count,
 			    l1flags, cpu);
 	}
 	return (top);
 }
 
 
 struct cpu_group *
 smp_topo_find(struct cpu_group *top, int cpu)
 {
 	struct cpu_group *cg;
-	cpumask_t mask;
+	cpuset_t mask;
 	int children;
 	int i;
 
-	mask = (1 << cpu);
+	CPU_SETOF(cpu, &mask);
 	cg = top;
 	for (;;) {
-		if ((cg->cg_mask & mask) == 0)
+		if (!CPU_OVERLAP(&cg->cg_mask, &mask))
 			return (NULL);
 		if (cg->cg_children == 0)
 			return (cg);
 		children = cg->cg_children;
 		for (i = 0, cg = cg->cg_child; i < children; cg++, i++)
-			if ((cg->cg_mask & mask) != 0)
+			if (CPU_OVERLAP(&cg->cg_mask, &mask))
 				break;
 	}
 	return (NULL);
 }
 #else /* !SMP */
 
 void
-smp_rendezvous_cpus(cpumask_t map,
+smp_rendezvous_cpus(cpuset_t map,
 	void (*setup_func)(void *), 
 	void (*action_func)(void *),
 	void (*teardown_func)(void *),
 	void *arg)
 {
 	if (setup_func != NULL)
 		setup_func(arg);
 	if (action_func != NULL)
 		action_func(arg);
 	if (teardown_func != NULL)
 		teardown_func(arg);
 }
 
 void
 smp_rendezvous(void (*setup_func)(void *), 
 	       void (*action_func)(void *),
 	       void (*teardown_func)(void *),
 	       void *arg)
 {
 
 	if (setup_func != NULL)
 		setup_func(arg);
 	if (action_func != NULL)
 		action_func(arg);
 	if (teardown_func != NULL)
 		teardown_func(arg);
 }
 
 /*
  * Provide dummy SMP support for UP kernels.  Modules that need to use SMP
  * APIs will still work using this dummy support.
  */
 static void
 mp_setvariables_for_up(void *dummy)
 {
 	mp_ncpus = 1;
 	mp_maxid = PCPU_GET(cpuid);
 	all_cpus = PCPU_GET(cpumask);
 	KASSERT(PCPU_GET(cpuid) == 0, ("UP must have a CPU ID of zero"));
 }
 SYSINIT(cpu_mp_setvariables, SI_SUB_TUNABLES, SI_ORDER_FIRST,
     mp_setvariables_for_up, NULL);
 #endif /* SMP */
 
 void
 smp_no_rendevous_barrier(void *dummy)
 {
 #ifdef SMP
 	KASSERT((!smp_started),("smp_no_rendevous called and smp is started"));
 #endif
 }
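Note on the predicate rewrites in subr_smp.c above: the integer tests `(stopped_cpus & map) != map`, `(stopped_cpus & map) != 0` and `a != b` become `!CPU_SUBSET(...)`, `CPU_OVERLAP(...)` and `CPU_CMP(...)`. The sketch below restates those three tests over a multi-word set in plain userland C. The `demo_` names are hypothetical, and the argument order of `demo_cpu_subset()` (container first, candidate subset second) is an assumption inferred from the call sites in this diff, not a statement about the kernel macros.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical names for illustration; not the kernel's cpuset API. */
#define	DEMO_MAXCPU	128
#define	DEMO_WORDBITS	(sizeof(unsigned long) * CHAR_BIT)
#define	DEMO_NWORDS	((DEMO_MAXCPU + DEMO_WORDBITS - 1) / DEMO_WORDBITS)

typedef struct {
	unsigned long	bits[DEMO_NWORDS];
} demo_cpuset_t;

static void
demo_cpu_set(int cpu, demo_cpuset_t *set)
{

	assert(cpu >= 0 && cpu < DEMO_MAXCPU);
	set->bits[cpu / DEMO_WORDBITS] |= 1UL << (cpu % DEMO_WORDBITS);
}

/* Old form: (parent & child) == child.  True if child is within parent. */
static int
demo_cpu_subset(const demo_cpuset_t *parent, const demo_cpuset_t *child)
{
	size_t i;

	for (i = 0; i < DEMO_NWORDS; i++)
		if ((parent->bits[i] & child->bits[i]) != child->bits[i])
			return (0);
	return (1);
}

/* Old form: (a & b) != 0.  True if the sets share at least one CPU. */
static int
demo_cpu_overlap(const demo_cpuset_t *a, const demo_cpuset_t *b)
{
	size_t i;

	for (i = 0; i < DEMO_NWORDS; i++)
		if ((a->bits[i] & b->bits[i]) != 0)
			return (1);
	return (0);
}

/* Old form: a != b.  Nonzero when the two sets differ. */
static int
demo_cpu_cmp(const demo_cpuset_t *a, const demo_cpuset_t *b)
{
	size_t i;

	for (i = 0; i < DEMO_NWORDS; i++)
		if (a->bits[i] != b->bits[i])
			return (1);
	return (0);
}
```

Read this way, the stop loop spins while the requested map is not yet contained in the stopped set, and the restart loop spins while any requested CPU is still marked stopped. Each test now walks every word of the set, which is why the converted code passes pointers to these helpers.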
Index: head/sys/mips/cavium/octeon_mp.c
===================================================================
--- head/sys/mips/cavium/octeon_mp.c	(revision 222812)
+++ head/sys/mips/cavium/octeon_mp.c	(revision 222813)
@@ -1,128 +1,136 @@
 /*-
  * Copyright (c) 2004-2010 Juli Mallett <jmallett@FreeBSD.org>
  * All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
  * 1. Redistributions of source code must retain the above copyright
  *    notice, this list of conditions and the following disclaimer.
  * 2. Redistributions in binary form must reproduce the above copyright
  *    notice, this list of conditions and the following disclaimer in the
  *    documentation and/or other materials provided with the distribution.
  *
  * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
  * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
  * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
  * ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
  * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
  * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
  * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
  * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
  * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
  * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
  * SUCH DAMAGE.
  *
  * $FreeBSD$
  */
 #include <sys/cdefs.h>
 __FBSDID("$FreeBSD$");
 
 #include <sys/param.h>
 #include <sys/conf.h>
 #include <sys/kernel.h>
 #include <sys/smp.h>
 #include <sys/systm.h>
 
 #include <machine/hwfunc.h>
 #include <machine/md_var.h>
 #include <machine/smp.h>
 
 #include <mips/cavium/octeon_pcmap_regs.h>
 
 #include <contrib/octeon-sdk/cvmx.h>
 #include <contrib/octeon-sdk/cvmx-interrupt.h>
 
 /* XXX */
 extern cvmx_bootinfo_t *octeon_bootinfo;
 
 unsigned octeon_ap_boot = ~0;
 
 void
 platform_ipi_send(int cpuid)
 {
 	cvmx_write_csr(CVMX_CIU_MBOX_SETX(cpuid), 1);
 	mips_wbflush();
 }
 
 void
 platform_ipi_clear(void)
 {
 	uint64_t action;
 
 	action = cvmx_read_csr(CVMX_CIU_MBOX_CLRX(PCPU_GET(cpuid)));
 	KASSERT(action == 1, ("unexpected IPIs: %#jx", (uintmax_t)action));
 	cvmx_write_csr(CVMX_CIU_MBOX_CLRX(PCPU_GET(cpuid)), action);
 }
 
 int
 platform_ipi_intrnum(void)
 {
 	return (1);
 }
 
 void
 platform_init_ap(int cpuid)
 {
 	unsigned ciu_int_mask, clock_int_mask, ipi_int_mask;
 
 	/*
 	 * Set the exception base.
 	 */
 	mips_wr_ebase(0x80000000);
 
 	/*
 	 * Clear any pending IPIs.
 	 */
 	cvmx_write_csr(CVMX_CIU_MBOX_CLRX(cpuid), 0xffffffff);
 
 	/*
 	 * Set up interrupts.
 	 */
 	octeon_ciu_reset();
 
 	/*
 	 * Unmask the clock, ipi and ciu interrupts.
 	 */
 	ciu_int_mask = hard_int_mask(0);
 	clock_int_mask = hard_int_mask(5);
 	ipi_int_mask = hard_int_mask(platform_ipi_intrnum());
 	set_intr_mask(ciu_int_mask | clock_int_mask | ipi_int_mask);
 
 	mips_wbflush();
 }
 
-cpumask_t
-platform_cpu_mask(void)
+void
+platform_cpu_mask(cpuset_t *mask)
 {
-       return (octeon_bootinfo->core_mask);
+
+	CPU_ZERO(mask);
+
+	/*
+	 * XXX: Hack to simplify building the CPU set; this assumes that
+	 * core_mask is 32 bits wide.
+	 */
+	memcpy(mask, &octeon_bootinfo->core_mask,
+	    sizeof(octeon_bootinfo->core_mask));
 }
 
 struct cpu_group *
 platform_smp_topo(void)
 {
 	return (smp_topo_none());
 }
 
 int
 platform_start_ap(int cpuid)
 {
 	if (atomic_cmpset_32(&octeon_ap_boot, ~0, cpuid) == 0)
 		return (-1);
 	for (;;) {
 		DELAY(1000);
 		if (atomic_cmpset_32(&octeon_ap_boot, 0, ~0) != 0)
 			return (0);
 		printf("Waiting for cpu%d to start\n", cpuid);
 	}
 }
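
Editor's note on platform_cpu_mask() above: the memcpy() only gives the right result when the bytes of core_mask line up with the low-order bits of the first word of the set, which is the assumption the XXX comment flags. A bit-by-bit translation depends on neither byte order nor word width. The following is a minimal userland sketch of that alternative, not the committed code; toy_cpuset_t and every toy_* helper are hypothetical stand-ins for cpuset_t, CPU_ZERO, CPU_SET and CPU_ISSET.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the kernel's cpuset_t: an array of longs. */
#define	TOY_MAXCPU	64
#define	TOY_WORD_BITS	(sizeof(unsigned long) * CHAR_BIT)
#define	TOY_NWORDS	((TOY_MAXCPU + TOY_WORD_BITS - 1) / TOY_WORD_BITS)

typedef struct {
	unsigned long bits[TOY_NWORDS];
} toy_cpuset_t;

/* Clear every bit in the set (plays the role of CPU_ZERO). */
static void
toy_cpu_zero(toy_cpuset_t *set)
{
	memset(set, 0, sizeof(*set));
}

/* Mark one CPU as present (plays the role of CPU_SET). */
static void
toy_cpu_set(int cpu, toy_cpuset_t *set)
{
	assert(cpu >= 0 && cpu < TOY_MAXCPU);
	set->bits[cpu / TOY_WORD_BITS] |= 1UL << (cpu % TOY_WORD_BITS);
}

/* Test one CPU (plays the role of CPU_ISSET). */
static int
toy_cpu_isset(int cpu, const toy_cpuset_t *set)
{
	assert(cpu >= 0 && cpu < TOY_MAXCPU);
	return ((set->bits[cpu / TOY_WORD_BITS] >>
	    (cpu % TOY_WORD_BITS)) & 1UL);
}

/*
 * Translate a 32-bit core mask into the set one bit at a time, so the
 * result does not depend on how the mask or the set is laid out in memory.
 */
static void
toy_mask_to_cpuset(uint32_t core_mask, toy_cpuset_t *set)
{
	int cpu;

	toy_cpu_zero(set);
	for (cpu = 0; cpu < 32; cpu++) {
		if ((core_mask & ((uint32_t)1 << cpu)) != 0)
			toy_cpu_set(cpu, set);
	}
}
```

The loop costs 32 iterations at boot, in exchange for dropping the layout assumption entirely.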
Index: head/sys/mips/include/_types.h
===================================================================
--- head/sys/mips/include/_types.h	(revision 222812)
+++ head/sys/mips/include/_types.h	(revision 222813)
@@ -1,164 +1,163 @@
 /*-
  * Copyright (c) 2002 Mike Barcroft <mike@FreeBSD.org>
  * Copyright (c) 1990, 1993
  *	The Regents of the University of California.  All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
  * 1. Redistributions of source code must retain the above copyright
  *    notice, this list of conditions and the following disclaimer.
  * 2. Redistributions in binary form must reproduce the above copyright
  *    notice, this list of conditions and the following disclaimer in the
  *    documentation and/or other materials provided with the distribution.
  * 3. All advertising materials mentioning features or use of this software
  *    must display the following acknowledgement:
  *	This product includes software developed by the University of
  *	California, Berkeley and its contributors.
  * 4. Neither the name of the University nor the names of its contributors
  *    may be used to endorse or promote products derived from this software
  *    without specific prior written permission.
  *
  * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
  * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
  * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
  * ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
  * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
  * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
  * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
  * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
  * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
  * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
  * SUCH DAMAGE.
  *
  *	From: @(#)ansi.h	8.2 (Berkeley) 1/4/94
  *	From: @(#)types.h	8.3 (Berkeley) 1/5/94
  *	from: src/sys/i386/include/_types.h,v 1.12 2005/07/02 23:13:31 thompsa
  * $FreeBSD$
  */
 
 #ifndef _MACHINE__TYPES_H_
 #define	_MACHINE__TYPES_H_
 
 #ifndef _SYS_CDEFS_H_
 #error this file needs sys/cdefs.h as a prerequisite
 #endif
 
 /*
  * Basic types upon which most other types are built.
  */
 typedef	__signed char		__int8_t;
 typedef	unsigned char		__uint8_t;
 typedef	short			__int16_t;
 typedef	unsigned short		__uint16_t;
 typedef	int			__int32_t;
 typedef	unsigned int		__uint32_t;
 #ifdef __mips_n64
 typedef	long			__int64_t;
 typedef	unsigned long		__uint64_t;
 #else
 #ifndef lint
 __extension__
 #endif
 /* LONGLONG */
 typedef	long long		__int64_t;
 #ifndef lint
 __extension__
 #endif
 /* LONGLONG */
 typedef	unsigned long long	__uint64_t;
 #endif
 
 /*
  * Standard type definitions.
  */
 typedef	__int32_t	__clock_t;		/* clock()... */
-typedef	unsigned int	__cpumask_t;
 typedef	double		__double_t;
 typedef	double		__float_t;
 #ifdef __mips_n64
 typedef	__int64_t	__critical_t;
 typedef	__int64_t	__intfptr_t;
 typedef	__int64_t	__intptr_t;
 #else
 typedef	__int32_t	__critical_t;
 typedef	__int32_t	__intfptr_t;
 typedef	__int32_t	__intptr_t;
 #endif
 typedef	__int64_t	__intmax_t;
 typedef	__int32_t	__int_fast8_t;
 typedef	__int32_t	__int_fast16_t;
 typedef	__int32_t	__int_fast32_t;
 typedef	__int64_t	__int_fast64_t;
 typedef	__int8_t	__int_least8_t;
 typedef	__int16_t	__int_least16_t;
 typedef	__int32_t	__int_least32_t;
 typedef	__int64_t	__int_least64_t;
 #if defined(__mips_n64) || defined(__mips_n32)
 typedef	__int64_t	__register_t;
 typedef	__int64_t	f_register_t;
 #else
 typedef	__int32_t	__register_t;
 typedef	__int32_t	f_register_t;
 #endif
 #ifdef __mips_n64
 typedef	__int64_t	__ptrdiff_t;
 typedef	__int64_t	__segsz_t;
 typedef	__uint64_t	__size_t;
 typedef	__int64_t	__ssize_t;
 typedef	__uint64_t	__uintfptr_t;
 typedef	__uint64_t	__uintptr_t;
 #else
 typedef	__int32_t	__ptrdiff_t;		/* ptr1 - ptr2 */
 typedef	__int32_t	__segsz_t;		/* segment size (in pages) */
 typedef	__uint32_t	__size_t;		/* sizeof() */
 typedef	__int32_t	__ssize_t;		/* byte count or error */
 typedef	__uint32_t	__uintfptr_t;
 typedef	__uint32_t	__uintptr_t;
 #endif
 typedef	__int64_t	__time_t;		/* time()... */
 typedef	__uint64_t	__uintmax_t;
 typedef	__uint32_t	__uint_fast8_t;
 typedef	__uint32_t	__uint_fast16_t;
 typedef	__uint32_t	__uint_fast32_t;
 typedef	__uint64_t	__uint_fast64_t;
 typedef	__uint8_t	__uint_least8_t;
 typedef	__uint16_t	__uint_least16_t;
 typedef	__uint32_t	__uint_least32_t;
 typedef	__uint64_t	__uint_least64_t;
 #if defined(__mips_n64) || defined(__mips_n32)
 typedef	__uint64_t	__u_register_t;
 #else
 typedef	__uint32_t	__u_register_t;
 #endif
 #ifdef __mips_n64
 typedef	__uint64_t	__vm_offset_t;
 typedef	__uint64_t	__vm_size_t;
 #else
 typedef	__uint32_t	__vm_offset_t;
 typedef	__uint32_t	__vm_size_t;
 #endif
 #if defined(__mips_n64) || defined(__mips_n32) /* PHYSADDR_64_BIT */
 typedef	__uint64_t	__vm_paddr_t;
 #else
 typedef	__uint32_t	__vm_paddr_t;
 #endif
 
 typedef	__int64_t	__vm_ooffset_t;
 typedef	__uint64_t	__vm_pindex_t;
 
 /*
  * Unusual type definitions.
  */
 #ifdef __GNUCLIKE_BUILTIN_VARARGS
 typedef __builtin_va_list	__va_list;	/* internally known to gcc */
 #else
 typedef	char *			__va_list;
 #endif /* __GNUCLIKE_BUILTIN_VARARGS */
 #if defined(__GNUC_VA_LIST_COMPATIBILITY) && !defined(__GNUC_VA_LIST) \
     && !defined(__NO_GNUC_VA_LIST)
 #define	__GNUC_VA_LIST
 typedef __va_list		__gnuc_va_list;	/* compatibility w/GNU headers*/
 #endif
 
 #endif /* !_MACHINE__TYPES_H_ */
Index: head/sys/mips/include/hwfunc.h
===================================================================
--- head/sys/mips/include/hwfunc.h	(revision 222812)
+++ head/sys/mips/include/hwfunc.h	(revision 222813)
@@ -1,103 +1,105 @@
 /*-
  * Copyright (c) 2003-2004 Juli Mallett.  All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
  * 1. Redistributions of source code must retain the above copyright
  *    notice, this list of conditions and the following disclaimer.
  * 2. Redistributions in binary form must reproduce the above copyright
  *    notice, this list of conditions and the following disclaimer in the
  *    documentation and/or other materials provided with the distribution.
  *
  * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
  * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
  * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
  * ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
  * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
  * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
  * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
  * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
  * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
  * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
  * SUCH DAMAGE.
  *
  * $FreeBSD$
  */
 
 #ifndef _MACHINE_HWFUNC_H_
 #define	_MACHINE_HWFUNC_H_
 
+#include <sys/_cpuset.h>
+
 struct trapframe;
 struct timecounter;
 /*
  * Hooks downward into hardware functionality.
  */
 
 void platform_halt(void);
 void platform_intr(struct trapframe *);
 void platform_reset(void);
 void platform_start(__register_t, __register_t,  __register_t, __register_t);
 
 /* For clocks and ticks and such */
 void platform_initclocks(void);
 uint64_t platform_get_frequency(void);
 unsigned platform_get_timecount(struct timecounter *);
 
 /* For hardware specific CPU initialization */
 void platform_cpu_init(void);
 void platform_secondary_init(void);
 
 #ifdef SMP
 
 /*
  * Spin up the AP so that it starts executing MP bootstrap entry point: mpentry
  *
  * Returns 0 on success and non-zero on failure.
  */
 int platform_start_ap(int processor_id);
 
 /*
  * Platform-specific initialization that needs to be done when an AP starts
  * running. This function is called from the MP bootstrap code in mpboot.S
  */
 void platform_init_ap(int processor_id);
 
 /*
  * Return a platform-specific interrupt number that is used to deliver IPIs.
  *
  * This hardware interrupt is used to deliver IPIs exclusively and must
  * not be used for any other interrupt source.
  */
 int platform_ipi_intrnum(void);
 
 /*
  * Trigger an IPI interrupt on 'cpuid'.
  */
 void platform_ipi_send(int cpuid);
 
 /*
  * Quiesce the IPI interrupt source on the current cpu.
  */
 void platform_ipi_clear(void);
 
 /*
  * Return the processor id.
  *
  * Note that this function is called in early boot when stack is not available.
  */
 extern int platform_processor_id(void);
 
 /*
  * Return the cpumask of available processors.
  */
-extern cpumask_t platform_cpu_mask(void);
+extern void platform_cpu_mask(cpuset_t *mask);
 
 /*
  * Return the topology of processors on this platform
  */
 struct cpu_group *platform_smp_topo(void);
 
 
 #endif	/* SMP */
 #endif /* !_MACHINE_HWFUNC_H_ */
Index: head/sys/mips/include/pmap.h
===================================================================
--- head/sys/mips/include/pmap.h	(revision 222812)
+++ head/sys/mips/include/pmap.h	(revision 222813)
@@ -1,175 +1,176 @@
 /*-
  * Copyright (c) 1991 Regents of the University of California.
  * All rights reserved.
  *
  * This code is derived from software contributed to Berkeley by
  * the Systems Programming Group of the University of Utah Computer
  * Science Department and William Jolitz of UUNET Technologies Inc.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
  * 1. Redistributions of source code must retain the above copyright
  *    notice, this list of conditions and the following disclaimer.
  * 2. Redistributions in binary form must reproduce the above copyright
  *    notice, this list of conditions and the following disclaimer in the
  *    documentation and/or other materials provided with the distribution.
  * 4. Neither the name of the University nor the names of its contributors
  *    may be used to endorse or promote products derived from this software
  *    without specific prior written permission.
  *
  * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
  * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
  * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
  * ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
  * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
  * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
  * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
  * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
  * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
  * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
  * SUCH DAMAGE.
  *
  * Derived from hp300 version by Mike Hibler, this version by William
  * Jolitz uses a recursive map [a pde points to the page directory] to
  * map the page tables using the pagetables themselves. This is done to
  * reduce the impact on kernel virtual memory for lots of sparse address
  * space, and to reduce the cost of memory to each process.
  *
  *	from: hp300: @(#)pmap.h 7.2 (Berkeley) 12/16/90
  *	from: @(#)pmap.h	7.4 (Berkeley) 5/12/91
  *	from: src/sys/i386/include/pmap.h,v 1.65.2.2 2000/11/30 01:54:42 peter
  *	JNPR: pmap.h,v 1.7.2.1 2007/09/10 07:44:12 girish
  *      $FreeBSD$
  */
 
 #ifndef _MACHINE_PMAP_H_
 #define	_MACHINE_PMAP_H_
 
 #include <machine/vmparam.h>
 #include <machine/pte.h>
 
 #if defined(__mips_n32) || defined(__mips_n64) /* PHYSADDR_64BIT */
 #define	NKPT		256	/* mem > 4G, vm_page_startup needs more KPTs */
 #else
 #define	NKPT		120	/* actual number of kernel page tables */
 #endif
 
 #ifndef LOCORE
 
 #include <sys/queue.h>
+#include <sys/_cpuset.h>
 #include <sys/_lock.h>
 #include <sys/_mutex.h>
 
 /*
  * Pmap stuff
  */
 struct pv_entry;
 
 struct md_page {
 	int pv_list_count;
 	int pv_flags;
 	TAILQ_HEAD(, pv_entry) pv_list;
 };
 
 #define	PV_TABLE_MOD		0x01	/* modified */
 #define	PV_TABLE_REF		0x02	/* referenced */
 
 #define	ASID_BITS		8
 #define	ASIDGEN_BITS		(32 - ASID_BITS)
 #define	ASIDGEN_MASK		((1 << ASIDGEN_BITS) - 1)
 
 struct pmap {
 	pd_entry_t *pm_segtab;	/* KVA of segment table */
 	TAILQ_HEAD(, pv_entry) pm_pvlist;	/* list of mappings in
 						 * pmap */
-	cpumask_t	pm_active;		/* active on cpus */
+	cpuset_t	pm_active;		/* active on cpus */
 	struct {
 		u_int32_t asid:ASID_BITS;	/* TLB address space tag */
 		u_int32_t gen:ASIDGEN_BITS;	/* its generation number */
 	}      pm_asid[MAXSMPCPU];
 	struct pmap_statistics pm_stats;	/* pmap statistics */
 	struct vm_page *pm_ptphint;	/* pmap ptp hint */
 	struct mtx pm_mtx;
 };
 
 typedef struct pmap *pmap_t;
 
 #ifdef	_KERNEL
 
 pt_entry_t *pmap_pte(pmap_t, vm_offset_t);
 vm_offset_t pmap_kextract(vm_offset_t va);
 
 #define	vtophys(va)	pmap_kextract(((vm_offset_t) (va)))
 #define	pmap_asid(pmap)	(pmap)->pm_asid[PCPU_GET(cpuid)].asid
 
 extern struct pmap	kernel_pmap_store;
 #define kernel_pmap	(&kernel_pmap_store)
 
 #define	PMAP_LOCK(pmap)		mtx_lock(&(pmap)->pm_mtx)
 #define	PMAP_LOCK_ASSERT(pmap, type)	mtx_assert(&(pmap)->pm_mtx, (type))
 #define	PMAP_LOCK_DESTROY(pmap) mtx_destroy(&(pmap)->pm_mtx)
 #define	PMAP_LOCK_INIT(pmap)	mtx_init(&(pmap)->pm_mtx, "pmap", \
 				    NULL, MTX_DEF)
 #define	PMAP_LOCKED(pmap)	mtx_owned(&(pmap)->pm_mtx)
 #define	PMAP_MTX(pmap)		(&(pmap)->pm_mtx)
 #define	PMAP_TRYLOCK(pmap)	mtx_trylock(&(pmap)->pm_mtx)
 #define	PMAP_UNLOCK(pmap)	mtx_unlock(&(pmap)->pm_mtx)
 
 /*
  * For each vm_page_t, there is a list of all currently valid virtual
  * mappings of that page.  An entry is a pv_entry_t, the list is pv_table.
  */
 typedef struct pv_entry {
 	pmap_t pv_pmap;		/* pmap where mapping lies */
 	vm_offset_t pv_va;	/* virtual address for mapping */
 	TAILQ_ENTRY(pv_entry) pv_list;
 	TAILQ_ENTRY(pv_entry) pv_plist;
 	vm_page_t pv_ptem;	/* VM page for pte */
 }       *pv_entry_t;
 
 /*
  * physmem_desc[] is a superset of phys_avail[] and describes all the
  * memory present in the system.
  *
  * phys_avail[] is similar but does not include the memory stolen by
  * pmap_steal_memory().
  *
  * Each memory region is described by a pair of elements in the array
  * so we can describe up to (PHYS_AVAIL_ENTRIES / 2) distinct memory
  * regions.
  */
 #define	PHYS_AVAIL_ENTRIES	10
 extern vm_paddr_t phys_avail[PHYS_AVAIL_ENTRIES + 2];
 extern vm_paddr_t physmem_desc[PHYS_AVAIL_ENTRIES + 2];
 
 extern vm_offset_t virtual_avail;
 extern vm_offset_t virtual_end;
 
 extern vm_paddr_t dump_avail[PHYS_AVAIL_ENTRIES + 2];
 
 #define	pmap_page_get_memattr(m)	VM_MEMATTR_DEFAULT
 #define	pmap_page_is_mapped(m)	(!TAILQ_EMPTY(&(m)->md.pv_list))
 #define	pmap_page_set_memattr(m, ma)	(void)0
 
 void pmap_bootstrap(void);
 void *pmap_mapdev(vm_paddr_t, vm_size_t);
 void pmap_unmapdev(vm_offset_t, vm_size_t);
 vm_offset_t pmap_steal_memory(vm_size_t size);
 int page_is_managed(vm_paddr_t pa);
 void pmap_kenter(vm_offset_t va, vm_paddr_t pa);
 void pmap_kenter_attr(vm_offset_t va, vm_paddr_t pa, int attr);
 void pmap_kremove(vm_offset_t va);
 void *pmap_kenter_temporary(vm_paddr_t pa, int i);
 void pmap_kenter_temporary_free(vm_paddr_t pa);
 int pmap_compute_pages_to_dump(void);
 void pmap_flush_pvcache(vm_page_t m);
 int pmap_emulate_modified(pmap_t pmap, vm_offset_t va);
 void pmap_grow_direct_page_cache(void);
 vm_page_t pmap_alloc_direct_page(unsigned int index, int req);
 
 #endif				/* _KERNEL */
 
 #endif				/* !LOCORE */
 
 #endif				/* !_MACHINE_PMAP_H_ */
Index: head/sys/mips/include/smp.h
===================================================================
--- head/sys/mips/include/smp.h	(revision 222812)
+++ head/sys/mips/include/smp.h	(revision 222813)
@@ -1,45 +1,47 @@
 /*-
  * ----------------------------------------------------------------------------
  * "THE BEER-WARE LICENSE" (Revision 42):
  * <phk@FreeBSD.org> wrote this file.  As long as you retain this notice you
  * can do whatever you want with this stuff. If we meet some day, and you think
  * this stuff is worth it, you can buy me a beer in return.   Poul-Henning Kamp
  * ----------------------------------------------------------------------------
  *
  *	from: src/sys/alpha/include/smp.h,v 1.8 2005/01/05 20:05:50 imp
  *	JNPR: smp.h,v 1.3 2006/12/02 09:53:41 katta
  * $FreeBSD$
  *
  */
 
 #ifndef _MACHINE_SMP_H_
 #define	_MACHINE_SMP_H_
 
 #ifdef _KERNEL
 
+#include <sys/_cpuset.h>
+
 #include <machine/pcb.h>
 
 /*
  * Interprocessor interrupts for SMP.
  */
 #define	IPI_RENDEZVOUS		0x0002
 #define	IPI_AST			0x0004
 #define	IPI_STOP		0x0008
 #define	IPI_STOP_HARD		0x0008
 #define	IPI_PREEMPT		0x0010
 #define	IPI_HARDCLOCK		0x0020
 
 #ifndef LOCORE
 
 void	ipi_all_but_self(int ipi);
 void	ipi_cpu(int cpu, u_int ipi);
-void	ipi_selected(cpumask_t cpus, int ipi);
+void	ipi_selected(cpuset_t cpus, int ipi);
 void	smp_init_secondary(u_int32_t cpuid);
 void	mpentry(void);
 
 extern struct pcb stoppcbs[];
 
 #endif /* !LOCORE */
 #endif /* _KERNEL */
 
 #endif /* _MACHINE_SMP_H_ */
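
Editor's note on the ipi_selected() prototype above: once the argument is a set rather than an integer, the old test `(cpus & pc->pc_cpumask) != 0` has to become a word-by-word intersection test, which is the role CPU_OVERLAP plays in the mp_machdep.c hunk. The sketch below shows what such a test could look like in plain C. It is an illustration under assumptions, not the kernel macro; toy_cpuset_t and the toy_* helpers are hypothetical names.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for cpuset_t, sized to span more than one word. */
#define	TOY_MAXCPU	128
#define	TOY_WORD_BITS	(sizeof(unsigned long) * CHAR_BIT)
#define	TOY_NWORDS	((TOY_MAXCPU + TOY_WORD_BITS - 1) / TOY_WORD_BITS)

typedef struct {
	unsigned long bits[TOY_NWORDS];
} toy_cpuset_t;

/* Clear every bit in the set. */
static void
toy_cpu_zero(toy_cpuset_t *set)
{
	memset(set, 0, sizeof(*set));
}

/* Mark one CPU as a member of the set. */
static void
toy_cpu_set(int cpu, toy_cpuset_t *set)
{
	assert(cpu >= 0 && cpu < TOY_MAXCPU);
	set->bits[cpu / TOY_WORD_BITS] |= 1UL << (cpu % TOY_WORD_BITS);
}

/*
 * Return non-zero if the two sets share at least one CPU.  This is the
 * multi-word equivalent of "(a & b) != 0" on a plain integer mask.
 */
static int
toy_cpu_overlap(const toy_cpuset_t *a, const toy_cpuset_t *b)
{
	size_t i;

	for (i = 0; i < TOY_NWORDS; i++) {
		if ((a->bits[i] & b->bits[i]) != 0)
			return (1);
	}
	return (0);
}
```

Passing the sets by pointer, as the helper does here, avoids copying the whole array for each per-CPU comparison.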
Index: head/sys/mips/mips/mp_machdep.c
===================================================================
--- head/sys/mips/mips/mp_machdep.c	(revision 222812)
+++ head/sys/mips/mips/mp_machdep.c	(revision 222813)
@@ -1,352 +1,368 @@
 /*-
  * Copyright (c) 2009 Neelkanth Natu
  * All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
  * 1. Redistributions of source code must retain the above copyright
  *    notice, this list of conditions and the following disclaimer.
  * 2. Redistributions in binary form must reproduce the above copyright
  *    notice, this list of conditions and the following disclaimer in the
  *    documentation and/or other materials provided with the distribution.
  *
  * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
  * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
  * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
  * ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
  * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
  * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
  * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
  * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
  * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
  * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
  * SUCH DAMAGE.
  */
 
 #include <sys/cdefs.h>
 __FBSDID("$FreeBSD$");
 
 #include <sys/param.h>
 #include <sys/systm.h>
+#include <sys/cpuset.h>
 #include <sys/ktr.h>
 #include <sys/proc.h>
 #include <sys/lock.h>
 #include <sys/mutex.h>
 #include <sys/kernel.h>
 #include <sys/pcpu.h>
 #include <sys/smp.h>
 #include <sys/sched.h>
 #include <sys/bus.h>
 
 #include <vm/vm.h>
 #include <vm/pmap.h>
 #include <vm/vm_extern.h>
 #include <vm/vm_kern.h>
 
 #include <machine/clock.h>
 #include <machine/smp.h>
 #include <machine/hwfunc.h>
 #include <machine/intr_machdep.h>
 #include <machine/cache.h>
 #include <machine/tlb.h>
 
 struct pcb stoppcbs[MAXCPU];
 
 static void *dpcpu;
 static struct mtx ap_boot_mtx;
 
 static volatile int aps_ready;
 static volatile int mp_naps;
 
 static void
 ipi_send(struct pcpu *pc, int ipi)
 {
 
 	CTR3(KTR_SMP, "%s: cpu=%d, ipi=%x", __func__, pc->pc_cpuid, ipi);
 
 	atomic_set_32(&pc->pc_pending_ipis, ipi);
 	platform_ipi_send(pc->pc_cpuid);
 
 	CTR1(KTR_SMP, "%s: sent", __func__);
 }
 
 void
 ipi_all_but_self(int ipi)
 {
 
 	ipi_selected(PCPU_GET(other_cpus), ipi);
 }
 
 /* Send an IPI to a set of cpus. */
 void
-ipi_selected(cpumask_t cpus, int ipi)
+ipi_selected(cpuset_t cpus, int ipi)
 {
 	struct pcpu *pc;
 
-	CTR3(KTR_SMP, "%s: cpus: %x, ipi: %x\n", __func__, cpus, ipi);
-
 	STAILQ_FOREACH(pc, &cpuhead, pc_allcpu) {
-		if ((cpus & pc->pc_cpumask) != 0)
+		if (CPU_OVERLAP(&cpus, &pc->pc_cpumask)) {
+			CTR3(KTR_SMP, "%s: pc: %p, ipi: %x\n", __func__, pc,
+			    ipi);
 			ipi_send(pc, ipi);
+		}
 	}
 }
 
 /* Send an IPI to a specific CPU. */
 void
 ipi_cpu(int cpu, u_int ipi)
 {
 
 	CTR3(KTR_SMP, "%s: cpu: %d, ipi: %x\n", __func__, cpu, ipi);
 	ipi_send(cpuid_to_pcpu[cpu], ipi);
 }
 
 /*
  * Handle an IPI sent to this processor.
  */
 static int
 mips_ipi_handler(void *arg)
 {
 	int cpu;
-	cpumask_t cpumask;
+	cpuset_t cpumask;
 	u_int	ipi, ipi_bitmap;
 	int	bit;
 
 	cpu = PCPU_GET(cpuid);
 	cpumask = PCPU_GET(cpumask);
 
 	platform_ipi_clear();	/* quiesce the pending ipi interrupt */
 
 	ipi_bitmap = atomic_readandclear_int(PCPU_PTR(pending_ipis));
 	if (ipi_bitmap == 0)
 		return (FILTER_STRAY);
 
 	CTR1(KTR_SMP, "smp_handle_ipi(), ipi_bitmap=%x", ipi_bitmap);
 
 	while ((bit = ffs(ipi_bitmap))) {
 		bit = bit - 1;
 		ipi = 1 << bit;
 		ipi_bitmap &= ~ipi;
 		switch (ipi) {
 		case IPI_RENDEZVOUS:
 			CTR0(KTR_SMP, "IPI_RENDEZVOUS");
 			smp_rendezvous_action();
 			break;
 
 		case IPI_AST:
 			CTR0(KTR_SMP, "IPI_AST");
 			break;
 
 		case IPI_STOP:
 			/*
 			 * IPI_STOP_HARD is mapped to IPI_STOP so it is not
 			 * necessary to add it in the switch.
 			 */
 			CTR0(KTR_SMP, "IPI_STOP or IPI_STOP_HARD");
 
 			savectx(&stoppcbs[cpu]);
 			tlb_save();
 
 			/* Indicate we are stopped */
-			atomic_set_int(&stopped_cpus, cpumask);
+			CPU_OR_ATOMIC(&stopped_cpus, &cpumask);
 
 			/* Wait for restart */
-			while ((started_cpus & cpumask) == 0)
+			while (!CPU_OVERLAP(&started_cpus, &cpumask))
 				cpu_spinwait();
 
-			atomic_clear_int(&started_cpus, cpumask);
-			atomic_clear_int(&stopped_cpus, cpumask);
+			CPU_NAND_ATOMIC(&started_cpus, &cpumask);
+			CPU_NAND_ATOMIC(&stopped_cpus, &cpumask);
 			CTR0(KTR_SMP, "IPI_STOP (restart)");
 			break;
 		case IPI_PREEMPT:
 			CTR1(KTR_SMP, "%s: IPI_PREEMPT", __func__);
 			sched_preempt(curthread);
 			break;
 		case IPI_HARDCLOCK:
 			CTR1(KTR_SMP, "%s: IPI_HARDCLOCK", __func__);
 			hardclockintr();
 			break;
 		default:
 			panic("Unknown IPI 0x%0x on cpu %d", ipi, curcpu);
 		}
 	}
 
 	return (FILTER_HANDLED);
 }
 
 static int
 start_ap(int cpuid)
 {
 	int cpus, ms;
 
 	cpus = mp_naps;
 	dpcpu = (void *)kmem_alloc(kernel_map, DPCPU_SIZE);
 
 	mips_sync();
 
 	if (platform_start_ap(cpuid) != 0)
 		return (-1);			/* could not start AP */
 
 	for (ms = 0; ms < 5000; ++ms) {
 		if (mp_naps > cpus)
 			return (0);		/* success */
 		else
 			DELAY(1000);
 	}
 
 	return (-2);				/* timeout initializing AP */
 }
 
 void
 cpu_mp_setmaxid(void)
 {
-	cpumask_t cpumask;
+	cpuset_t cpumask;
+	int cpu, last;
 
-	cpumask = platform_cpu_mask();
-	mp_ncpus = bitcount32(cpumask);
+	platform_cpu_mask(&cpumask);
+	mp_ncpus = 0;
+	last = 1;
+	while ((cpu = cpusetobj_ffs(&cpumask)) != 0) {
+		last = cpu;
+		cpu--;
+		CPU_CLR(cpu, &cpumask);
+		mp_ncpus++;
+	}
 	if (mp_ncpus <= 0)
 		mp_ncpus = 1;
 
-	mp_maxid = min(fls(cpumask), MAXCPU) - 1;
+	mp_maxid = min(last, MAXCPU) - 1;
 }
 
 void
 cpu_mp_announce(void)
 {
 	/* NOTHING */
 }
 
 struct cpu_group *
 cpu_topo(void)
 {
 	return (platform_smp_topo());
 }
 
 int
 cpu_mp_probe(void)
 {
 
 	return (mp_ncpus > 1);
 }
 
 void
 cpu_mp_start(void)
 {
 	int error, cpuid;
-	cpumask_t cpumask;
+	cpuset_t cpumask, ocpus;
 
 	mtx_init(&ap_boot_mtx, "ap boot", NULL, MTX_SPIN);
 
-	all_cpus = 0;
-	cpumask = platform_cpu_mask();
+	CPU_ZERO(&all_cpus);
+	platform_cpu_mask(&cpumask);
 
-	while (cpumask != 0) {
-		cpuid = ffs(cpumask) - 1;
-		cpumask &= ~(1 << cpuid);
+	while (!CPU_EMPTY(&cpumask)) {
+		cpuid = cpusetobj_ffs(&cpumask) - 1;
+		CPU_CLR(cpuid, &cpumask);
 
 		if (cpuid >= MAXCPU) {
 			printf("cpu_mp_start: ignoring AP #%d.\n", cpuid);
 			continue;
 		}
 
 		if (cpuid != platform_processor_id()) {
 			if ((error = start_ap(cpuid)) != 0) {
 				printf("AP #%d failed to start: %d\n", cpuid, error);
 				continue;
 			}
 			if (bootverbose)
 				printf("AP #%d started!\n", cpuid);
 		}
-		all_cpus |= 1 << cpuid;
+		CPU_SET(cpuid, &all_cpus);
 	}
 
-	PCPU_SET(other_cpus, all_cpus & ~PCPU_GET(cpumask));
+	ocpus = all_cpus;
+	CPU_CLR(PCPU_GET(cpuid), &ocpus);
+	PCPU_SET(other_cpus, ocpus);
 }
 
 void
 smp_init_secondary(u_int32_t cpuid)
 {
+	cpuset_t ocpus;
+
 	/* TLB */
 	mips_wr_wired(0);
 	tlb_invalidate_all();
 	mips_wr_wired(VMWIRED_ENTRIES);
 
 	/*
 	 * We assume that the L1 cache on the APs is identical to the one
 	 * on the BSP.
 	 */
 	mips_dcache_wbinv_all();
 	mips_icache_sync_all();
 
 	mips_sync();
 
 	mips_wr_entryhi(0);
 
 	pcpu_init(PCPU_ADDR(cpuid), cpuid, sizeof(struct pcpu));
 	dpcpu_init(dpcpu, cpuid);
 
 	/* The AP has initialized successfully - allow the BSP to proceed */
 	++mp_naps;
 
 	/* Spin until the BSP is ready to release the APs */
 	while (!aps_ready)
 		;
 
 	/* Initialize curthread. */
 	KASSERT(PCPU_GET(idlethread) != NULL, ("no idle thread"));
 	PCPU_SET(curthread, PCPU_GET(idlethread));
 
 	mtx_lock_spin(&ap_boot_mtx);
 
 	smp_cpus++;
 
 	CTR1(KTR_SMP, "SMP: AP CPU #%d launched", PCPU_GET(cpuid));
 
 	/* Build our map of 'other' CPUs. */
-	PCPU_SET(other_cpus, all_cpus & ~PCPU_GET(cpumask));
+	ocpus = all_cpus;
+	CPU_CLR(PCPU_GET(cpuid), &ocpus);
+	PCPU_SET(other_cpus, ocpus);
 
 	if (bootverbose)
 		printf("SMP: AP CPU #%d launched.\n", PCPU_GET(cpuid));
 
 	if (smp_cpus == mp_ncpus) {
 		atomic_store_rel_int(&smp_started, 1);
 		smp_active = 1;
 	}
 
 	mtx_unlock_spin(&ap_boot_mtx);
 
 	while (smp_started == 0)
 		; /* nothing */
 
 	/* Start per-CPU event timers. */
 	cpu_initclocks_ap();
 
 	/* enter the scheduler */
 	sched_throw(NULL);
 
 	panic("scheduler returned us to %s", __func__);
 	/* NOTREACHED */
 }
 
 static void
 release_aps(void *dummy __unused)
 {
 	int ipi_irq;
 
 	if (mp_ncpus == 1)
 		return;
 
 	/*
 	 * IPI handler
 	 */
 	ipi_irq = platform_ipi_intrnum();
 	cpu_establish_hardintr("ipi", mips_ipi_handler, NULL, NULL, ipi_irq,
 			       INTR_TYPE_MISC | INTR_EXCL, NULL);
 
 	atomic_store_rel_int(&aps_ready, 1);
 
 	while (smp_started == 0)
 		; /* nothing */
 }
 
 SYSINIT(start_aps, SI_SUB_SMP, SI_ORDER_FIRST, release_aps, NULL);
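
Editor's note on cpu_mp_setmaxid() above: with no single-integer mask left, bitcount32() and fls() are replaced by a loop that repeatedly finds the lowest set bit, clears it, and counts. The last bit found is the highest one, which yields the maximum CPU id. The sketch below reproduces that loop in userland C so the arithmetic can be checked in isolation. It assumes a find-first-set helper with the usual ffs() convention (1-based index, 0 for an empty set); toy_cpuset_t, toy_cpuset_ffs() and the other toy_* names are hypothetical stand-ins, not kernel interfaces.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for cpuset_t. */
#define	TOY_MAXCPU	64
#define	TOY_WORD_BITS	(sizeof(unsigned long) * CHAR_BIT)
#define	TOY_NWORDS	((TOY_MAXCPU + TOY_WORD_BITS - 1) / TOY_WORD_BITS)

typedef struct {
	unsigned long bits[TOY_NWORDS];
} toy_cpuset_t;

struct toy_cpu_summary {
	int ncpus;	/* number of CPUs, never less than 1 */
	int maxid;	/* highest CPU id, 0-based */
};

/* Mark one CPU as a member of the set. */
static void
toy_cpu_set(int cpu, toy_cpuset_t *set)
{
	assert(cpu >= 0 && cpu < TOY_MAXCPU);
	set->bits[cpu / TOY_WORD_BITS] |= 1UL << (cpu % TOY_WORD_BITS);
}

/* Remove one CPU from the set. */
static void
toy_cpu_clr(int cpu, toy_cpuset_t *set)
{
	assert(cpu >= 0 && cpu < TOY_MAXCPU);
	set->bits[cpu / TOY_WORD_BITS] &= ~(1UL << (cpu % TOY_WORD_BITS));
}

/* 1-based index of the lowest set bit, or 0 if the set is empty. */
static int
toy_cpuset_ffs(const toy_cpuset_t *set)
{
	size_t w;
	int b;

	for (w = 0; w < TOY_NWORDS; w++) {
		if (set->bits[w] == 0)
			continue;
		for (b = 0; b < (int)TOY_WORD_BITS; b++) {
			if (((set->bits[w] >> b) & 1UL) != 0)
				return ((int)(w * TOY_WORD_BITS) + b + 1);
		}
	}
	return (0);
}

/*
 * Count the CPUs in the set and find the highest id.  The set is passed
 * by value because the loop consumes it, clearing one bit per iteration.
 */
static struct toy_cpu_summary
toy_summarize(toy_cpuset_t set)
{
	struct toy_cpu_summary s;
	int cpu, last;

	s.ncpus = 0;
	last = 1;
	while ((cpu = toy_cpuset_ffs(&set)) != 0) {
		last = cpu;			/* 1-based, ascending */
		toy_cpu_clr(cpu - 1, &set);
		s.ncpus++;
	}
	if (s.ncpus <= 0)
		s.ncpus = 1;			/* empty set: uniprocessor */
	s.maxid = (last < TOY_MAXCPU ? last : TOY_MAXCPU) - 1;
	return (s);
}
```

Initialising `last` to 1 means an empty set still produces a maximum id of 0, matching the uniprocessor fallback for the CPU count.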
Index: head/sys/mips/mips/pmap.c
===================================================================
--- head/sys/mips/mips/pmap.c	(revision 222812)
+++ head/sys/mips/mips/pmap.c	(revision 222813)
@@ -1,3312 +1,3324 @@
 /*
  * Copyright (c) 1991 Regents of the University of California.
  * All rights reserved.
  * Copyright (c) 1994 John S. Dyson
  * All rights reserved.
  * Copyright (c) 1994 David Greenman
  * All rights reserved.
  *
  * This code is derived from software contributed to Berkeley by
  * the Systems Programming Group of the University of Utah Computer
  * Science Department and William Jolitz of UUNET Technologies Inc.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
  * 1. Redistributions of source code must retain the above copyright
  *    notice, this list of conditions and the following disclaimer.
  * 2. Redistributions in binary form must reproduce the above copyright
  *    notice, this list of conditions and the following disclaimer in the
  *    documentation and/or other materials provided with the distribution.
  * 4. Neither the name of the University nor the names of its contributors
  *    may be used to endorse or promote products derived from this software
  *    without specific prior written permission.
  *
  * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
  * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
  * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
  * ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
  * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
  * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
  * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
  * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
  * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
  * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
  * SUCH DAMAGE.
  *
  *	from:	@(#)pmap.c	7.7 (Berkeley)	5/12/91
  *	from: src/sys/i386/i386/pmap.c,v 1.250.2.8 2000/11/21 00:09:14 ps
  *	JNPR: pmap.c,v 1.11.2.1 2007/08/16 11:51:06 girish
  */
 
 /*
  *	Manages physical address maps.
  *
  *	In addition to hardware address maps, this
  *	module is called upon to provide software-use-only
  *	maps which may or may not be stored in the same
  *	form as hardware maps.	These pseudo-maps are
  *	used to store intermediate results from copy
  *	operations to and from address spaces.
  *
  *	Since the information managed by this module is
  *	also stored by the logical address mapping module,
  *	this module may throw away valid virtual-to-physical
  *	mappings at almost any time.  However, invalidations
  *	of virtual-to-physical mappings must be done as
  *	requested.
  *
  *	In order to cope with hardware architectures which
  *	make virtual-to-physical map invalidates expensive,
  *	this module may delay invalidate or reduced protection
  *	operations until such time as they are actually
  *	necessary.  This module is given full information as
  *	to which processors are currently using which maps,
  *	and to when physical maps must be made correct.
  */
 
 #include <sys/cdefs.h>
 __FBSDID("$FreeBSD$");
 
 #include "opt_ddb.h"
 
 #include <sys/param.h>
 #include <sys/systm.h>
 #include <sys/proc.h>
 #include <sys/msgbuf.h>
 #include <sys/vmmeter.h>
 #include <sys/mman.h>
 #include <sys/smp.h>
 #ifdef DDB
 #include <ddb/ddb.h>
 #endif
 
 #include <vm/vm.h>
 #include <vm/vm_param.h>
 #include <vm/vm_phys.h>
 #include <sys/lock.h>
 #include <sys/mutex.h>
 #include <vm/vm_kern.h>
 #include <vm/vm_page.h>
 #include <vm/vm_map.h>
 #include <vm/vm_object.h>
 #include <vm/vm_extern.h>
 #include <vm/vm_pageout.h>
 #include <vm/vm_pager.h>
 #include <vm/uma.h>
 #include <sys/pcpu.h>
 #include <sys/sched.h>
 #ifdef SMP
 #include <sys/smp.h>
 #endif
 
 #include <machine/cache.h>
 #include <machine/md_var.h>
 #include <machine/tlb.h>
 
 #undef PMAP_DEBUG
 
 #ifndef PMAP_SHPGPERPROC
 #define	PMAP_SHPGPERPROC 200
 #endif
 
 #if !defined(DIAGNOSTIC)
 #define	PMAP_INLINE __inline
 #else
 #define	PMAP_INLINE
 #endif
 
 /*
  * Get PDEs and PTEs for user/kernel address space
  */
 #define	pmap_seg_index(v)	(((v) >> SEGSHIFT) & (NPDEPG - 1))
 #define	pmap_pde_index(v)	(((v) >> PDRSHIFT) & (NPDEPG - 1))
 #define	pmap_pte_index(v)	(((v) >> PAGE_SHIFT) & (NPTEPG - 1))
 #define	pmap_pde_pindex(v)	((v) >> PDRSHIFT)
 
 #ifdef __mips_n64
 #define	NUPDE			(NPDEPG * NPDEPG)
 #define	NUSERPGTBLS		(NUPDE + NPDEPG)
 #else
 #define	NUPDE			(NPDEPG)
 #define	NUSERPGTBLS		(NUPDE)
 #endif
 
 #define	is_kernel_pmap(x)	((x) == kernel_pmap)
 
 struct pmap kernel_pmap_store;
 pd_entry_t *kernel_segmap;
 
 vm_offset_t virtual_avail;	/* VA of first avail page (after kernel bss) */
 vm_offset_t virtual_end;	/* VA of last avail page (end of kernel AS) */
 
 static int nkpt;
 unsigned pmap_max_asid;		/* max ASID supported by the system */
 
 #define	PMAP_ASID_RESERVED	0
 
 vm_offset_t kernel_vm_end = VM_MIN_KERNEL_ADDRESS;
 
 static void pmap_asid_alloc(pmap_t pmap);
 
 /*
  * Data for the pv entry allocation mechanism
  */
 static uma_zone_t pvzone;
 static struct vm_object pvzone_obj;
 static int pv_entry_count = 0, pv_entry_max = 0, pv_entry_high_water = 0;
 
 static PMAP_INLINE void free_pv_entry(pv_entry_t pv);
 static pv_entry_t get_pv_entry(pmap_t locked_pmap);
 static void pmap_pvh_free(struct md_page *pvh, pmap_t pmap, vm_offset_t va);
 static pv_entry_t pmap_pvh_remove(struct md_page *pvh, pmap_t pmap,
     vm_offset_t va);
 static __inline void pmap_changebit(vm_page_t m, int bit, boolean_t setem);
 static vm_page_t pmap_enter_quick_locked(pmap_t pmap, vm_offset_t va,
     vm_page_t m, vm_prot_t prot, vm_page_t mpte);
 static int pmap_remove_pte(struct pmap *pmap, pt_entry_t *ptq, vm_offset_t va);
 static void pmap_remove_page(struct pmap *pmap, vm_offset_t va);
 static void pmap_remove_entry(struct pmap *pmap, vm_page_t m, vm_offset_t va);
 static boolean_t pmap_try_insert_pv_entry(pmap_t pmap, vm_page_t mpte,
     vm_offset_t va, vm_page_t m);
 static void pmap_update_page(pmap_t pmap, vm_offset_t va, pt_entry_t pte);
 static void pmap_invalidate_all(pmap_t pmap);
 static void pmap_invalidate_page(pmap_t pmap, vm_offset_t va);
 static int _pmap_unwire_pte_hold(pmap_t pmap, vm_offset_t va, vm_page_t m);
 
 static vm_page_t pmap_allocpte(pmap_t pmap, vm_offset_t va, int flags);
 static vm_page_t _pmap_allocpte(pmap_t pmap, unsigned ptepindex, int flags);
 static int pmap_unuse_pt(pmap_t, vm_offset_t, vm_page_t);
 static pt_entry_t init_pte_prot(vm_offset_t va, vm_page_t m, vm_prot_t prot);
 
 #ifdef SMP
 static void pmap_invalidate_page_action(void *arg);
 static void pmap_invalidate_all_action(void *arg);
 static void pmap_update_page_action(void *arg);
 #endif
 
 #ifndef __mips_n64
 /*
  * This structure is for high memory (memory above 512Meg in 32 bit) support.
  * The highmem area does not have a KSEG0 mapping, and we need a mechanism to
  * do temporary per-CPU mappings for pmap_zero_page, pmap_copy_page etc.
  *
  * At bootup, we reserve 2 virtual pages per CPU for mapping highmem pages. To 
  * access a highmem physical address on a CPU, we map the physical address to
  * the reserved virtual address for the CPU in the kernel pagetable.  This is 
  * done with interrupts disabled (although a spinlock and sched_pin would be
  * sufficient).
  */
 struct local_sysmaps {
 	vm_offset_t	base;
 	uint32_t	saved_intr;
 	uint16_t	valid1, valid2;
 };
 static struct local_sysmaps sysmap_lmem[MAXCPU];
 
 static __inline void
 pmap_alloc_lmem_map(void)
 {
 	int i;
 
 	for (i = 0; i < MAXCPU; i++) {
 		sysmap_lmem[i].base = virtual_avail;
 		virtual_avail += PAGE_SIZE * 2;
 		sysmap_lmem[i].valid1 = sysmap_lmem[i].valid2 = 0;
 	}
 }
 
 static __inline vm_offset_t
 pmap_lmem_map1(vm_paddr_t phys)
 {
 	struct local_sysmaps *sysm;
 	pt_entry_t *pte, npte;
 	vm_offset_t va;
 	uint32_t intr;
 	int cpu;
 
 	intr = intr_disable();
 	cpu = PCPU_GET(cpuid);
 	sysm = &sysmap_lmem[cpu];
 	sysm->saved_intr = intr;
 	va = sysm->base;
 	npte = TLBLO_PA_TO_PFN(phys) |
 	    PTE_D | PTE_V | PTE_G | PTE_W | PTE_C_CACHE;
 	pte = pmap_pte(kernel_pmap, va);
 	*pte = npte;
 	sysm->valid1 = 1;
 	return (va);
 }
 
 static __inline vm_offset_t
 pmap_lmem_map2(vm_paddr_t phys1, vm_paddr_t phys2)
 {
 	struct local_sysmaps *sysm;
 	pt_entry_t *pte, npte;
 	vm_offset_t va1, va2;
 	uint32_t intr;
 	int cpu;
 
 	intr = intr_disable();
 	cpu = PCPU_GET(cpuid);
 	sysm = &sysmap_lmem[cpu];
 	sysm->saved_intr = intr;
 	va1 = sysm->base;
 	va2 = sysm->base + PAGE_SIZE;
 	npte = TLBLO_PA_TO_PFN(phys1) |
 	    PTE_D | PTE_V | PTE_G | PTE_W | PTE_C_CACHE;
 	pte = pmap_pte(kernel_pmap, va1);
 	*pte = npte;
 	npte =  TLBLO_PA_TO_PFN(phys2) |
 	    PTE_D | PTE_V | PTE_G | PTE_W | PTE_C_CACHE;
 	pte = pmap_pte(kernel_pmap, va2);
 	*pte = npte;
 	sysm->valid1 = 1;
 	sysm->valid2 = 1;
 	return (va1);
 }
 
 static __inline void
 pmap_lmem_unmap(void)
 {
 	struct local_sysmaps *sysm;
 	pt_entry_t *pte;
 	int cpu;
 
 	cpu = PCPU_GET(cpuid);
 	sysm = &sysmap_lmem[cpu];
 	pte = pmap_pte(kernel_pmap, sysm->base);
 	*pte = PTE_G;
 	tlb_invalidate_address(kernel_pmap, sysm->base);
 	sysm->valid1 = 0;
 	if (sysm->valid2) {
 		pte = pmap_pte(kernel_pmap, sysm->base + PAGE_SIZE);
 		*pte = PTE_G;
 		tlb_invalidate_address(kernel_pmap, sysm->base + PAGE_SIZE);
 		sysm->valid2 = 0;
 	}
 	intr_restore(sysm->saved_intr);
 }
 #else  /* __mips_n64 */
 
 static __inline void
 pmap_alloc_lmem_map(void)
 {
 }
 
 static __inline vm_offset_t
 pmap_lmem_map1(vm_paddr_t phys)
 {
 
 	return (0);
 }
 
 static __inline vm_offset_t
 pmap_lmem_map2(vm_paddr_t phys1, vm_paddr_t phys2)
 {
 
 	return (0);
 }
 
 static __inline vm_offset_t 
 pmap_lmem_unmap(void)
 {
 
 	return (0);
 }
 #endif /* !__mips_n64 */
 
 /*
  * Page table entry lookup routines.
  */
 static __inline pd_entry_t *
 pmap_segmap(pmap_t pmap, vm_offset_t va)
 {
 
 	return (&pmap->pm_segtab[pmap_seg_index(va)]);
 }
 
 #ifdef __mips_n64
 static __inline pd_entry_t *
 pmap_pdpe_to_pde(pd_entry_t *pdpe, vm_offset_t va)
 {
 	pd_entry_t *pde;
 
 	pde = (pd_entry_t *)*pdpe;
 	return (&pde[pmap_pde_index(va)]);
 }
 
 static __inline pd_entry_t *
 pmap_pde(pmap_t pmap, vm_offset_t va)
 {
 	pd_entry_t *pdpe;
 
 	pdpe = pmap_segmap(pmap, va);
 	if (pdpe == NULL || *pdpe == NULL)
 		return (NULL);
 
 	return (pmap_pdpe_to_pde(pdpe, va));
 }
 #else
 static __inline pd_entry_t *
 pmap_pdpe_to_pde(pd_entry_t *pdpe, vm_offset_t va)
 {
 
 	return (pdpe);
 }
 
 static __inline 
 pd_entry_t *pmap_pde(pmap_t pmap, vm_offset_t va)
 {
 
 	return (pmap_segmap(pmap, va));
 }
 #endif
 
 static __inline pt_entry_t *
 pmap_pde_to_pte(pd_entry_t *pde, vm_offset_t va)
 {
 	pt_entry_t *pte;
 
 	pte = (pt_entry_t *)*pde;
 	return (&pte[pmap_pte_index(va)]);
 }
 
 pt_entry_t *
 pmap_pte(pmap_t pmap, vm_offset_t va)
 {
 	pd_entry_t *pde;
 
 	pde = pmap_pde(pmap, va);
 	if (pde == NULL || *pde == NULL)
 		return (NULL);
 
 	return (pmap_pde_to_pte(pde, va));
 }
 
 vm_offset_t
 pmap_steal_memory(vm_size_t size)
 {
 	vm_paddr_t bank_size, pa;
 	vm_offset_t va;
 
 	size = round_page(size);
 	bank_size = phys_avail[1] - phys_avail[0];
 	while (size > bank_size) {
 		int i;
 
 		for (i = 0; phys_avail[i + 2]; i += 2) {
 			phys_avail[i] = phys_avail[i + 2];
 			phys_avail[i + 1] = phys_avail[i + 3];
 		}
 		phys_avail[i] = 0;
 		phys_avail[i + 1] = 0;
 		if (!phys_avail[0])
 			panic("pmap_steal_memory: out of memory");
 		bank_size = phys_avail[1] - phys_avail[0];
 	}
 
 	pa = phys_avail[0];
 	phys_avail[0] += size;
 	if (MIPS_DIRECT_MAPPABLE(pa) == 0)
 		panic("Out of memory below 512Meg?");
 	va = MIPS_PHYS_TO_DIRECT(pa);
 	bzero((caddr_t)va, size);
 	return (va);
 }
 
 /*
  * Bootstrap the system enough to run with virtual memory.  This
  * assumes that the phys_avail array has been initialized.
  */
 static void 
 pmap_create_kernel_pagetable(void)
 {
 	int i, j;
 	vm_offset_t ptaddr;
 	pt_entry_t *pte;
 #ifdef __mips_n64
 	pd_entry_t *pde;
 	vm_offset_t pdaddr;
 	int npt, npde;
 #endif
 
 	/*
 	 * Allocate segment table for the kernel
 	 */
 	kernel_segmap = (pd_entry_t *)pmap_steal_memory(PAGE_SIZE);
 
 	/*
 	 * Allocate second level page tables for the kernel
 	 */
 #ifdef __mips_n64
 	npde = howmany(NKPT, NPDEPG);
 	pdaddr = pmap_steal_memory(PAGE_SIZE * npde);
 #endif
 	nkpt = NKPT;
 	ptaddr = pmap_steal_memory(PAGE_SIZE * nkpt);
 
 	/*
 	 * The R[4-7]?00 stores only one copy of the Global bit in the
 	 * translation lookaside buffer for each 2 page entry. Thus invalid
 	 * entries must have the Global bit set so when Entry LO and Entry HI
 	 * G bits are anded together they will produce a global bit to store
 	 * in the tlb.
 	 */
 	for (i = 0, pte = (pt_entry_t *)ptaddr; i < (nkpt * NPTEPG); i++, pte++)
 		*pte = PTE_G;
 
 #ifdef __mips_n64
 	for (i = 0,  npt = nkpt; npt > 0; i++) {
 		kernel_segmap[i] = (pd_entry_t)(pdaddr + i * PAGE_SIZE);
 		pde = (pd_entry_t *)kernel_segmap[i];
 
 		for (j = 0; j < NPDEPG && npt > 0; j++, npt--)
 			pde[j] = (pd_entry_t)(ptaddr + (i * NPDEPG + j) * PAGE_SIZE);
 	}
 #else
 	for (i = 0, j = pmap_seg_index(VM_MIN_KERNEL_ADDRESS); i < nkpt; i++, j++)
 		kernel_segmap[j] = (pd_entry_t)(ptaddr + (i * PAGE_SIZE));
 #endif
 
 	PMAP_LOCK_INIT(kernel_pmap);
 	kernel_pmap->pm_segtab = kernel_segmap;
-	kernel_pmap->pm_active = ~0;
+	CPU_FILL(&kernel_pmap->pm_active);
 	TAILQ_INIT(&kernel_pmap->pm_pvlist);
 	kernel_pmap->pm_asid[0].asid = PMAP_ASID_RESERVED;
 	kernel_pmap->pm_asid[0].gen = 0;
 	kernel_vm_end += nkpt * NPTEPG * PAGE_SIZE;
 }
 
 void
 pmap_bootstrap(void)
 {
 	int i;
 	int need_local_mappings = 0; 
 
 	/* Sort. */
 again:
 	for (i = 0; phys_avail[i + 1] != 0; i += 2) {
 		/*
 		 * Keep the memory aligned on page boundary.
 		 */
 		phys_avail[i] = round_page(phys_avail[i]);
 		phys_avail[i + 1] = trunc_page(phys_avail[i + 1]);
 
 		if (i < 2)
 			continue;
 		if (phys_avail[i - 2] > phys_avail[i]) {
 			vm_paddr_t ptemp[2];
 
 			ptemp[0] = phys_avail[i + 0];
 			ptemp[1] = phys_avail[i + 1];
 
 			phys_avail[i + 0] = phys_avail[i - 2];
 			phys_avail[i + 1] = phys_avail[i - 1];
 
 			phys_avail[i - 2] = ptemp[0];
 			phys_avail[i - 1] = ptemp[1];
 			goto again;
 		}
 	}
 
 	/*
 	 * In 32 bit, we may have memory which cannot be mapped directly.
 	 * This memory will need temporary mapping before it can be
 	 * accessed.
 	 */
 	if (!MIPS_DIRECT_MAPPABLE(phys_avail[i - 1] - 1))
 		need_local_mappings = 1;
 
 	/*
 	 * Copy the phys_avail[] array before we start stealing memory from it.
 	 */
 	for (i = 0; phys_avail[i + 1] != 0; i += 2) {
 		physmem_desc[i] = phys_avail[i];
 		physmem_desc[i + 1] = phys_avail[i + 1];
 	}
 
 	Maxmem = atop(phys_avail[i - 1]);
 
 	if (bootverbose) {
 		printf("Physical memory chunk(s):\n");
 		for (i = 0; phys_avail[i + 1] != 0; i += 2) {
 			vm_paddr_t size;
 
 			size = phys_avail[i + 1] - phys_avail[i];
 			printf("%#08jx - %#08jx, %ju bytes (%ju pages)\n",
 			    (uintmax_t) phys_avail[i],
 			    (uintmax_t) phys_avail[i + 1] - 1,
 			    (uintmax_t) size, (uintmax_t) size / PAGE_SIZE);
 		}
 		printf("Maxmem is 0x%0jx\n", ptoa((uintmax_t)Maxmem));
 	}
 	/*
 	 * Steal the message buffer from the beginning of memory.
 	 */
 	msgbufp = (struct msgbuf *)pmap_steal_memory(msgbufsize);
 	msgbufinit(msgbufp, msgbufsize);
 
 	/*
 	 * Steal thread0 kstack.
 	 */
 	kstack0 = pmap_steal_memory(KSTACK_PAGES << PAGE_SHIFT);
 
 	virtual_avail = VM_MIN_KERNEL_ADDRESS;
 	virtual_end = VM_MAX_KERNEL_ADDRESS;
 
 #ifdef SMP
 	/*
 	 * Steal some virtual address space to map the pcpu area.
 	 */
 	virtual_avail = roundup2(virtual_avail, PAGE_SIZE * 2);
 	pcpup = (struct pcpu *)virtual_avail;
 	virtual_avail += PAGE_SIZE * 2;
 
 	/*
 	 * Initialize the wired TLB entry mapping the pcpu region for
 	 * the BSP at 'pcpup'. Up until this point we were operating
 	 * with the 'pcpup' for the BSP pointing to a virtual address
 	 * in KSEG0 so there was no need for a TLB mapping.
 	 */
 	mips_pcpu_tlb_init(PCPU_ADDR(0));
 
 	if (bootverbose)
 		printf("pcpu is available at virtual address %p.\n", pcpup);
 #endif
 
 	if (need_local_mappings)
 		pmap_alloc_lmem_map();
 	pmap_create_kernel_pagetable();
 	pmap_max_asid = VMNUM_PIDS;
 	mips_wr_entryhi(0);
 	mips_wr_pagemask(0);
 }
 
 /*
  * Initialize a vm_page's machine-dependent fields.
  */
 void
 pmap_page_init(vm_page_t m)
 {
 
 	TAILQ_INIT(&m->md.pv_list);
 	m->md.pv_list_count = 0;
 	m->md.pv_flags = 0;
 }
 
 /*
  *	Initialize the pmap module.
  *	Called by vm_init, to initialize any structures that the pmap
  *	system needs to map virtual memory.
  *	pmap_init has been enhanced to support discontiguous physical
  *	memory in a fairly consistent way.
  */
 void
 pmap_init(void)
 {
 
 	/*
 	 * Initialize the address space (zone) for the pv entries.  Set a
 	 * high water mark so that the system can recover from excessive
 	 * numbers of pv entries.
 	 */
 	pvzone = uma_zcreate("PV ENTRY", sizeof(struct pv_entry), NULL, NULL,
 	    NULL, NULL, UMA_ALIGN_PTR, UMA_ZONE_VM | UMA_ZONE_NOFREE);
 	pv_entry_max = PMAP_SHPGPERPROC * maxproc + cnt.v_page_count;
 	pv_entry_high_water = 9 * (pv_entry_max / 10);
 	uma_zone_set_obj(pvzone, &pvzone_obj, pv_entry_max);
 }
 
 /***************************************************
  * Low level helper routines.....
  ***************************************************/
 
 static __inline void
 pmap_invalidate_all_local(pmap_t pmap)
 {
 
 	if (pmap == kernel_pmap) {
 		tlb_invalidate_all();
 		return;
 	}
-	if (pmap->pm_active & PCPU_GET(cpumask))
+	sched_pin();
+	if (CPU_OVERLAP(&pmap->pm_active, PCPU_PTR(cpumask))) {
+		sched_unpin();
 		tlb_invalidate_all_user(pmap);
-	else
+	} else {
+		sched_unpin();
 		pmap->pm_asid[PCPU_GET(cpuid)].gen = 0;
+	}
 }
 
 #ifdef SMP
 static void
 pmap_invalidate_all(pmap_t pmap)
 {
 
 	smp_rendezvous(0, pmap_invalidate_all_action, 0, pmap);
 }
 
 static void
 pmap_invalidate_all_action(void *arg)
 {
 
 	pmap_invalidate_all_local((pmap_t)arg);
 }
 #else
 static void
 pmap_invalidate_all(pmap_t pmap)
 {
 
 	pmap_invalidate_all_local(pmap);
 }
 #endif
 
 static __inline void
 pmap_invalidate_page_local(pmap_t pmap, vm_offset_t va)
 {
 
 	if (is_kernel_pmap(pmap)) {
 		tlb_invalidate_address(pmap, va);
 		return;
 	}
-	if (pmap->pm_asid[PCPU_GET(cpuid)].gen != PCPU_GET(asid_generation))
+	sched_pin();
+	if (pmap->pm_asid[PCPU_GET(cpuid)].gen != PCPU_GET(asid_generation)) {
+		sched_unpin();
 		return;
-	else if (!(pmap->pm_active & PCPU_GET(cpumask))) {
+	} else if (!CPU_OVERLAP(&pmap->pm_active, PCPU_PTR(cpumask))) {
 		pmap->pm_asid[PCPU_GET(cpuid)].gen = 0;
+		sched_unpin();
 		return;
 	}
+	sched_unpin();
 	tlb_invalidate_address(pmap, va);
 }
 
 #ifdef SMP
 struct pmap_invalidate_page_arg {
 	pmap_t pmap;
 	vm_offset_t va;
 };
 
 static void
 pmap_invalidate_page(pmap_t pmap, vm_offset_t va)
 {
 	struct pmap_invalidate_page_arg arg;
 
 	arg.pmap = pmap;
 	arg.va = va;
 	smp_rendezvous(0, pmap_invalidate_page_action, 0, &arg);
 }
 
 static void
 pmap_invalidate_page_action(void *arg)
 {
 	struct pmap_invalidate_page_arg *p = arg;
 
 	pmap_invalidate_page_local(p->pmap, p->va);
 }
 #else
 static void
 pmap_invalidate_page(pmap_t pmap, vm_offset_t va)
 {
 
 	pmap_invalidate_page_local(pmap, va);
 }
 #endif
 
 static __inline void
 pmap_update_page_local(pmap_t pmap, vm_offset_t va, pt_entry_t pte)
 {
 
 	if (is_kernel_pmap(pmap)) {
 		tlb_update(pmap, va, pte);
 		return;
 	}
-	if (pmap->pm_asid[PCPU_GET(cpuid)].gen != PCPU_GET(asid_generation))
+	sched_pin();
+	if (pmap->pm_asid[PCPU_GET(cpuid)].gen != PCPU_GET(asid_generation)) {
+		sched_unpin();
 		return;
-	else if (!(pmap->pm_active & PCPU_GET(cpumask))) {
+	} else if (!CPU_OVERLAP(&pmap->pm_active, PCPU_PTR(cpumask))) {
 		pmap->pm_asid[PCPU_GET(cpuid)].gen = 0;
+		sched_unpin();
 		return;
 	}
+	sched_unpin();
 	tlb_update(pmap, va, pte);
 }
 
 #ifdef SMP
 struct pmap_update_page_arg {
 	pmap_t pmap;
 	vm_offset_t va;
 	pt_entry_t pte;
 };
 
 static void
 pmap_update_page(pmap_t pmap, vm_offset_t va, pt_entry_t pte)
 {
 	struct pmap_update_page_arg arg;
 
 	arg.pmap = pmap;
 	arg.va = va;
 	arg.pte = pte;
 	smp_rendezvous(0, pmap_update_page_action, 0, &arg);
 }
 
 static void
 pmap_update_page_action(void *arg)
 {
 	struct pmap_update_page_arg *p = arg;
 
 	pmap_update_page_local(p->pmap, p->va, p->pte);
 }
 #else
 static void
 pmap_update_page(pmap_t pmap, vm_offset_t va, pt_entry_t pte)
 {
 
 	pmap_update_page_local(pmap, va, pte);
 }
 #endif
 
 /*
  *	Routine:	pmap_extract
  *	Function:
  *		Extract the physical page address associated
  *		with the given map/virtual_address pair.
  */
 vm_paddr_t
 pmap_extract(pmap_t pmap, vm_offset_t va)
 {
 	pt_entry_t *pte;
 	vm_offset_t retval = 0;
 
 	PMAP_LOCK(pmap);
 	pte = pmap_pte(pmap, va);
 	if (pte) {
 		retval = TLBLO_PTE_TO_PA(*pte) | (va & PAGE_MASK);
 	}
 	PMAP_UNLOCK(pmap);
 	return (retval);
 }
 
 /*
  *	Routine:	pmap_extract_and_hold
  *	Function:
  *		Atomically extract and hold the physical page
  *		with the given pmap and virtual address pair
  *		if that mapping permits the given protection.
  */
 vm_page_t
 pmap_extract_and_hold(pmap_t pmap, vm_offset_t va, vm_prot_t prot)
 {
 	pt_entry_t pte;
 	vm_page_t m;
 	vm_paddr_t pa;
 
 	m = NULL;
 	pa = 0;
 	PMAP_LOCK(pmap);
 retry:
 	pte = *pmap_pte(pmap, va);
 	if (pte != 0 && pte_test(&pte, PTE_V) &&
 	    (pte_test(&pte, PTE_D) || (prot & VM_PROT_WRITE) == 0)) {
 		if (vm_page_pa_tryrelock(pmap, TLBLO_PTE_TO_PA(pte), &pa))
 			goto retry;
 
 		m = PHYS_TO_VM_PAGE(TLBLO_PTE_TO_PA(pte));
 		vm_page_hold(m);
 	}
 	PA_UNLOCK_COND(pa);
 	PMAP_UNLOCK(pmap);
 	return (m);
 }
 
 /***************************************************
  * Low level mapping routines.....
  ***************************************************/
 
 /*
  * add a wired page to the kva
  */
 void
 pmap_kenter_attr(vm_offset_t va, vm_paddr_t pa, int attr)
 {
 	pt_entry_t *pte;
 	pt_entry_t opte, npte;
 
 #ifdef PMAP_DEBUG
 	printf("pmap_kenter:  va: %p -> pa: %p\n", (void *)va, (void *)pa);
 #endif
 	npte = TLBLO_PA_TO_PFN(pa) | PTE_D | PTE_V | PTE_G | PTE_W | attr;
 
 	pte = pmap_pte(kernel_pmap, va);
 	opte = *pte;
 	*pte = npte;
 	if (pte_test(&opte, PTE_V) && opte != npte)
 		pmap_update_page(kernel_pmap, va, npte);
 }
 
 void
 pmap_kenter(vm_offset_t va, vm_paddr_t pa)
 {
 
 	KASSERT(is_cacheable_mem(pa),
 		("pmap_kenter: memory at 0x%lx is not cacheable", (u_long)pa));
 
 	pmap_kenter_attr(va, pa, PTE_C_CACHE);
 }
 
 /*
  * remove a page from the kernel pagetables
  */
  /* PMAP_INLINE */ void
 pmap_kremove(vm_offset_t va)
 {
 	pt_entry_t *pte;
 
 	/*
 	 * Write back all caches from the page being destroyed
 	 */
 	mips_dcache_wbinv_range_index(va, PAGE_SIZE);
 
 	pte = pmap_pte(kernel_pmap, va);
 	*pte = PTE_G;
 	pmap_invalidate_page(kernel_pmap, va);
 }
 
 /*
  *	Used to map a range of physical addresses into kernel
  *	virtual address space.
  *
  *	The value passed in '*virt' is a suggested virtual address for
  *	the mapping. Architectures which can support a direct-mapped
  *	physical to virtual region can return the appropriate address
  *	within that region, leaving '*virt' unchanged. Other
  *	architectures should map the pages starting at '*virt' and
  *	update '*virt' with the first usable address after the mapped
  *	region.
  *
  *	Use XKPHYS for 64 bit, and KSEG0 where possible for 32 bit.
  */
 vm_offset_t
 pmap_map(vm_offset_t *virt, vm_paddr_t start, vm_paddr_t end, int prot)
 {
 	vm_offset_t va, sva;
 
 	if (MIPS_DIRECT_MAPPABLE(end - 1))
 		return (MIPS_PHYS_TO_DIRECT(start));
 
 	va = sva = *virt;
 	while (start < end) {
 		pmap_kenter(va, start);
 		va += PAGE_SIZE;
 		start += PAGE_SIZE;
 	}
 	*virt = va;
 	return (sva);
 }
 
 /*
  * Add a list of wired pages to the kva
  * this routine is only used for temporary
  * kernel mappings that do not need to have
  * page modification or references recorded.
  * Note that old mappings are simply written
  * over.  The page *must* be wired.
  */
 void
 pmap_qenter(vm_offset_t va, vm_page_t *m, int count)
 {
 	int i;
 	vm_offset_t origva = va;
 
 	for (i = 0; i < count; i++) {
 		pmap_flush_pvcache(m[i]);
 		pmap_kenter(va, VM_PAGE_TO_PHYS(m[i]));
 		va += PAGE_SIZE;
 	}
 
 	mips_dcache_wbinv_range_index(origva, PAGE_SIZE*count);
 }
 
 /*
  * this routine jerks page mappings from the
  * kernel -- it is meant only for temporary mappings.
  */
 void
 pmap_qremove(vm_offset_t va, int count)
 {
 	/*
 	 * No need to wb/inv caches here, 
 	 *   pmap_kremove will do it for us
 	 */
 
 	while (count-- > 0) {
 		pmap_kremove(va);
 		va += PAGE_SIZE;
 	}
 }
 
 /***************************************************
  * Page table page management routines.....
  ***************************************************/
 
 /*  Revision 1.507
  *
  * Simplify the reference counting of page table pages.	 Specifically, use
  * the page table page's wired count rather than its hold count to contain
  * the reference count.
  */
 
 /*
  * This routine drops a reference on a page table page by decrementing
  * its wire count; if the count reaches zero, the page is unmapped and freed.
  */
 static PMAP_INLINE int
 pmap_unwire_pte_hold(pmap_t pmap, vm_offset_t va, vm_page_t m)
 {
 	--m->wire_count;
 	if (m->wire_count == 0)
 		return (_pmap_unwire_pte_hold(pmap, va, m));
 	else
 		return (0);
 }
 
 static int
 _pmap_unwire_pte_hold(pmap_t pmap, vm_offset_t va, vm_page_t m)
 {
 	pd_entry_t *pde;
 
 	PMAP_LOCK_ASSERT(pmap, MA_OWNED);
 	/*
 	 * unmap the page table page
 	 */
 #ifdef __mips_n64
 	if (m->pindex < NUPDE)
 		pde = pmap_pde(pmap, va);
 	else
 		pde = pmap_segmap(pmap, va);
 #else
 	pde = pmap_pde(pmap, va);
 #endif
 	*pde = 0;
 	pmap->pm_stats.resident_count--;
 
 #ifdef __mips_n64
 	if (m->pindex < NUPDE) {
 		pd_entry_t *pdp;
 		vm_page_t pdpg;
 
 		/*
 		 * Recursively decrement next level pagetable refcount
 		 */
 		pdp = (pd_entry_t *)*pmap_segmap(pmap, va);
 		pdpg = PHYS_TO_VM_PAGE(MIPS_DIRECT_TO_PHYS(pdp));
 		pmap_unwire_pte_hold(pmap, va, pdpg);
 	}
 #endif
 	if (pmap->pm_ptphint == m)
 		pmap->pm_ptphint = NULL;
 
 	/*
 	 * If the page is finally unwired, simply free it.
 	 */
 	vm_page_free_zero(m);
 	atomic_subtract_int(&cnt.v_wire_count, 1);
 	return (1);
 }
 
 /*
  * After removing a page table entry, this routine is used to
  * conditionally free the page, and manage the hold/wire counts.
  */
 static int
 pmap_unuse_pt(pmap_t pmap, vm_offset_t va, vm_page_t mpte)
 {
 	unsigned ptepindex;
 	pd_entry_t pteva;
 
 	if (va >= VM_MAXUSER_ADDRESS)
 		return (0);
 
 	if (mpte == NULL) {
 		ptepindex = pmap_pde_pindex(va);
 		if (pmap->pm_ptphint &&
 		    (pmap->pm_ptphint->pindex == ptepindex)) {
 			mpte = pmap->pm_ptphint;
 		} else {
 			pteva = *pmap_pde(pmap, va);
 			mpte = PHYS_TO_VM_PAGE(MIPS_DIRECT_TO_PHYS(pteva));
 			pmap->pm_ptphint = mpte;
 		}
 	}
 	return (pmap_unwire_pte_hold(pmap, va, mpte));
 }
 
 void
 pmap_pinit0(pmap_t pmap)
 {
 	int i;
 
 	PMAP_LOCK_INIT(pmap);
 	pmap->pm_segtab = kernel_segmap;
-	pmap->pm_active = 0;
+	CPU_ZERO(&pmap->pm_active);
 	pmap->pm_ptphint = NULL;
 	for (i = 0; i < MAXCPU; i++) {
 		pmap->pm_asid[i].asid = PMAP_ASID_RESERVED;
 		pmap->pm_asid[i].gen = 0;
 	}
 	PCPU_SET(curpmap, pmap);
 	TAILQ_INIT(&pmap->pm_pvlist);
 	bzero(&pmap->pm_stats, sizeof pmap->pm_stats);
 }
 
 void
 pmap_grow_direct_page_cache()
 {
 
 #ifdef __mips_n64
 	vm_contig_grow_cache(3, 0, MIPS_XKPHYS_LARGEST_PHYS);
 #else
 	vm_contig_grow_cache(3, 0, MIPS_KSEG0_LARGEST_PHYS);
 #endif
 }
 
 vm_page_t
 pmap_alloc_direct_page(unsigned int index, int req)
 {
 	vm_page_t m;
 
 	m = vm_page_alloc_freelist(VM_FREELIST_DIRECT, req);
 	if (m == NULL)
 		return (NULL);
 
 	if ((m->flags & PG_ZERO) == 0)
 		pmap_zero_page(m);
 
 	m->pindex = index;
 	atomic_add_int(&cnt.v_wire_count, 1);
 	m->wire_count = 1;
 	return (m);
 }
 
 /*
  * Initialize a preallocated and zeroed pmap structure,
  * such as one in a vmspace structure.
  */
 int
 pmap_pinit(pmap_t pmap)
 {
 	vm_offset_t ptdva;
 	vm_page_t ptdpg;
 	int i;
 
 	PMAP_LOCK_INIT(pmap);
 
 	/*
 	 * allocate the page directory page
 	 */
 	while ((ptdpg = pmap_alloc_direct_page(NUSERPGTBLS, VM_ALLOC_NORMAL)) == NULL)
 	       pmap_grow_direct_page_cache();
 
 	ptdva = MIPS_PHYS_TO_DIRECT(VM_PAGE_TO_PHYS(ptdpg));
 	pmap->pm_segtab = (pd_entry_t *)ptdva;
-	pmap->pm_active = 0;
+	CPU_ZERO(&pmap->pm_active);
 	pmap->pm_ptphint = NULL;
 	for (i = 0; i < MAXCPU; i++) {
 		pmap->pm_asid[i].asid = PMAP_ASID_RESERVED;
 		pmap->pm_asid[i].gen = 0;
 	}
 	TAILQ_INIT(&pmap->pm_pvlist);
 	bzero(&pmap->pm_stats, sizeof pmap->pm_stats);
 
 	return (1);
 }
 
 /*
  * this routine is called if the page table page is not
  * mapped correctly.
  */
 static vm_page_t
 _pmap_allocpte(pmap_t pmap, unsigned ptepindex, int flags)
 {
 	vm_offset_t pageva;
 	vm_page_t m;
 
 	KASSERT((flags & (M_NOWAIT | M_WAITOK)) == M_NOWAIT ||
 	    (flags & (M_NOWAIT | M_WAITOK)) == M_WAITOK,
 	    ("_pmap_allocpte: flags is neither M_NOWAIT nor M_WAITOK"));
 
 	/*
 	 * Find or fabricate a new pagetable page
 	 */
 	if ((m = pmap_alloc_direct_page(ptepindex, VM_ALLOC_NORMAL)) == NULL) {
 		if (flags & M_WAITOK) {
 			PMAP_UNLOCK(pmap);
 			vm_page_unlock_queues();
 			pmap_grow_direct_page_cache();
 			vm_page_lock_queues();
 			PMAP_LOCK(pmap);
 		}
 
 		/*
 		 * Indicate the need to retry.	While waiting, the page
 		 * table page may have been allocated.
 		 */
 		return (NULL);
 	}
 
 	/*
 	 * Map the pagetable page into the process address space, if it
 	 * isn't already there.
 	 */
 	pageva = MIPS_PHYS_TO_DIRECT(VM_PAGE_TO_PHYS(m));
 
 #ifdef __mips_n64
 	if (ptepindex >= NUPDE) {
 		pmap->pm_segtab[ptepindex - NUPDE] = (pd_entry_t)pageva;
 	} else {
 		pd_entry_t *pdep, *pde;
 		int segindex = ptepindex >> (SEGSHIFT - PDRSHIFT);
 		int pdeindex = ptepindex & (NPDEPG - 1);
 		vm_page_t pg;
 		
 		pdep = &pmap->pm_segtab[segindex];
 		if (*pdep == NULL) { 
 			/* recurse for allocating page dir */
 			if (_pmap_allocpte(pmap, NUPDE + segindex, 
 			    flags) == NULL) {
 				/* alloc failed, release current */
 				--m->wire_count;
 				atomic_subtract_int(&cnt.v_wire_count, 1);
 				vm_page_free_zero(m);
 				return (NULL);
 			}
 		} else {
 			pg = PHYS_TO_VM_PAGE(MIPS_DIRECT_TO_PHYS(*pdep));
 			pg->wire_count++;
 		}
 		/* Next level entry */
 		pde = (pd_entry_t *)*pdep;
 		pde[pdeindex] = (pd_entry_t)pageva;
 		pmap->pm_ptphint = m;
 	}
 #else
 	pmap->pm_segtab[ptepindex] = (pd_entry_t)pageva;
 #endif
 	pmap->pm_stats.resident_count++;
 
 	/*
 	 * Set the page table hint
 	 */
 	pmap->pm_ptphint = m;
 	return (m);
 }
 
 static vm_page_t
 pmap_allocpte(pmap_t pmap, vm_offset_t va, int flags)
 {
 	unsigned ptepindex;
 	pd_entry_t *pde;
 	vm_page_t m;
 
 	KASSERT((flags & (M_NOWAIT | M_WAITOK)) == M_NOWAIT ||
 	    (flags & (M_NOWAIT | M_WAITOK)) == M_WAITOK,
 	    ("pmap_allocpte: flags is neither M_NOWAIT nor M_WAITOK"));
 
 	/*
 	 * Calculate pagetable page index
 	 */
 	ptepindex = pmap_pde_pindex(va);
 retry:
 	/*
 	 * Get the page directory entry
 	 */
 	pde = pmap_pde(pmap, va);
 
 	/*
 	 * If the page table page is mapped, we just increment the hold
 	 * count, and activate it.
 	 */
 	if (pde != NULL && *pde != NULL) {
 		/*
 		 * In order to get the page table page, try the hint first.
 		 */
 		if (pmap->pm_ptphint &&
 		    (pmap->pm_ptphint->pindex == ptepindex)) {
 			m = pmap->pm_ptphint;
 		} else {
 			m = PHYS_TO_VM_PAGE(MIPS_DIRECT_TO_PHYS(*pde));
 			pmap->pm_ptphint = m;
 		}
 		m->wire_count++;
 	} else {
 		/*
 		 * Here if the pte page isn't mapped, or if it has been
 		 * deallocated.
 		 */
 		m = _pmap_allocpte(pmap, ptepindex, flags);
 		if (m == NULL && (flags & M_WAITOK))
 			goto retry;
 	}
 	return (m);
 }
 
 
 /***************************************************
 * Pmap allocation/deallocation routines.
  ***************************************************/
 /*
  *  Revision 1.397
  *  - Merged pmap_release and pmap_release_free_page.  When pmap_release is
  *    called only the page directory page(s) can be left in the pmap pte
  *    object, since all page table pages will have been freed by
  *    pmap_remove_pages and pmap_remove.  In addition, there can only be one
  *    reference to the pmap and the page directory is wired, so the page(s)
  *    can never be busy.  So all there is to do is clear the magic mappings
  *    from the page directory and free the page(s).
  */
 
 
 /*
  * Release any resources held by the given physical map.
  * Called when a pmap initialized by pmap_pinit is being released.
  * Should only be called if the map contains no valid mappings.
  */
 void
 pmap_release(pmap_t pmap)
 {
 	vm_offset_t ptdva;
 	vm_page_t ptdpg;
 
 	KASSERT(pmap->pm_stats.resident_count == 0,
 	    ("pmap_release: pmap resident count %ld != 0",
 	    pmap->pm_stats.resident_count));
 
 	ptdva = (vm_offset_t)pmap->pm_segtab;
 	ptdpg = PHYS_TO_VM_PAGE(MIPS_DIRECT_TO_PHYS(ptdva));
 
 	ptdpg->wire_count--;
 	atomic_subtract_int(&cnt.v_wire_count, 1);
 	vm_page_free_zero(ptdpg);
 	PMAP_LOCK_DESTROY(pmap);
 }
 
 /*
  * grow the number of kernel page table entries, if needed
  */
 void
 pmap_growkernel(vm_offset_t addr)
 {
 	vm_page_t nkpg;
 	pd_entry_t *pde, *pdpe;
 	pt_entry_t *pte;
 	int i;
 
 	mtx_assert(&kernel_map->system_mtx, MA_OWNED);
 	addr = roundup2(addr, NBSEG);
 	if (addr - 1 >= kernel_map->max_offset)
 		addr = kernel_map->max_offset;
 	while (kernel_vm_end < addr) {
 		pdpe = pmap_segmap(kernel_pmap, kernel_vm_end);
 #ifdef __mips_n64
 		if (*pdpe == 0) {
 			/* new intermediate page table entry */
 			nkpg = pmap_alloc_direct_page(nkpt, VM_ALLOC_INTERRUPT);
 			if (nkpg == NULL)
 				panic("pmap_growkernel: no memory to grow kernel");
 			*pdpe = (pd_entry_t)MIPS_PHYS_TO_DIRECT(VM_PAGE_TO_PHYS(nkpg));
 			continue; /* try again */
 		}
 #endif
 		pde = pmap_pdpe_to_pde(pdpe, kernel_vm_end);
 		if (*pde != 0) {
 			kernel_vm_end = (kernel_vm_end + NBPDR) & ~PDRMASK;
 			if (kernel_vm_end - 1 >= kernel_map->max_offset) {
 				kernel_vm_end = kernel_map->max_offset;
 				break;
 			}
 			continue;
 		}
 
 		/*
 		 * This index is bogus, but out of the way
 		 */
 		nkpg = pmap_alloc_direct_page(nkpt, VM_ALLOC_INTERRUPT);
 		if (!nkpg)
 			panic("pmap_growkernel: no memory to grow kernel");
 		nkpt++;
 		*pde = (pd_entry_t)MIPS_PHYS_TO_DIRECT(VM_PAGE_TO_PHYS(nkpg));
 
 		/*
 		 * The R[4-7]?00 stores only one copy of the Global bit in
 		 * the translation lookaside buffer for each 2 page entry.
 		 * Thus invalid entries must have the Global bit set so when
 		 * Entry LO and Entry HI G bits are anded together they will
 		 * produce a global bit to store in the tlb.
 		 */
 		pte = (pt_entry_t *)*pde;
 		for (i = 0; i < NPTEPG; i++)
 			pte[i] = PTE_G;
 
 		kernel_vm_end = (kernel_vm_end + NBPDR) & ~PDRMASK;
 		if (kernel_vm_end - 1 >= kernel_map->max_offset) {
 			kernel_vm_end = kernel_map->max_offset;
 			break;
 		}
 	}
 }
 
 /***************************************************
 * page management routines.
  ***************************************************/
 
 /*
  * free the pv_entry back to the free list
  */
 static PMAP_INLINE void
 free_pv_entry(pv_entry_t pv)
 {
 
 	pv_entry_count--;
 	uma_zfree(pvzone, pv);
 }
 
 /*
  * get a new pv_entry, allocating a block from the system
  * when needed.
  * the memory allocation is performed bypassing the malloc code
  * because of the possibility of allocations at interrupt time.
  */
 static pv_entry_t
 get_pv_entry(pmap_t locked_pmap)
 {
 	static const struct timeval printinterval = { 60, 0 };
 	static struct timeval lastprint;
 	struct vpgqueues *vpq;
 	pt_entry_t *pte, oldpte;
 	pmap_t pmap;
 	pv_entry_t allocated_pv, next_pv, pv;
 	vm_offset_t va;
 	vm_page_t m;
 
 	PMAP_LOCK_ASSERT(locked_pmap, MA_OWNED);
 	mtx_assert(&vm_page_queue_mtx, MA_OWNED);
 	allocated_pv = uma_zalloc(pvzone, M_NOWAIT);
 	if (allocated_pv != NULL) {
 		pv_entry_count++;
 		if (pv_entry_count > pv_entry_high_water)
 			pagedaemon_wakeup();
 		else
 			return (allocated_pv);
 	}
 	/*
 	 * Reclaim pv entries: At first, destroy mappings to inactive
 	 * pages.  After that, if a pv entry is still needed, destroy
 	 * mappings to active pages.
 	 */
 	if (ratecheck(&lastprint, &printinterval))
 		printf("Approaching the limit on PV entries, "
 		    "increase the vm.pmap.shpgperproc tunable.\n");
 	vpq = &vm_page_queues[PQ_INACTIVE];
 retry:
 	TAILQ_FOREACH(m, &vpq->pl, pageq) {
 		if (m->hold_count || m->busy)
 			continue;
 		TAILQ_FOREACH_SAFE(pv, &m->md.pv_list, pv_list, next_pv) {
 			va = pv->pv_va;
 			pmap = pv->pv_pmap;
 			/* Avoid deadlock and lock recursion. */
 			if (pmap > locked_pmap)
 				PMAP_LOCK(pmap);
 			else if (pmap != locked_pmap && !PMAP_TRYLOCK(pmap))
 				continue;
 			pmap->pm_stats.resident_count--;
 			pte = pmap_pte(pmap, va);
 			KASSERT(pte != NULL, ("pte"));
 			oldpte = *pte;
 			if (is_kernel_pmap(pmap))
 				*pte = PTE_G;
 			else
 				*pte = 0;
 			KASSERT(!pte_test(&oldpte, PTE_W),
 			    ("wired pte for unwired page"));
 			if (m->md.pv_flags & PV_TABLE_REF)
 				vm_page_flag_set(m, PG_REFERENCED);
 			if (pte_test(&oldpte, PTE_D))
 				vm_page_dirty(m);
 			pmap_invalidate_page(pmap, va);
 			TAILQ_REMOVE(&pmap->pm_pvlist, pv, pv_plist);
 			m->md.pv_list_count--;
 			TAILQ_REMOVE(&m->md.pv_list, pv, pv_list);
 			pmap_unuse_pt(pmap, va, pv->pv_ptem);
 			if (pmap != locked_pmap)
 				PMAP_UNLOCK(pmap);
 			if (allocated_pv == NULL)
 				allocated_pv = pv;
 			else
 				free_pv_entry(pv);
 		}
 		if (TAILQ_EMPTY(&m->md.pv_list)) {
 			vm_page_flag_clear(m, PG_WRITEABLE);
 			m->md.pv_flags &= ~(PV_TABLE_REF | PV_TABLE_MOD);
 		}
 	}
 	if (allocated_pv == NULL) {
 		if (vpq == &vm_page_queues[PQ_INACTIVE]) {
 			vpq = &vm_page_queues[PQ_ACTIVE];
 			goto retry;
 		}
 		panic("get_pv_entry: increase the vm.pmap.shpgperproc tunable");
 	}
 	return (allocated_pv);
 }
 
 /*
  *  Revision 1.370
  *
  *  Move pmap_collect() out of the machine-dependent code, rename it
  *  to reflect its new location, and add page queue and flag locking.
  *
  *  Notes: (1) alpha, i386, and ia64 had identical implementations
  *  of pmap_collect() in terms of machine-independent interfaces;
  *  (2) sparc64 doesn't require it; (3) powerpc had it as a TODO.
  *
  *  MIPS implementation was identical to alpha [Junos 8.2]
  */
 
 /*
  * If it is the first entry on the list, it is actually
  * in the header and we must copy the following entry up
  * to the header.  Otherwise we must search the list for
  * the entry.  In either case we free the now unused entry.
  */
 
 static pv_entry_t
 pmap_pvh_remove(struct md_page *pvh, pmap_t pmap, vm_offset_t va)
 {
 	pv_entry_t pv;
 
 	PMAP_LOCK_ASSERT(pmap, MA_OWNED);
 	mtx_assert(&vm_page_queue_mtx, MA_OWNED);
 	if (pvh->pv_list_count < pmap->pm_stats.resident_count) {
 		TAILQ_FOREACH(pv, &pvh->pv_list, pv_list) {
 			if (pmap == pv->pv_pmap && va == pv->pv_va)
 				break;
 		}
 	} else {
 		TAILQ_FOREACH(pv, &pmap->pm_pvlist, pv_plist) {
 			if (va == pv->pv_va)
 				break;
 		}
 	}
 	if (pv != NULL) {
 		TAILQ_REMOVE(&pvh->pv_list, pv, pv_list);
 		pvh->pv_list_count--;
 		TAILQ_REMOVE(&pmap->pm_pvlist, pv, pv_plist);
 	}
 	return (pv);
 }
 
 static void
 pmap_pvh_free(struct md_page *pvh, pmap_t pmap, vm_offset_t va)
 {
 	pv_entry_t pv;
 
 	pv = pmap_pvh_remove(pvh, pmap, va);
 	KASSERT(pv != NULL, ("pmap_pvh_free: pv not found, pa %lx va %lx",
 	     (u_long)VM_PAGE_TO_PHYS(member2struct(vm_page, md, pvh)),
 	     (u_long)va));
 	free_pv_entry(pv);
 }
 
 static void
 pmap_remove_entry(pmap_t pmap, vm_page_t m, vm_offset_t va)
 {
 
 	mtx_assert(&vm_page_queue_mtx, MA_OWNED);
 	pmap_pvh_free(&m->md, pmap, va);
 	if (TAILQ_EMPTY(&m->md.pv_list))
 		vm_page_flag_clear(m, PG_WRITEABLE);
 }
 
 /*
  * Conditionally create a pv entry.
  */
 static boolean_t
 pmap_try_insert_pv_entry(pmap_t pmap, vm_page_t mpte, vm_offset_t va,
     vm_page_t m)
 {
 	pv_entry_t pv;
 
 	PMAP_LOCK_ASSERT(pmap, MA_OWNED);
 	mtx_assert(&vm_page_queue_mtx, MA_OWNED);
 	if (pv_entry_count < pv_entry_high_water && 
 	    (pv = uma_zalloc(pvzone, M_NOWAIT)) != NULL) {
 		pv_entry_count++;
 		pv->pv_va = va;
 		pv->pv_pmap = pmap;
 		pv->pv_ptem = mpte;
 		TAILQ_INSERT_TAIL(&pmap->pm_pvlist, pv, pv_plist);
 		TAILQ_INSERT_TAIL(&m->md.pv_list, pv, pv_list);
 		m->md.pv_list_count++;
 		return (TRUE);
 	} else
 		return (FALSE);
 }
 
 /*
  * pmap_remove_pte: do the things to unmap a page in a process
  */
 static int
 pmap_remove_pte(struct pmap *pmap, pt_entry_t *ptq, vm_offset_t va)
 {
 	pt_entry_t oldpte;
 	vm_page_t m;
 	vm_paddr_t pa;
 
 	mtx_assert(&vm_page_queue_mtx, MA_OWNED);
 	PMAP_LOCK_ASSERT(pmap, MA_OWNED);
 
 	oldpte = *ptq;
 	if (is_kernel_pmap(pmap))
 		*ptq = PTE_G;
 	else
 		*ptq = 0;
 
 	if (pte_test(&oldpte, PTE_W))
 		pmap->pm_stats.wired_count -= 1;
 
 	pmap->pm_stats.resident_count -= 1;
 	pa = TLBLO_PTE_TO_PA(oldpte);
 
 	if (page_is_managed(pa)) {
 		m = PHYS_TO_VM_PAGE(pa);
 		if (pte_test(&oldpte, PTE_D)) {
 			KASSERT(!pte_test(&oldpte, PTE_RO),
 			    ("%s: modified page not writable: va: %p, pte: %#jx",
 			    __func__, (void *)va, (uintmax_t)oldpte));
 			vm_page_dirty(m);
 		}
 		if (m->md.pv_flags & PV_TABLE_REF)
 			vm_page_flag_set(m, PG_REFERENCED);
 		m->md.pv_flags &= ~(PV_TABLE_REF | PV_TABLE_MOD);
 
 		pmap_remove_entry(pmap, m, va);
 	}
 	return (pmap_unuse_pt(pmap, va, NULL));
 }
 
 /*
  * Remove a single page from a process address space
  */
 static void
 pmap_remove_page(struct pmap *pmap, vm_offset_t va)
 {
 	pt_entry_t *ptq;
 
 	mtx_assert(&vm_page_queue_mtx, MA_OWNED);
 	PMAP_LOCK_ASSERT(pmap, MA_OWNED);
 	ptq = pmap_pte(pmap, va);
 
 	/*
 	 * if there is no pte for this address, just skip it!!!
 	 */
 	if (!ptq || !pte_test(ptq, PTE_V)) {
 		return;
 	}
 
 	/*
 	 * Write back all caches from the page being destroyed
 	 */
 	mips_dcache_wbinv_range_index(va, PAGE_SIZE);
 
 	/*
 	 * get a local va for mappings for this pmap.
 	 */
 	(void)pmap_remove_pte(pmap, ptq, va);
 	pmap_invalidate_page(pmap, va);
 
 	return;
 }
 
 /*
  *	Remove the given range of addresses from the specified map.
  *
  *	It is assumed that the start and end are properly
  *	rounded to the page size.
  */
 void
 pmap_remove(struct pmap *pmap, vm_offset_t sva, vm_offset_t eva)
 {
 	vm_offset_t va_next;
 	pd_entry_t *pde, *pdpe;
 	pt_entry_t *pte;
 
 	if (pmap == NULL)
 		return;
 
 	if (pmap->pm_stats.resident_count == 0)
 		return;
 
 	vm_page_lock_queues();
 	PMAP_LOCK(pmap);
 
 	/*
 	 * special handling of removing one page.  a very common operation
 	 * and easy to short circuit some code.
 	 */
 	if ((sva + PAGE_SIZE) == eva) {
 		pmap_remove_page(pmap, sva);
 		goto out;
 	}
 	for (; sva < eva; sva = va_next) {
 		pdpe = pmap_segmap(pmap, sva);
 #ifdef __mips_n64
 		if (*pdpe == 0) {
 			va_next = (sva + NBSEG) & ~SEGMASK;
 			if (va_next < sva)
 				va_next = eva;
 			continue;
 		}
 #endif
 		va_next = (sva + NBPDR) & ~PDRMASK;
 		if (va_next < sva)
 			va_next = eva;
 
 		pde = pmap_pdpe_to_pde(pdpe, sva);
 		if (*pde == 0)
 			continue;
 		if (va_next > eva)
 			va_next = eva;
 		for (pte = pmap_pde_to_pte(pde, sva); sva != va_next; 
 		    pte++, sva += PAGE_SIZE) {
 			pmap_remove_page(pmap, sva);
 		}
 	}
 out:
 	vm_page_unlock_queues();
 	PMAP_UNLOCK(pmap);
 }
 
 /*
  *	Routine:	pmap_remove_all
  *	Function:
  *		Removes this physical page from
  *		all physical maps in which it resides.
  *		Reflects back modify bits to the pager.
  *
  *	Notes:
  *		Original versions of this routine were very
  *		inefficient because they iteratively called
  *		pmap_remove (slow...)
  */
 
 void
 pmap_remove_all(vm_page_t m)
 {
 	pv_entry_t pv;
 	pt_entry_t *pte, tpte;
 
 	KASSERT((m->flags & PG_FICTITIOUS) == 0,
 	    ("pmap_remove_all: page %p is fictitious", m));
 	vm_page_lock_queues();
 
 	if (m->md.pv_flags & PV_TABLE_REF)
 		vm_page_flag_set(m, PG_REFERENCED);
 
 	while ((pv = TAILQ_FIRST(&m->md.pv_list)) != NULL) {
 		PMAP_LOCK(pv->pv_pmap);
 
 		/*
 		 * If this is the last mapping, write back all caches from
 		 * the page being destroyed.
 		 */
 		if (m->md.pv_list_count == 1) 
 			mips_dcache_wbinv_range_index(pv->pv_va, PAGE_SIZE);
 
 		pv->pv_pmap->pm_stats.resident_count--;
 
 		pte = pmap_pte(pv->pv_pmap, pv->pv_va);
 
 		tpte = *pte;
 		if (is_kernel_pmap(pv->pv_pmap))
 			*pte = PTE_G;
 		else
 			*pte = 0;
 
 		if (pte_test(&tpte, PTE_W))
 			pv->pv_pmap->pm_stats.wired_count--;
 
 		/*
 		 * Update the vm_page_t clean and reference bits.
 		 */
 		if (pte_test(&tpte, PTE_D)) {
 			KASSERT(!pte_test(&tpte, PTE_RO),
 			    ("%s: modified page not writable: va: %p, pte: %#jx",
 			    __func__, (void *)pv->pv_va, (uintmax_t)tpte));
 			vm_page_dirty(m);
 		}
 		pmap_invalidate_page(pv->pv_pmap, pv->pv_va);
 
 		TAILQ_REMOVE(&pv->pv_pmap->pm_pvlist, pv, pv_plist);
 		TAILQ_REMOVE(&m->md.pv_list, pv, pv_list);
 		m->md.pv_list_count--;
 		pmap_unuse_pt(pv->pv_pmap, pv->pv_va, pv->pv_ptem);
 		PMAP_UNLOCK(pv->pv_pmap);
 		free_pv_entry(pv);
 	}
 
 	vm_page_flag_clear(m, PG_WRITEABLE);
 	m->md.pv_flags &= ~(PV_TABLE_REF | PV_TABLE_MOD);
 	vm_page_unlock_queues();
 }
 
 /*
  *	Set the physical protection on the
  *	specified range of this map as requested.
  */
 void
 pmap_protect(pmap_t pmap, vm_offset_t sva, vm_offset_t eva, vm_prot_t prot)
 {
 	pt_entry_t *pte;
 	pd_entry_t *pde, *pdpe;
 	vm_offset_t va_next;
 
 	if (pmap == NULL)
 		return;
 
 	if ((prot & VM_PROT_READ) == VM_PROT_NONE) {
 		pmap_remove(pmap, sva, eva);
 		return;
 	}
 	if (prot & VM_PROT_WRITE)
 		return;
 
 	vm_page_lock_queues();
 	PMAP_LOCK(pmap);
 	for (; sva < eva; sva = va_next) {
 		pt_entry_t pbits;
 		vm_page_t m;
 		vm_paddr_t pa;
 
 		pdpe = pmap_segmap(pmap, sva);
 #ifdef __mips_n64
 		if (*pdpe == 0) {
 			va_next = (sva + NBSEG) & ~SEGMASK;
 			if (va_next < sva)
 				va_next = eva;
 			continue;
 		}
 #endif
 		va_next = (sva + NBPDR) & ~PDRMASK;
 		if (va_next < sva)
 			va_next = eva;
 
 		pde = pmap_pdpe_to_pde(pdpe, sva);
 		if (pde == NULL || *pde == NULL)
 			continue;
 		if (va_next > eva)
 			va_next = eva;
 
 		for (pte = pmap_pde_to_pte(pde, sva); sva != va_next; pte++,
 		     sva += PAGE_SIZE) {
 
 			/* Skip invalid PTEs */
 			if (!pte_test(pte, PTE_V))
 				continue;
 			pbits = *pte;
 			pa = TLBLO_PTE_TO_PA(pbits);
 			if (page_is_managed(pa) && pte_test(&pbits, PTE_D)) {
 				m = PHYS_TO_VM_PAGE(pa);
 				vm_page_dirty(m);
 				m->md.pv_flags &= ~PV_TABLE_MOD;
 			}
 			pte_clear(&pbits, PTE_D);
 			pte_set(&pbits, PTE_RO);
 			
 			if (pbits != *pte) {
 				*pte = pbits;
 				pmap_update_page(pmap, sva, pbits);
 			}
 		}
 	}
 	vm_page_unlock_queues();
 	PMAP_UNLOCK(pmap);
 }
 
 /*
  *	Insert the given physical page (p) at
  *	the specified virtual address (v) in the
  *	target physical map with the protection requested.
  *
  *	If specified, the page will be wired down, meaning
  *	that the related pte can not be reclaimed.
  *
  *	NB:  This is the only routine which MAY NOT lazy-evaluate
  *	or lose information.  That is, this routine must actually
  *	insert this page into the given map NOW.
  */
 void
 pmap_enter(pmap_t pmap, vm_offset_t va, vm_prot_t access, vm_page_t m,
     vm_prot_t prot, boolean_t wired)
 {
 	vm_paddr_t pa, opa;
 	pt_entry_t *pte;
 	pt_entry_t origpte, newpte;
 	pv_entry_t pv;
 	vm_page_t mpte, om;
 	pt_entry_t rw = 0;
 
 	if (pmap == NULL)
 		return;
 
 	va &= ~PAGE_MASK;
  	KASSERT(va <= VM_MAX_KERNEL_ADDRESS, ("pmap_enter: toobig"));
 	KASSERT((m->flags & (PG_FICTITIOUS | PG_UNMANAGED)) != 0 ||
 	    (m->oflags & VPO_BUSY) != 0,
 	    ("pmap_enter: page %p is not busy", m));
 
 	mpte = NULL;
 
 	vm_page_lock_queues();
 	PMAP_LOCK(pmap);
 
 	/*
 	 * In the case that a page table page is not resident, we are
 	 * creating it here.
 	 */
 	if (va < VM_MAXUSER_ADDRESS) {
 		mpte = pmap_allocpte(pmap, va, M_WAITOK);
 	}
 	pte = pmap_pte(pmap, va);
 
 	/*
 	 * Page Directory table entry not valid, we need a new PT page
 	 */
 	if (pte == NULL) {
 		panic("pmap_enter: invalid page directory, pdir=%p, va=%p",
 		    (void *)pmap->pm_segtab, (void *)va);
 	}
 	pa = VM_PAGE_TO_PHYS(m);
 	om = NULL;
 	origpte = *pte;
 	opa = TLBLO_PTE_TO_PA(origpte);
 
 	/*
 	 * Mapping has not changed, must be protection or wiring change.
 	 */
 	if (pte_test(&origpte, PTE_V) && opa == pa) {
 		/*
 		 * Wiring change, just update stats. We don't worry about
 		 * wiring PT pages as they remain resident as long as there
 		 * are valid mappings in them. Hence, if a user page is
 		 * wired, the PT page will be also.
 		 */
 		if (wired && !pte_test(&origpte, PTE_W))
 			pmap->pm_stats.wired_count++;
 		else if (!wired && pte_test(&origpte, PTE_W))
 			pmap->pm_stats.wired_count--;
 
 		KASSERT(!pte_test(&origpte, PTE_D | PTE_RO),
 		    ("%s: modified page not writable: va: %p, pte: %#jx",
 		    __func__, (void *)va, (uintmax_t)origpte));
 
 		/*
 		 * Remove extra pte reference
 		 */
 		if (mpte)
 			mpte->wire_count--;
 
 		if (page_is_managed(opa)) {
 			om = m;
 		}
 		goto validate;
 	}
 
 	pv = NULL;
 
 	/*
 	 * Mapping has changed, invalidate old range and fall through to
 	 * handle validating new mapping.
 	 */
 	if (opa) {
 		if (pte_test(&origpte, PTE_W))
 			pmap->pm_stats.wired_count--;
 
 		if (page_is_managed(opa)) {
 			om = PHYS_TO_VM_PAGE(opa);
 			pv = pmap_pvh_remove(&om->md, pmap, va);
 		}
 		if (mpte != NULL) {
 			mpte->wire_count--;
 			KASSERT(mpte->wire_count > 0,
 			    ("pmap_enter: missing reference to page table page,"
 			    " va: %p", (void *)va));
 		}
 	} else
 		pmap->pm_stats.resident_count++;
 
 	/*
 	 * Enter on the PV list if part of our managed memory. Note that we
 	 * raise IPL while manipulating pv_table since pmap_enter can be
 	 * called at interrupt time.
 	 */
 	if ((m->flags & (PG_FICTITIOUS | PG_UNMANAGED)) == 0) {
 		KASSERT(va < kmi.clean_sva || va >= kmi.clean_eva,
 		    ("pmap_enter: managed mapping within the clean submap"));
 		if (pv == NULL)
 			pv = get_pv_entry(pmap);
 		pv->pv_va = va;
 		pv->pv_pmap = pmap;
 		pv->pv_ptem = mpte;
 		TAILQ_INSERT_TAIL(&pmap->pm_pvlist, pv, pv_plist);
 		TAILQ_INSERT_TAIL(&m->md.pv_list, pv, pv_list);
 		m->md.pv_list_count++;
 	} else if (pv != NULL)
 		free_pv_entry(pv);
 
 	/*
 	 * Increment counters
 	 */
 	if (wired)
 		pmap->pm_stats.wired_count++;
 
 validate:
 	if ((access & VM_PROT_WRITE) != 0)
 		m->md.pv_flags |= PV_TABLE_MOD | PV_TABLE_REF;
 	rw = init_pte_prot(va, m, prot);
 
 #ifdef PMAP_DEBUG
 	printf("pmap_enter:  va: %p -> pa: %p\n", (void *)va, (void *)pa);
 #endif
 	/*
 	 * Now validate mapping with desired protection/wiring.
 	 */
 	newpte = TLBLO_PA_TO_PFN(pa) | rw | PTE_V;
 
 	if (is_cacheable_mem(pa))
 		newpte |= PTE_C_CACHE;
 	else
 		newpte |= PTE_C_UNCACHED;
 
 	if (wired)
 		newpte |= PTE_W;
 
 	if (is_kernel_pmap(pmap))
 		newpte |= PTE_G;
 
 	/*
 	 * if the mapping or permission bits are different, we need to
 	 * update the pte.
 	 */
 	if (origpte != newpte) {
 		if (pte_test(&origpte, PTE_V)) {
 			*pte = newpte;
 			if (page_is_managed(opa) && (opa != pa)) {
 				if (om->md.pv_flags & PV_TABLE_REF)
 					vm_page_flag_set(om, PG_REFERENCED);
 				om->md.pv_flags &=
 				    ~(PV_TABLE_REF | PV_TABLE_MOD);
 			}
 			if (pte_test(&origpte, PTE_D)) {
 				KASSERT(!pte_test(&origpte, PTE_RO),
 				    ("pmap_enter: modified page not writable:"
 				    " va: %p, pte: %#jx", (void *)va, (uintmax_t)origpte));
 				if (page_is_managed(opa))
 					vm_page_dirty(om);
 			}
 			if (page_is_managed(opa) &&
 			    TAILQ_EMPTY(&om->md.pv_list))
 				vm_page_flag_clear(om, PG_WRITEABLE);
 		} else {
 			*pte = newpte;
 		}
 	}
 	pmap_update_page(pmap, va, newpte);
 
 	/*
 	 * Sync I & D caches for executable pages.  Do this only if the
 	 * target pmap belongs to the current process.  Otherwise, an
 	 * unresolvable TLB miss may occur.
 	 */
 	if (!is_kernel_pmap(pmap) && (pmap == &curproc->p_vmspace->vm_pmap) &&
 	    (prot & VM_PROT_EXECUTE)) {
 		mips_icache_sync_range(va, PAGE_SIZE);
 		mips_dcache_wbinv_range(va, PAGE_SIZE);
 	}
 	vm_page_unlock_queues();
 	PMAP_UNLOCK(pmap);
 }
 
 /*
  * this code makes some *MAJOR* assumptions:
  * 1. Current pmap & pmap exists.
  * 2. Not wired.
  * 3. Read access.
  * 4. No page table pages.
  * but is *MUCH* faster than pmap_enter...
  */
 
 void
 pmap_enter_quick(pmap_t pmap, vm_offset_t va, vm_page_t m, vm_prot_t prot)
 {
 
 	vm_page_lock_queues();
 	PMAP_LOCK(pmap);
 	(void)pmap_enter_quick_locked(pmap, va, m, prot, NULL);
 	vm_page_unlock_queues();
 	PMAP_UNLOCK(pmap);
 }
 
 static vm_page_t
 pmap_enter_quick_locked(pmap_t pmap, vm_offset_t va, vm_page_t m,
     vm_prot_t prot, vm_page_t mpte)
 {
 	pt_entry_t *pte;
 	vm_paddr_t pa;
 
 	KASSERT(va < kmi.clean_sva || va >= kmi.clean_eva ||
 	    (m->flags & (PG_FICTITIOUS | PG_UNMANAGED)) != 0,
 	    ("pmap_enter_quick_locked: managed mapping within the clean submap"));
 	mtx_assert(&vm_page_queue_mtx, MA_OWNED);
 	PMAP_LOCK_ASSERT(pmap, MA_OWNED);
 
 	/*
 	 * In the case that a page table page is not resident, we are
 	 * creating it here.
 	 */
 	if (va < VM_MAXUSER_ADDRESS) {
 		pd_entry_t *pde;
 		unsigned ptepindex;
 
 		/*
 		 * Calculate pagetable page index
 		 */
 		ptepindex = pmap_pde_pindex(va);
 		if (mpte && (mpte->pindex == ptepindex)) {
 			mpte->wire_count++;
 		} else {
 			/*
 			 * Get the page directory entry
 			 */
 			pde = pmap_pde(pmap, va);
 
 			/*
 			 * If the page table page is mapped, we just
 			 * increment the hold count, and activate it.
 			 */
 			if (pde && *pde != 0) {
 				if (pmap->pm_ptphint &&
 				    (pmap->pm_ptphint->pindex == ptepindex)) {
 					mpte = pmap->pm_ptphint;
 				} else {
 					mpte = PHYS_TO_VM_PAGE(
 						MIPS_DIRECT_TO_PHYS(*pde));
 					pmap->pm_ptphint = mpte;
 				}
 				mpte->wire_count++;
 			} else {
 				mpte = _pmap_allocpte(pmap, ptepindex,
 				    M_NOWAIT);
 				if (mpte == NULL)
 					return (mpte);
 			}
 		}
 	} else {
 		mpte = NULL;
 	}
 
 	pte = pmap_pte(pmap, va);
 	if (pte_test(pte, PTE_V)) {
 		if (mpte != NULL) {
 			mpte->wire_count--;
 			mpte = NULL;
 		}
 		return (mpte);
 	}
 
 	/*
 	 * Enter on the PV list if part of our managed memory.
 	 */
 	if ((m->flags & (PG_FICTITIOUS | PG_UNMANAGED)) == 0 &&
 	    !pmap_try_insert_pv_entry(pmap, mpte, va, m)) {
 		if (mpte != NULL) {
 			pmap_unwire_pte_hold(pmap, va, mpte);
 			mpte = NULL;
 		}
 		return (mpte);
 	}
 
 	/*
 	 * Increment counters
 	 */
 	pmap->pm_stats.resident_count++;
 
 	pa = VM_PAGE_TO_PHYS(m);
 
 	/*
 	 * Now validate mapping with RO protection
 	 */
 	*pte = TLBLO_PA_TO_PFN(pa) | PTE_V;
 
 	if (is_cacheable_mem(pa))
 		*pte |= PTE_C_CACHE;
 	else
 		*pte |= PTE_C_UNCACHED;
 
 	if (is_kernel_pmap(pmap))
 		*pte |= PTE_G;
 	else {
 		*pte |= PTE_RO;
 		/*
 		 * Sync I & D caches.  Do this only if the target pmap
 		 * belongs to the current process.  Otherwise, an
 		 * unresolvable TLB miss may occur. */
 		if (pmap == &curproc->p_vmspace->vm_pmap) {
 			va &= ~PAGE_MASK;
 			mips_icache_sync_range(va, PAGE_SIZE);
 			mips_dcache_wbinv_range(va, PAGE_SIZE);
 		}
 	}
 	return (mpte);
 }
 
 /*
  * Make a temporary mapping for a physical address.  This is only intended
  * to be used for panic dumps.
  *
  * Use XKPHYS for 64 bit, and KSEG0 where possible for 32 bit.
  */
 void *
 pmap_kenter_temporary(vm_paddr_t pa, int i)
 {
 	vm_offset_t va;
 
 	if (i != 0)
 		printf("%s: ERROR!!! More than one page of virtual address mapping not supported\n",
 		    __func__);
 
 	if (MIPS_DIRECT_MAPPABLE(pa)) {
 		va = MIPS_PHYS_TO_DIRECT(pa);
 	} else {
 #ifndef __mips_n64    /* XXX : to be converted to new style */
 		int cpu;
 		register_t intr;
 		struct local_sysmaps *sysm;
 		pt_entry_t *pte, npte;
 
 		/* If this is used other than for dumps, we may need to leave
 		 * interrupts disabled on return. If crash dumps don't work when
 		 * we get to this point, we might want to consider this (leaving things
 		 * disabled as a starting point ;-)
 	 	 */
 		intr = intr_disable();
 		cpu = PCPU_GET(cpuid);
 		sysm = &sysmap_lmem[cpu];
 		/* Since this is for the debugger, no locks or any other fun */
 		npte = TLBLO_PA_TO_PFN(pa) | PTE_D | PTE_V | PTE_G | PTE_W | PTE_C_CACHE;
 		pte = pmap_pte(kernel_pmap, sysm->base);
 		*pte = npte;
 		sysm->valid1 = 1;
 		pmap_update_page(kernel_pmap, sysm->base, npte);
 		va = sysm->base;
 		intr_restore(intr);
 #endif
 	}
 	return ((void *)va);
 }
 
 void
 pmap_kenter_temporary_free(vm_paddr_t pa)
 {
 #ifndef __mips_n64    /* XXX : to be converted to new style */
 	int cpu;
 	register_t intr;
 	struct local_sysmaps *sysm;
 #endif
 
 	if (MIPS_DIRECT_MAPPABLE(pa)) {
 		/* nothing to do for this case */
 		return;
 	}
 #ifndef __mips_n64    /* XXX : to be converted to new style */
 	cpu = PCPU_GET(cpuid);
 	sysm = &sysmap_lmem[cpu];
 	if (sysm->valid1) {
 		pt_entry_t *pte;
 
 		intr = intr_disable();
 		pte = pmap_pte(kernel_pmap, sysm->base);
 		*pte = PTE_G;
 		pmap_invalidate_page(kernel_pmap, sysm->base);
 		intr_restore(intr);
 		sysm->valid1 = 0;
 	}
 #endif
 }
 
 /*
  * Moved the code to Machine Independent
  *	 vm_map_pmap_enter()
  */
 
 /*
  * Maps a sequence of resident pages belonging to the same object.
  * The sequence begins with the given page m_start.  This page is
  * mapped at the given virtual address start.  Each subsequent page is
  * mapped at a virtual address that is offset from start by the same
  * amount as the page is offset from m_start within the object.  The
  * last page in the sequence is the page with the largest offset from
  * m_start that can be mapped at a virtual address less than the given
  * virtual address end.  Not every virtual page between start and end
  * is mapped; only those for which a resident page exists with the
  * corresponding offset from m_start are mapped.
  */
 void
 pmap_enter_object(pmap_t pmap, vm_offset_t start, vm_offset_t end,
     vm_page_t m_start, vm_prot_t prot)
 {
 	vm_page_t m, mpte;
 	vm_pindex_t diff, psize;
 
 	VM_OBJECT_LOCK_ASSERT(m_start->object, MA_OWNED);
 	psize = atop(end - start);
 	mpte = NULL;
 	m = m_start;
 	vm_page_lock_queues();
 	PMAP_LOCK(pmap);
 	while (m != NULL && (diff = m->pindex - m_start->pindex) < psize) {
 		mpte = pmap_enter_quick_locked(pmap, start + ptoa(diff), m,
 		    prot, mpte);
 		m = TAILQ_NEXT(m, listq);
 	}
 	vm_page_unlock_queues();
  	PMAP_UNLOCK(pmap);
 }
 
 /*
  * pmap_object_init_pt preloads the ptes for a given object
  * into the specified pmap.  This eliminates the blast of soft
  * faults on process startup and immediately after an mmap.
  */
 void
 pmap_object_init_pt(pmap_t pmap, vm_offset_t addr,
     vm_object_t object, vm_pindex_t pindex, vm_size_t size)
 {
 	VM_OBJECT_LOCK_ASSERT(object, MA_OWNED);
 	KASSERT(object->type == OBJT_DEVICE || object->type == OBJT_SG,
 	    ("pmap_object_init_pt: non-device object"));
 }
 
 /*
  *	Routine:	pmap_change_wiring
  *	Function:	Change the wiring attribute for a map/virtual-address
  *			pair.
  *	In/out conditions:
  *			The mapping must already exist in the pmap.
  */
 void
 pmap_change_wiring(pmap_t pmap, vm_offset_t va, boolean_t wired)
 {
 	pt_entry_t *pte;
 
 	if (pmap == NULL)
 		return;
 
 	PMAP_LOCK(pmap);
 	pte = pmap_pte(pmap, va);
 
 	if (wired && !pte_test(pte, PTE_W))
 		pmap->pm_stats.wired_count++;
 	else if (!wired && pte_test(pte, PTE_W))
 		pmap->pm_stats.wired_count--;
 
 	/*
 	 * Wiring is not a hardware characteristic so there is no need to
 	 * invalidate TLB.
 	 */
 	if (wired)
 		pte_set(pte, PTE_W);
 	else
 		pte_clear(pte, PTE_W);
 	PMAP_UNLOCK(pmap);
 }
 
 /*
  *	Copy the range specified by src_addr/len
  *	from the source map to the range dst_addr/len
  *	in the destination map.
  *
  *	This routine is only advisory and need not do anything.
  */
 
 void
 pmap_copy(pmap_t dst_pmap, pmap_t src_pmap, vm_offset_t dst_addr,
     vm_size_t len, vm_offset_t src_addr)
 {
 }
 
 /*
  *	pmap_zero_page zeros the specified hardware page by mapping
  *	the page into KVM and using bzero to clear its contents.
  *
  * 	Use XKPHYS for 64 bit, and KSEG0 where possible for 32 bit.
  */
 void
 pmap_zero_page(vm_page_t m)
 {
 	vm_offset_t va;
 	vm_paddr_t phys = VM_PAGE_TO_PHYS(m);
 
 	if (MIPS_DIRECT_MAPPABLE(phys)) {
 		va = MIPS_PHYS_TO_DIRECT(phys);
 		bzero((caddr_t)va, PAGE_SIZE);
 		mips_dcache_wbinv_range(va, PAGE_SIZE);
 	} else {
 		va = pmap_lmem_map1(phys);
 		bzero((caddr_t)va, PAGE_SIZE);
 		mips_dcache_wbinv_range(va, PAGE_SIZE);
 		pmap_lmem_unmap();
 	}
 }
 
 /*
  *	pmap_zero_page_area zeros part of the specified hardware page by
  *	mapping the page into KVM and using bzero to clear its contents.
  *
  *	off and size may not cover an area beyond a single hardware page.
  */
 void
 pmap_zero_page_area(vm_page_t m, int off, int size)
 {
 	vm_offset_t va;
 	vm_paddr_t phys = VM_PAGE_TO_PHYS(m);
 
 	if (MIPS_DIRECT_MAPPABLE(phys)) {
 		va = MIPS_PHYS_TO_DIRECT(phys);
 		bzero((char *)(caddr_t)va + off, size);
 		mips_dcache_wbinv_range(va + off, size);
 	} else {
 		va = pmap_lmem_map1(phys);
 		bzero((char *)va + off, size);
 		mips_dcache_wbinv_range(va + off, size);
 		pmap_lmem_unmap();
 	}
 }
 
 void
 pmap_zero_page_idle(vm_page_t m)
 {
 	vm_offset_t va;
 	vm_paddr_t phys = VM_PAGE_TO_PHYS(m);
 
 	if (MIPS_DIRECT_MAPPABLE(phys)) {
 		va = MIPS_PHYS_TO_DIRECT(phys);
 		bzero((caddr_t)va, PAGE_SIZE);
 		mips_dcache_wbinv_range(va, PAGE_SIZE);
 	} else {
 		va = pmap_lmem_map1(phys);
 		bzero((caddr_t)va, PAGE_SIZE);
 		mips_dcache_wbinv_range(va, PAGE_SIZE);
 		pmap_lmem_unmap();
 	}
 }
 
 /*
  *	pmap_copy_page copies the specified (machine independent)
  *	page by mapping the page into virtual memory and using
  *	bcopy to copy the page, one machine dependent page at a
  *	time.
  *
  * 	Use XKPHYS for 64 bit, and KSEG0 where possible for 32 bit.
  */
 void
 pmap_copy_page(vm_page_t src, vm_page_t dst)
 {
 	vm_offset_t va_src, va_dst;
 	vm_paddr_t phys_src = VM_PAGE_TO_PHYS(src);
 	vm_paddr_t phys_dst = VM_PAGE_TO_PHYS(dst);
 
 	if (MIPS_DIRECT_MAPPABLE(phys_src) && MIPS_DIRECT_MAPPABLE(phys_dst)) {
 		/* easy case, all can be accessed via KSEG0 */
 		/*
 		 * Flush all caches for VA that are mapped to this page
 		 * to make sure that data in SDRAM is up to date
 		 */
 		pmap_flush_pvcache(src);
 		mips_dcache_wbinv_range_index(
 		    MIPS_PHYS_TO_DIRECT(phys_dst), PAGE_SIZE);
 		va_src = MIPS_PHYS_TO_DIRECT(phys_src);
 		va_dst = MIPS_PHYS_TO_DIRECT(phys_dst);
 		bcopy((caddr_t)va_src, (caddr_t)va_dst, PAGE_SIZE);
 		mips_dcache_wbinv_range(va_dst, PAGE_SIZE);
 	} else {
 		va_src = pmap_lmem_map2(phys_src, phys_dst);
 		va_dst = va_src + PAGE_SIZE;
 		bcopy((void *)va_src, (void *)va_dst, PAGE_SIZE);
 		mips_dcache_wbinv_range(va_dst, PAGE_SIZE);
 		pmap_lmem_unmap();
 	}
 }
 
 /*
  * Returns true if the pmap's pv is one of the first
  * 16 pvs linked to from this page.  This count may
  * be changed upwards or downwards in the future; it
  * is only necessary that true be returned for a small
  * subset of pmaps for proper page aging.
  */
 boolean_t
 pmap_page_exists_quick(pmap_t pmap, vm_page_t m)
 {
 	pv_entry_t pv;
 	int loops = 0;
 	boolean_t rv;
 
 	KASSERT((m->flags & (PG_FICTITIOUS | PG_UNMANAGED)) == 0,
 	    ("pmap_page_exists_quick: page %p is not managed", m));
 	rv = FALSE;
 	vm_page_lock_queues();
 	TAILQ_FOREACH(pv, &m->md.pv_list, pv_list) {
 		if (pv->pv_pmap == pmap) {
 			rv = TRUE;
 			break;
 		}
 		loops++;
 		if (loops >= 16)
 			break;
 	}
 	vm_page_unlock_queues();
 	return (rv);
 }
 
 /*
  * Remove all pages from specified address space
  * this aids process exit speeds.  Also, this code
  * is special cased for current process only, but
  * can have the more generic (and slightly slower)
  * mode enabled.  This is much faster than pmap_remove
  * in the case of running down an entire address space.
  */
 void
 pmap_remove_pages(pmap_t pmap)
 {
 	pt_entry_t *pte, tpte;
 	pv_entry_t pv, npv;
 	vm_page_t m;
 
 	if (pmap != vmspace_pmap(curthread->td_proc->p_vmspace)) {
 		printf("warning: pmap_remove_pages called with non-current pmap\n");
 		return;
 	}
 	vm_page_lock_queues();
 	PMAP_LOCK(pmap);
 	for (pv = TAILQ_FIRST(&pmap->pm_pvlist); pv != NULL; pv = npv) {
 
 		pte = pmap_pte(pv->pv_pmap, pv->pv_va);
 		if (!pte_test(pte, PTE_V))
 			panic("pmap_remove_pages: page on pm_pvlist has no pte");
 		tpte = *pte;
 
 		/*
 		 * We cannot remove wired pages from a process' mapping at this time
 		 */
 		if (pte_test(&tpte, PTE_W)) {
 			npv = TAILQ_NEXT(pv, pv_plist);
 			continue;
 		}
 		*pte = is_kernel_pmap(pmap) ? PTE_G : 0;
 
 		m = PHYS_TO_VM_PAGE(TLBLO_PTE_TO_PA(tpte));
 		KASSERT(m != NULL,
 		    ("pmap_remove_pages: bad tpte %#jx", (uintmax_t)tpte));
 
 		pv->pv_pmap->pm_stats.resident_count--;
 
 		/*
 		 * Update the vm_page_t clean and reference bits.
 		 */
 		if (pte_test(&tpte, PTE_D)) {
 			vm_page_dirty(m);
 		}
 		npv = TAILQ_NEXT(pv, pv_plist);
 		TAILQ_REMOVE(&pv->pv_pmap->pm_pvlist, pv, pv_plist);
 
 		m->md.pv_list_count--;
 		TAILQ_REMOVE(&m->md.pv_list, pv, pv_list);
 		if (TAILQ_FIRST(&m->md.pv_list) == NULL) {
 			vm_page_flag_clear(m, PG_WRITEABLE);
 		}
 		pmap_unuse_pt(pv->pv_pmap, pv->pv_va, pv->pv_ptem);
 		free_pv_entry(pv);
 	}
 	pmap_invalidate_all(pmap);
 	PMAP_UNLOCK(pmap);
 	vm_page_unlock_queues();
 }
 
 /*
  * pmap_testbit tests bits in pte's
  * note that the testbit/changebit routines are inline,
  * and a lot of things compile-time evaluate.
  */
 static boolean_t
 pmap_testbit(vm_page_t m, int bit)
 {
 	pv_entry_t pv;
 	pt_entry_t *pte;
 	boolean_t rv = FALSE;
 
 	if (m->flags & PG_FICTITIOUS)
 		return (rv);
 
 	if (TAILQ_FIRST(&m->md.pv_list) == NULL)
 		return (rv);
 
 	mtx_assert(&vm_page_queue_mtx, MA_OWNED);
 	TAILQ_FOREACH(pv, &m->md.pv_list, pv_list) {
 		PMAP_LOCK(pv->pv_pmap);
 		pte = pmap_pte(pv->pv_pmap, pv->pv_va);
 		rv = pte_test(pte, bit);
 		PMAP_UNLOCK(pv->pv_pmap);
 		if (rv)
 			break;
 	}
 	return (rv);
 }
 
 /*
  * this routine is used to set or clear the given bit in ptes
  */
 static __inline void
 pmap_changebit(vm_page_t m, int bit, boolean_t setem)
 {
 	pv_entry_t pv;
 	pt_entry_t *pte;
 
 	if (m->flags & PG_FICTITIOUS)
 		return;
 
 	mtx_assert(&vm_page_queue_mtx, MA_OWNED);
 	/*
 	 * Loop over all current mappings, setting/clearing as appropriate.
 	 * If setting RO, do we need to clear the VAC?
 	 */
 	TAILQ_FOREACH(pv, &m->md.pv_list, pv_list) {
 		PMAP_LOCK(pv->pv_pmap);
 		pte = pmap_pte(pv->pv_pmap, pv->pv_va);
 		if (setem) {
 			*pte |= bit;
 			pmap_update_page(pv->pv_pmap, pv->pv_va, *pte);
 		} else {
 			pt_entry_t pbits = *pte;
 
 			if (pbits & bit) {
 				if (bit == PTE_D) {
 					if (pbits & PTE_D)
 						vm_page_dirty(m);
 					*pte = (pbits & ~PTE_D) | PTE_RO;
 				} else {
 					*pte = pbits & ~bit;
 				}
 				pmap_update_page(pv->pv_pmap, pv->pv_va, *pte);
 			}
 		}
 		PMAP_UNLOCK(pv->pv_pmap);
 	}
 	if (!setem && bit == PTE_D)
 		vm_page_flag_clear(m, PG_WRITEABLE);
 }
 
 /*
  *	pmap_page_wired_mappings:
  *
  *	Return the number of managed mappings to the given physical page
  *	that are wired.
  */
 int
 pmap_page_wired_mappings(vm_page_t m)
 {
 	pv_entry_t pv;
 	pmap_t pmap;
 	pt_entry_t *pte;
 	int count;
 
 	count = 0;
 	if ((m->flags & PG_FICTITIOUS) != 0)
 		return (count);
 	vm_page_lock_queues();
 	TAILQ_FOREACH(pv, &m->md.pv_list, pv_list) {
 		pmap = pv->pv_pmap;
 		PMAP_LOCK(pmap);
 		pte = pmap_pte(pmap, pv->pv_va);
 		if (pte_test(pte, PTE_W))
 			count++;
 		PMAP_UNLOCK(pmap);
 	}
 	vm_page_unlock_queues();
 	return (count);
 }
 
 /*
  * Clear the write and modified bits in each of the given page's mappings.
  */
 void
 pmap_remove_write(vm_page_t m)
 {
 	pv_entry_t pv, npv;
 	vm_offset_t va;
 	pt_entry_t *pte;
 
 	KASSERT((m->flags & (PG_FICTITIOUS | PG_UNMANAGED)) == 0,
 	    ("pmap_remove_write: page %p is not managed", m));
 
 	/*
 	 * If the page is not VPO_BUSY, then PG_WRITEABLE cannot be set by
 	 * another thread while the object is locked.  Thus, if PG_WRITEABLE
 	 * is clear, no page table entries need updating.
 	 */
 	VM_OBJECT_LOCK_ASSERT(m->object, MA_OWNED);
 	if ((m->oflags & VPO_BUSY) == 0 &&
 	    (m->flags & PG_WRITEABLE) == 0)
 		return;
 
 	/*
 	 * Loop over all current mappings, setting/clearing as appropriate.
 	 */
 	vm_page_lock_queues();
 	for (pv = TAILQ_FIRST(&m->md.pv_list); pv; pv = npv) {
 		npv = TAILQ_NEXT(pv, pv_list);
 		pte = pmap_pte(pv->pv_pmap, pv->pv_va);
 		if (pte == NULL || !pte_test(pte, PTE_V))
 			panic("page on pm_pvlist has no pte");
 
 		va = pv->pv_va;
 		pmap_protect(pv->pv_pmap, va, va + PAGE_SIZE,
 		    VM_PROT_READ | VM_PROT_EXECUTE);
 	}
 	vm_page_flag_clear(m, PG_WRITEABLE);
 	vm_page_unlock_queues();
 }
 
 /*
  *	pmap_ts_referenced:
  *
  *	Return the count of reference bits for a page, clearing all of them.
  */
 int
 pmap_ts_referenced(vm_page_t m)
 {
 
 	KASSERT((m->flags & (PG_FICTITIOUS | PG_UNMANAGED)) == 0,
 	    ("pmap_ts_referenced: page %p is not managed", m));
 	if (m->md.pv_flags & PV_TABLE_REF) {
 		vm_page_lock_queues();
 		m->md.pv_flags &= ~PV_TABLE_REF;
 		vm_page_unlock_queues();
 		return (1);
 	}
 	return (0);
 }
 
 /*
  *	pmap_is_modified:
  *
  *	Return whether or not the specified physical page was modified
  *	in any physical maps.
  */
 boolean_t
 pmap_is_modified(vm_page_t m)
 {
 	boolean_t rv;
 
 	KASSERT((m->flags & (PG_FICTITIOUS | PG_UNMANAGED)) == 0,
 	    ("pmap_is_modified: page %p is not managed", m));
 
 	/*
 	 * If the page is not VPO_BUSY, then PG_WRITEABLE cannot be
 	 * concurrently set while the object is locked.  Thus, if PG_WRITEABLE
 	 * is clear, no PTEs can have PTE_D set.
 	 */
 	VM_OBJECT_LOCK_ASSERT(m->object, MA_OWNED);
 	if ((m->oflags & VPO_BUSY) == 0 &&
 	    (m->flags & PG_WRITEABLE) == 0)
 		return (FALSE);
 	vm_page_lock_queues();
 	if (m->md.pv_flags & PV_TABLE_MOD)
 		rv = TRUE;
 	else
 		rv = pmap_testbit(m, PTE_D);
 	vm_page_unlock_queues();
 	return (rv);
 }
 
 /* N/C */
 
 /*
  *	pmap_is_prefaultable:
  *
  *	Return whether or not the specified virtual address is eligible
  *	for prefault.
  */
 boolean_t
 pmap_is_prefaultable(pmap_t pmap, vm_offset_t addr)
 {
 	pd_entry_t *pde;
 	pt_entry_t *pte;
 	boolean_t rv;
 
 	rv = FALSE;
 	PMAP_LOCK(pmap);
 	pde = pmap_pde(pmap, addr);
 	if (pde != NULL && *pde != 0) {
 		pte = pmap_pde_to_pte(pde, addr);
 		rv = (*pte == 0);
 	}
 	PMAP_UNLOCK(pmap);
 	return (rv);
 }
 
 /*
  *	Clear the modify bits on the specified physical page.
  */
 void
 pmap_clear_modify(vm_page_t m)
 {
 
 	KASSERT((m->flags & (PG_FICTITIOUS | PG_UNMANAGED)) == 0,
 	    ("pmap_clear_modify: page %p is not managed", m));
 	VM_OBJECT_LOCK_ASSERT(m->object, MA_OWNED);
 	KASSERT((m->oflags & VPO_BUSY) == 0,
 	    ("pmap_clear_modify: page %p is busy", m));
 
 	/*
 	 * If the page is not PG_WRITEABLE, then no PTEs can have PTE_D set.
 	 * If the object containing the page is locked and the page is not
 	 * VPO_BUSY, then PG_WRITEABLE cannot be concurrently set.
 	 */
 	if ((m->flags & PG_WRITEABLE) == 0)
 		return;
 	vm_page_lock_queues();
 	if (m->md.pv_flags & PV_TABLE_MOD) {
 		pmap_changebit(m, PTE_D, FALSE);
 		m->md.pv_flags &= ~PV_TABLE_MOD;
 	}
 	vm_page_unlock_queues();
 }
 
 /*
  *	pmap_is_referenced:
  *
  *	Return whether or not the specified physical page was referenced
  *	in any physical maps.
  */
 boolean_t
 pmap_is_referenced(vm_page_t m)
 {
 
 	KASSERT((m->flags & (PG_FICTITIOUS | PG_UNMANAGED)) == 0,
 	    ("pmap_is_referenced: page %p is not managed", m));
 	return ((m->md.pv_flags & PV_TABLE_REF) != 0);
 }
 
 /*
  *	pmap_clear_reference:
  *
  *	Clear the reference bit on the specified physical page.
  */
 void
 pmap_clear_reference(vm_page_t m)
 {
 
 	KASSERT((m->flags & (PG_FICTITIOUS | PG_UNMANAGED)) == 0,
 	    ("pmap_clear_reference: page %p is not managed", m));
 	vm_page_lock_queues();
 	if (m->md.pv_flags & PV_TABLE_REF) {
 		m->md.pv_flags &= ~PV_TABLE_REF;
 	}
 	vm_page_unlock_queues();
 }
 
 /*
  * Miscellaneous support routines follow
  */
 
 /*
  * Map a set of physical memory pages into the kernel virtual
  * address space. Return a pointer to where it is mapped. This
  * routine is intended to be used for mapping device memory,
  * NOT real memory.
  */
 
 /*
  * Map a set of physical memory pages into the kernel virtual
  * address space. Return a pointer to where it is mapped. This
  * routine is intended to be used for mapping device memory,
  * NOT real memory.
  *
  * Use XKPHYS uncached for 64 bit, and KSEG1 where possible for 32 bit.
  */
 void *
 pmap_mapdev(vm_paddr_t pa, vm_size_t size)
 {
         vm_offset_t va, tmpva, offset;
 
 	/*
 	 * KSEG1 maps only the first 512M of the physical address space.  For
 	 * pa > 0x20000000 we should make a proper mapping using pmap_kenter.
 	 */
 	if (MIPS_DIRECT_MAPPABLE(pa + size - 1))
 		return ((void *)MIPS_PHYS_TO_DIRECT_UNCACHED(pa));
 	else {
 		offset = pa & PAGE_MASK;
 		size = roundup(size + offset, PAGE_SIZE);
         
 		va = kmem_alloc_nofault(kernel_map, size);
 		if (!va)
 			panic("pmap_mapdev: Couldn't alloc kernel virtual memory");
 		pa = trunc_page(pa);
 		for (tmpva = va; size > 0;) {
 			pmap_kenter_attr(tmpva, pa, PTE_C_UNCACHED);
 			size -= PAGE_SIZE;
 			tmpva += PAGE_SIZE;
 			pa += PAGE_SIZE;
 		}
 	}
 
 	return ((void *)(va + offset));
 }
 
 void
 pmap_unmapdev(vm_offset_t va, vm_size_t size)
 {
 #ifndef __mips_n64
 	vm_offset_t base, offset, tmpva;
 
 	/* If the address is within KSEG1 then there is nothing to do */
 	if (va >= MIPS_KSEG1_START && va <= MIPS_KSEG1_END)
 		return;
 
 	base = trunc_page(va);
 	offset = va & PAGE_MASK;
 	size = roundup(size + offset, PAGE_SIZE);
 	for (tmpva = base; tmpva < base + size; tmpva += PAGE_SIZE)
 		pmap_kremove(tmpva);
 	kmem_free(kernel_map, base, size);
 #endif
 }
 
 /*
  * perform the pmap work for mincore
  */
 int
 pmap_mincore(pmap_t pmap, vm_offset_t addr, vm_paddr_t *locked_pa)
 {
 	pt_entry_t *ptep, pte;
 	vm_paddr_t pa;
 	vm_page_t m;
 	int val;
 	boolean_t managed;
 
 	PMAP_LOCK(pmap);
 retry:
 	ptep = pmap_pte(pmap, addr);
 	pte = (ptep != NULL) ? *ptep : 0;
 	if (!pte_test(&pte, PTE_V)) {
 		val = 0;
 		goto out;
 	}
 	val = MINCORE_INCORE;
 	if (pte_test(&pte, PTE_D))
 		val |= MINCORE_MODIFIED | MINCORE_MODIFIED_OTHER;
 	pa = TLBLO_PTE_TO_PA(pte);
 	managed = page_is_managed(pa);
 	if (managed) {
 		/*
 		 * This may falsely report the given address as
 		 * MINCORE_REFERENCED.  Unfortunately, due to the lack of
 		 * per-PTE reference information, it is impossible to
 		 * determine if the address is MINCORE_REFERENCED.  
 		 */
 		m = PHYS_TO_VM_PAGE(pa);
 		if ((m->flags & PG_REFERENCED) != 0)
 			val |= MINCORE_REFERENCED | MINCORE_REFERENCED_OTHER;
 	}
 	if ((val & (MINCORE_MODIFIED_OTHER | MINCORE_REFERENCED_OTHER)) !=
 	    (MINCORE_MODIFIED_OTHER | MINCORE_REFERENCED_OTHER) && managed) {
 		/* Ensure that "PHYS_TO_VM_PAGE(pa)->object" doesn't change. */
 		if (vm_page_pa_tryrelock(pmap, pa, locked_pa))
 			goto retry;
 	} else
 out:
 		PA_UNLOCK_COND(*locked_pa);
 	PMAP_UNLOCK(pmap);
 	return (val);
 }
 
 void
 pmap_activate(struct thread *td)
 {
 	pmap_t pmap, oldpmap;
 	struct proc *p = td->td_proc;
 
 	critical_enter();
 
 	pmap = vmspace_pmap(p->p_vmspace);
 	oldpmap = PCPU_GET(curpmap);
 
 	if (oldpmap)
-		atomic_clear_32(&oldpmap->pm_active, PCPU_GET(cpumask));
-	atomic_set_32(&pmap->pm_active, PCPU_GET(cpumask));
+		CPU_NAND_ATOMIC(&oldpmap->pm_active, PCPU_PTR(cpumask));
+	CPU_OR_ATOMIC(&pmap->pm_active, PCPU_PTR(cpumask));
 	pmap_asid_alloc(pmap);
 	if (td == curthread) {
 		PCPU_SET(segbase, pmap->pm_segtab);
 		mips_wr_entryhi(pmap->pm_asid[PCPU_GET(cpuid)].asid);
 	}
 
 	PCPU_SET(curpmap, pmap);
 	critical_exit();
 }
 
 void
 pmap_sync_icache(pmap_t pm, vm_offset_t va, vm_size_t sz)
 {
 }
 
 /*
  *	Increase the starting virtual address of the given mapping if a
  *	different alignment might result in more superpage mappings.
  */
 void
 pmap_align_superpage(vm_object_t object, vm_ooffset_t offset,
     vm_offset_t *addr, vm_size_t size)
 {
 	vm_offset_t superpage_offset;
 
 	if (size < NBSEG)
 		return;
 	if (object != NULL && (object->flags & OBJ_COLORED) != 0)
 		offset += ptoa(object->pg_color);
 	superpage_offset = offset & SEGMASK;
 	if (size - ((NBSEG - superpage_offset) & SEGMASK) < NBSEG ||
 	    (*addr & SEGMASK) == superpage_offset)
 		return;
 	if ((*addr & SEGMASK) < superpage_offset)
 		*addr = (*addr & ~SEGMASK) + superpage_offset;
 	else
 		*addr = ((*addr + SEGMASK) & ~SEGMASK) + superpage_offset;
 }
 
 /*
  * 	Increase the starting virtual address of the given mapping so
  * 	that it is aligned to not be the second page in a TLB entry.
  * 	This routine assumes that the length is appropriately-sized so
  * 	that the allocation does not share a TLB entry at all if required.
  */
 void
 pmap_align_tlb(vm_offset_t *addr)
 {
 	if ((*addr & PAGE_SIZE) == 0)
 		return;
 	*addr += PAGE_SIZE;
 	return;
 }
 
 #ifdef DDB
 DB_SHOW_COMMAND(ptable, ddb_pid_dump)
 {
 	pmap_t pmap;
 	struct thread *td = NULL;
 	struct proc *p;
 	int i, j, k;
 	vm_paddr_t pa;
 	vm_offset_t va;
 
 	if (have_addr) {
 		td = db_lookup_thread(addr, TRUE);
 		if (td == NULL) {
 			db_printf("Invalid pid or tid");
 			return;
 		}
 		p = td->td_proc;
 		if (p->p_vmspace == NULL) {
 			db_printf("No vmspace for process");
 			return;
 		}
 			pmap = vmspace_pmap(p->p_vmspace);
 	} else
 		pmap = kernel_pmap;
 
 	db_printf("pmap:%p segtab:%p asid:%x generation:%x\n",
 	    pmap, pmap->pm_segtab, pmap->pm_asid[0].asid,
 	    pmap->pm_asid[0].gen);
 	for (i = 0; i < NPDEPG; i++) {
 		pd_entry_t *pdpe;
 		pt_entry_t *pde;
 		pt_entry_t pte;
 
 		pdpe = (pd_entry_t *)pmap->pm_segtab[i];
 		if (pdpe == NULL)
 			continue;
 		db_printf("[%4d] %p\n", i, pdpe);
 #ifdef __mips_n64
 		for (j = 0; j < NPDEPG; j++) {
 			pde = (pt_entry_t *)pdpe[j];
 			if (pde == NULL)
 				continue;
 			db_printf("\t[%4d] %p\n", j, pde);
 #else
 		{
 			j = 0;
 			pde =  (pt_entry_t *)pdpe;
 #endif
 			for (k = 0; k < NPTEPG; k++) {
 				pte = pde[k];
 				if (pte == 0 || !pte_test(&pte, PTE_V))
 					continue;
 				pa = TLBLO_PTE_TO_PA(pte);
 				va = ((u_long)i << SEGSHIFT) | (j << PDRSHIFT) | (k << PAGE_SHIFT);
 				db_printf("\t\t[%04d] va: %p pte: %8jx pa:%jx\n",
 				       k, (void *)va, (uintmax_t)pte, (uintmax_t)pa);
 			}
 		}
 	}
 }
 #endif
 
 #if defined(DEBUG)
 
 static void pads(pmap_t pm);
 void pmap_pvdump(vm_offset_t pa);
 
 /* print address space of pmap*/
 static void
 pads(pmap_t pm)
 {
 	unsigned va, i, j;
 	pt_entry_t *ptep;
 
 	if (pm == kernel_pmap)
 		return;
 	for (i = 0; i < NPTEPG; i++)
 		if (pm->pm_segtab[i])
 			for (j = 0; j < NPTEPG; j++) {
 				va = (i << SEGSHIFT) + (j << PAGE_SHIFT);
 				if (pm == kernel_pmap && va < KERNBASE)
 					continue;
 				if (pm != kernel_pmap &&
 				    va >= VM_MAXUSER_ADDRESS)
 					continue;
 				ptep = pmap_pte(pm, va);
 				if (pte_test(ptep, PTE_V))
 					printf("%x:%x ", va, *(int *)ptep);
 			}
 
 }
 
 void
 pmap_pvdump(vm_offset_t pa)
 {
 	register pv_entry_t pv;
 	vm_page_t m;
 
 	printf("pa %x", pa);
 	m = PHYS_TO_VM_PAGE(pa);
 	for (pv = TAILQ_FIRST(&m->md.pv_list); pv;
 	    pv = TAILQ_NEXT(pv, pv_list)) {
 		printf(" -> pmap %p, va %x", (void *)pv->pv_pmap, pv->pv_va);
 		pads(pv->pv_pmap);
 	}
 	printf(" ");
 }
 
 /* N/C */
 #endif
 
 
 /*
  * Allocate TLB address space tag (called ASID or TLBPID) and return it.
  * It takes almost as much or more time to search the TLB for a
  * specific ASID and flush those entries as it does to flush the entire TLB.
  * Therefore, when we allocate a new ASID, we just take the next number. When
  * we run out of numbers, we flush the TLB, increment the generation count
  * and start over. ASID zero is reserved for kernel use.
  */
 static void
 pmap_asid_alloc(pmap)
 	pmap_t pmap;
 {
 	if (pmap->pm_asid[PCPU_GET(cpuid)].asid != PMAP_ASID_RESERVED &&
 	    pmap->pm_asid[PCPU_GET(cpuid)].gen == PCPU_GET(asid_generation));
 	else {
 		if (PCPU_GET(next_asid) == pmap_max_asid) {
 			tlb_invalidate_all_user(NULL);
 			PCPU_SET(asid_generation,
 			    (PCPU_GET(asid_generation) + 1) & ASIDGEN_MASK);
 			if (PCPU_GET(asid_generation) == 0) {
 				PCPU_SET(asid_generation, 1);
 			}
 			PCPU_SET(next_asid, 1);	/* 0 means invalid */
 		}
 		pmap->pm_asid[PCPU_GET(cpuid)].asid = PCPU_GET(next_asid);
 		pmap->pm_asid[PCPU_GET(cpuid)].gen = PCPU_GET(asid_generation);
 		PCPU_SET(next_asid, PCPU_GET(next_asid) + 1);
 	}
 }
 
 int
 page_is_managed(vm_paddr_t pa)
 {
 	vm_offset_t pgnum = atop(pa);
 
 	if (pgnum >= first_page) {
 		vm_page_t m;
 
 		m = PHYS_TO_VM_PAGE(pa);
 		if (m == NULL)
 			return (0);
 		if ((m->flags & (PG_FICTITIOUS | PG_UNMANAGED)) == 0)
 			return (1);
 	}
 	return (0);
 }
 
 static pt_entry_t
 init_pte_prot(vm_offset_t va, vm_page_t m, vm_prot_t prot)
 {
 	pt_entry_t rw;
 
 	if (!(prot & VM_PROT_WRITE))
 		rw =  PTE_V | PTE_RO | PTE_C_CACHE;
 	else if ((m->flags & (PG_FICTITIOUS | PG_UNMANAGED)) == 0) {
 		if ((m->md.pv_flags & PV_TABLE_MOD) != 0)
 			rw =  PTE_V | PTE_D | PTE_C_CACHE;
 		else
 			rw = PTE_V | PTE_C_CACHE;
 		vm_page_flag_set(m, PG_WRITEABLE);
 	} else
 		/* Needn't emulate a modified bit for unmanaged pages. */
 		rw =  PTE_V | PTE_D | PTE_C_CACHE;
 	return (rw);
 }
 
 /*
  * pmap_emulate_modified : do dirty bit emulation
  *
  * On SMP, update just the local TLB, other CPUs will update their
  * TLBs from PTE lazily, if they get the exception.
  * Returns 0 in case of success, 1 if the page is read only and we
  * need to fault.
  */
 int
 pmap_emulate_modified(pmap_t pmap, vm_offset_t va)
 {
 	vm_page_t m;
 	pt_entry_t *pte;
  	vm_paddr_t pa;
 
 	PMAP_LOCK(pmap);
 	pte = pmap_pte(pmap, va);
 	if (pte == NULL)
 		panic("pmap_emulate_modified: can't find PTE");
 #ifdef SMP
 	/* It is possible that some other CPU changed m-bit */
 	if (!pte_test(pte, PTE_V) || pte_test(pte, PTE_D)) {
 		pmap_update_page_local(pmap, va, *pte);
 		PMAP_UNLOCK(pmap);
 		return (0);
 	}
 #else
 	if (!pte_test(pte, PTE_V) || pte_test(pte, PTE_D))
 		panic("pmap_emulate_modified: invalid pte");
 #endif
 	if (pte_test(pte, PTE_RO)) {
 		/* write to read only page in the kernel */
 		PMAP_UNLOCK(pmap);
 		return (1);
 	}
 	pte_set(pte, PTE_D);
 	pmap_update_page_local(pmap, va, *pte);
 	pa = TLBLO_PTE_TO_PA(*pte);
 	if (!page_is_managed(pa))
 		panic("pmap_emulate_modified: unmanaged page");
 	m = PHYS_TO_VM_PAGE(pa);
 	m->md.pv_flags |= (PV_TABLE_REF | PV_TABLE_MOD);
 	PMAP_UNLOCK(pmap);
 	return (0);
 }
 
 /*
  *	Routine:	pmap_kextract
  *	Function:
  *		Extract the physical page address associated with the
  *		given virtual address.
  */
  /* PMAP_INLINE */ vm_offset_t
 pmap_kextract(vm_offset_t va)
 {
 	int mapped;
 
 	/*
 	 * First, the direct-mapped regions.
 	 */
 #if defined(__mips_n64)
 	if (va >= MIPS_XKPHYS_START && va < MIPS_XKPHYS_END)
 		return (MIPS_XKPHYS_TO_PHYS(va));
 #endif
 	if (va >= MIPS_KSEG0_START && va < MIPS_KSEG0_END)
 		return (MIPS_KSEG0_TO_PHYS(va));
 
 	if (va >= MIPS_KSEG1_START && va < MIPS_KSEG1_END)
 		return (MIPS_KSEG1_TO_PHYS(va));
 
 	/*
 	 * User virtual addresses.
 	 */
 	if (va < VM_MAXUSER_ADDRESS) {
 		pt_entry_t *ptep;
 
 		if (curproc && curproc->p_vmspace) {
 			ptep = pmap_pte(&curproc->p_vmspace->vm_pmap, va);
 			if (ptep) {
 				return (TLBLO_PTE_TO_PA(*ptep) |
 				    (va & PAGE_MASK));
 			}
 			return (0);
 		}
 	}
 
 	/*
 	 * Should be kernel virtual here, otherwise fail
 	 */
 	mapped = (va >= MIPS_KSEG2_START && va < MIPS_KSEG2_END);
 #if defined(__mips_n64)
 	mapped = mapped || (va >= MIPS_XKSEG_START && va < MIPS_XKSEG_END);
 #endif 
 	/*
 	 * Kernel virtual.
 	 */
 
 	if (mapped) {
 		pt_entry_t *ptep;
 
 		/* Is the kernel pmap initialized? */
-		if (kernel_pmap->pm_active) {
+		if (!CPU_EMPTY(&kernel_pmap->pm_active)) {
 			/* It's inside the virtual address range */
 			ptep = pmap_pte(kernel_pmap, va);
 			if (ptep) {
 				return (TLBLO_PTE_TO_PA(*ptep) |
 				    (va & PAGE_MASK));
 			}
 		}
 		return (0);
 	}
 
 	panic("%s for unknown address space %p.", __func__, (void *)va);
 }
 
 
 void 
 pmap_flush_pvcache(vm_page_t m)
 {
 	pv_entry_t pv;
 
 	if (m != NULL) {
 		for (pv = TAILQ_FIRST(&m->md.pv_list); pv;
 		    pv = TAILQ_NEXT(pv, pv_list)) {
 			mips_dcache_wbinv_range_index(pv->pv_va, PAGE_SIZE);
 		}
 	}
 }
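Note on the pmap.c hunks above: pm_active changes from a 32-bit word updated with atomic_clear_32()/atomic_set_32() to a cpuset_t updated with CPU_NAND_ATOMIC()/CPU_OR_ATOMIC() and tested with CPU_EMPTY(). The following is only a minimal userland sketch of that idea, assuming a hypothetical multi-word set type (toy_cpuset) and hypothetical helpers (toy_init, toy_set, toy_or_atomic, toy_nand_atomic, toy_empty) built on C11 atomics. It is not the FreeBSD implementation of the cpuset macros.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for cpuset_t; NOT the FreeBSD definition. */
#define TOY_MAXCPU	64
#define TOY_WORD_BITS	(sizeof(unsigned long) * CHAR_BIT)
#define TOY_NWORDS	((TOY_MAXCPU + TOY_WORD_BITS - 1) / TOY_WORD_BITS)

typedef struct {
	_Atomic unsigned long bits[TOY_NWORDS];
} toy_cpuset;

/* Give every word a defined initial value of zero. */
static void
toy_init(toy_cpuset *s)
{
	size_t i;

	for (i = 0; i < TOY_NWORDS; i++)
		atomic_init(&s->bits[i], 0UL);
}

/* Set the bit for one CPU id (the role CPU_SET() plays in the diff). */
static void
toy_set(toy_cpuset *s, unsigned cpu)
{
	assert(cpu < TOY_MAXCPU);
	atomic_fetch_or(&s->bits[cpu / TOY_WORD_BITS],
	    1UL << (cpu % TOY_WORD_BITS));
}

/* dst |= src, one atomic OR per word (the role of CPU_OR_ATOMIC()). */
static void
toy_or_atomic(toy_cpuset *dst, toy_cpuset *src)
{
	size_t i;

	for (i = 0; i < TOY_NWORDS; i++)
		atomic_fetch_or(&dst->bits[i], atomic_load(&src->bits[i]));
}

/* dst &= ~src, one atomic AND per word (the role of CPU_NAND_ATOMIC()). */
static void
toy_nand_atomic(toy_cpuset *dst, toy_cpuset *src)
{
	size_t i;

	for (i = 0; i < TOY_NWORDS; i++)
		atomic_fetch_and(&dst->bits[i], ~atomic_load(&src->bits[i]));
}

/* True if no bit is set in any word (the role of CPU_EMPTY()). */
static bool
toy_empty(toy_cpuset *s)
{
	size_t i;

	for (i = 0; i < TOY_NWORDS; i++)
		if (atomic_load(&s->bits[i]) != 0)
			return (false);
	return (true);
}
```

In this sketch each word is updated atomically, but the set as a whole is not updated in a single atomic step. A set wider than one machine word therefore cannot be tested with a plain truth test, which is why the emptiness check has to visit every word.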
Index: head/sys/mips/rmi/xlr_machdep.c
===================================================================
--- head/sys/mips/rmi/xlr_machdep.c	(revision 222812)
+++ head/sys/mips/rmi/xlr_machdep.c	(revision 222813)
@@ -1,631 +1,635 @@
 /*-
  * Copyright (c) 2006-2009 RMI Corporation
  * Copyright (c) 2002-2004 Juli Mallett <jmallett@FreeBSD.org>
  * All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
  * 1. Redistributions of source code must retain the above copyright
  *    notice, this list of conditions and the following disclaimer.
  * 2. Redistributions in binary form must reproduce the above copyright
  *    notice, this list of conditions and the following disclaimer in the
  *    documentation and/or other materials provided with the distribution.
  *
  * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
  * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
  * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
  * ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
  * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
  * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
  * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
  * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
  * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
  * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
  * SUCH DAMAGE.
  *
  */
 #include <sys/cdefs.h>
 __FBSDID("$FreeBSD$");
 
 #include "opt_ddb.h"
 
 #include <sys/param.h>
 #include <sys/bus.h>
 #include <sys/conf.h>
 #include <sys/rtprio.h>
 #include <sys/systm.h>
 #include <sys/interrupt.h>
 #include <sys/limits.h>
 #include <sys/lock.h>
 #include <sys/malloc.h>
 #include <sys/mutex.h>
 #include <sys/random.h>
 
 #include <sys/cons.h>		/* cinit() */
 #include <sys/kdb.h>
 #include <sys/reboot.h>
 #include <sys/queue.h>
 #include <sys/smp.h>
 #include <sys/timetc.h>
 
 #include <vm/vm.h>
 #include <vm/vm_page.h>
 
 #include <machine/cpu.h>
 #include <machine/cpufunc.h>
 #include <machine/cpuinfo.h>
 #include <machine/cpuregs.h>
 #include <machine/frame.h>
 #include <machine/hwfunc.h>
 #include <machine/md_var.h>
 #include <machine/asm.h>
 #include <machine/pmap.h>
 #include <machine/trap.h>
 #include <machine/clock.h>
 #include <machine/fls64.h>
 #include <machine/intr_machdep.h>
 #include <machine/smp.h>
 
 #include <mips/rmi/iomap.h>
 #include <mips/rmi/msgring.h>
 #include <mips/rmi/interrupt.h>
 #include <mips/rmi/pic.h>
 #include <mips/rmi/board.h>
 #include <mips/rmi/rmi_mips_exts.h>
 #include <mips/rmi/rmi_boot_info.h>
 
 void mpwait(void);
 unsigned long xlr_io_base = (unsigned long)(DEFAULT_XLR_IO_BASE);
 
 /* 4KB static data area to keep a copy of the bootload env until
    the dynamic kenv is set up */
 char boot1_env[4096];
 int rmi_spin_mutex_safe=0;
 struct mtx xlr_pic_lock;
 
 /*
  * Parameters from boot loader
  */
 struct boot1_info xlr_boot1_info;
 int xlr_run_mode;
 int xlr_argc;
 int32_t *xlr_argv, *xlr_envp;
 uint64_t cpu_mask_info;
 uint32_t xlr_online_cpumask;
 uint32_t xlr_core_cpu_mask = 0x1;	/* Core 0 thread 0 is always there */
 
 int xlr_shtlb_enabled;
 int xlr_ncores;
 int xlr_threads_per_core;
 uint32_t xlr_hw_thread_mask;
 int xlr_cpuid_to_hwtid[MAXCPU];
 int xlr_hwtid_to_cpuid[MAXCPU];
 
 static void 
 xlr_setup_mmu_split(void)
 {
 	uint64_t mmu_setup;
 	int val = 0;
 
 	if (xlr_threads_per_core == 4 && xlr_shtlb_enabled == 0)
 		return;   /* no change from boot setup */	
 
 	switch (xlr_threads_per_core) {
 	case 1: 
 		val = 0; break;
 	case 2: 
 		val = 2; break;
 	case 4: 
 		val = 3; break;
 	}
 	
 	mmu_setup = read_xlr_ctrl_register(4, 0);
 	mmu_setup = mmu_setup & ~0x06;
 	mmu_setup |= (val << 1);
 
 	/* turn on global mode */
 	if (xlr_shtlb_enabled)
 		mmu_setup |= 0x01;
 
 	write_xlr_ctrl_register(4, 0, mmu_setup);
 }
 
 static void
 xlr_parse_mmu_options(void)
 {
 #ifdef notyet
 	char *hw_env, *start, *end;
 #endif
 	uint32_t cpu_map;
 	uint8_t core0_thr_mask, core_thr_mask;
 	int i, j, k;
 
 	/* First check for the shared TLB setup */
 	xlr_shtlb_enabled = 0;
 #ifdef notyet
 	/* 
 	 * We don't support sharing TLB per core - TODO
 	 */
 	xlr_shtlb_enabled = 0;
 	if ((hw_env = getenv("xlr.shtlb")) != NULL) {
 		start = hw_env;
 		tmp = strtoul(start, &end, 0);
 		if (start != end)
 			xlr_shtlb_enabled = (tmp != 0);
 		else
 			printf("Bad value for xlr.shtlb [%s]\n", hw_env);
 		freeenv(hw_env);
 	}
 #endif
 	/*
 	 * XLR supports splitting the 64 TLB entries across one, two or four
 	 * threads (split mode).  XLR also allows the 64 TLB entries to be shared
          * across all threads in the core using a global flag (shared TLB mode).
          * We will support 1/2/4 threads in split mode or shared mode.
 	 *
 	 */
 	xlr_ncores = 1;
 	cpu_map = xlr_boot1_info.cpu_online_map;
 
 #ifndef SMP /* Uniprocessor! */
 	if (cpu_map != 0x1) {
 		printf("WARNING: Starting uniprocessor kernel on cpumask [0x%lx]!\n"
 		   "WARNING: Other CPUs will be unused.\n", (u_long)cpu_map);
 		cpu_map = 0x1;
 	}
 #endif
 	core0_thr_mask = cpu_map & 0xf;
 	switch (core0_thr_mask) {
 	case 1:
 		xlr_threads_per_core = 1; break;
 	case 3:
 		xlr_threads_per_core = 2; break;
 	case 0xf: 
 		xlr_threads_per_core = 4; break;
 	default:
 		goto unsupp;
 	}
 
 	/* Verify other cores CPU masks */
 	for (i = 1; i < XLR_MAX_CORES; i++) {
 		core_thr_mask = (cpu_map >> (i*4)) & 0xf;
 		if (core_thr_mask) {
 			if (core_thr_mask != core0_thr_mask)
 				goto unsupp; 
 			xlr_ncores++;
 		}
 	}
 	xlr_hw_thread_mask = cpu_map;
 
 	/* setup hardware processor id to cpu id mapping */
 	for (i = 0; i< MAXCPU; i++)
 		xlr_cpuid_to_hwtid[i] = 
 		    xlr_hwtid_to_cpuid [i] = -1;
 	for (i = 0, k = 0; i < XLR_MAX_CORES; i++) {
 		if (((cpu_map >> (i*4)) & 0xf) == 0)
 			continue;
 		for (j = 0; j < xlr_threads_per_core; j++) {
 			xlr_cpuid_to_hwtid[k] = i*4 + j;
 			xlr_hwtid_to_cpuid[i*4 + j] = k;
 			k++;
 		}
 	}
 
 	/* setup for the startup core */
 	xlr_setup_mmu_split();
 	return;
 
 unsupp:
 	printf("ERROR : Unsupported CPU mask [use 1,2 or 4 threads per core].\n"
 	    "\tcore0 thread mask [%lx], boot cpu mask [%lx]\n"
 	    "\tUsing default, 16 TLB entries per CPU, split mode\n", 
 	    (u_long)core0_thr_mask, (u_long)cpu_map);
 	panic("Invalid CPU mask - halting.\n");
 	return;
 }
 
 static void 
 xlr_set_boot_flags(void)
 {
 	char *p;
 
 	p = getenv("bootflags");
 	if (p == NULL)
 		p = getenv("boot_flags");  /* old style */
 	if (p == NULL)
 		return;
 
 	for (; p && *p != '\0'; p++) {
 		switch (*p) {
 		case 'd':
 		case 'D':
 			boothowto |= RB_KDB;
 			break;
 		case 'g':
 		case 'G':
 			boothowto |= RB_GDB;
 			break;
 		case 'v':
 		case 'V':
 			boothowto |= RB_VERBOSE;
 			break;
 
 		case 's':	/* single-user (default, supported for sanity) */
 		case 'S':
 			boothowto |= RB_SINGLE;
 			break;
 
 		default:
 			printf("Unrecognized boot flag '%c'.\n", *p);
 			break;
 		}
 	}
 
 	freeenv(p);
 	return;
 }
 extern uint32_t _end;
 
 static void
 mips_init(void)
 {
 	init_param1();
 	init_param2(physmem);
 
 	mips_cpu_init();
 	cpuinfo.cache_coherent_dma = TRUE;
 	pmap_bootstrap();
 #ifdef DDB
 	kdb_init();
 	if (boothowto & RB_KDB) {
 		kdb_enter("Boot flags requested debugger", NULL);
 	}
 #endif
 	mips_proc0_init();
 	mutex_init();
 }
 
 u_int
 platform_get_timecount(struct timecounter *tc __unused)
 {
 
 	return (0xffffffffU - pic_timer_count32(PIC_CLOCK_TIMER));
 }
 
 static void 
 xlr_pic_init(void)
 {
 	struct timecounter pic_timecounter = {
 		platform_get_timecount, /* get_timecount */
 		0,                      /* no poll_pps */
 		~0U,                    /* counter_mask */
 		PIC_TIMER_HZ,           /* frequency */
 		"XLRPIC",               /* name */
 		2000,                   /* quality (adjusted in code) */
 	};
 	xlr_reg_t *mmio = xlr_io_mmio(XLR_IO_PIC_OFFSET);
 	int i, irq;
 
 	write_c0_eimr64(0ULL);
 	mtx_init(&xlr_pic_lock, "pic", NULL, MTX_SPIN);
 	xlr_write_reg(mmio, PIC_CTRL, 0);
 
 	/* Initialize all IRT entries */
 	for (i = 0; i < PIC_NUM_IRTS; i++) {
 		irq = PIC_INTR_TO_IRQ(i);
 
 		/*
 		 * Disable all IRTs. Set defaults (local scheduling, high
 		 * polarity, level triggered, and CPU irq)
 		 */
 		xlr_write_reg(mmio, PIC_IRT_1(i), (1 << 30) | (1 << 6) | irq);
 		/* Bind all PIC irqs to cpu 0 */
 		xlr_write_reg(mmio, PIC_IRT_0(i), 0x01);
 	}
 
 	/* Setup timer 7 of PIC as a timestamp, no interrupts */
 	pic_init_timer(PIC_CLOCK_TIMER);
 	pic_set_timer(PIC_CLOCK_TIMER, ~UINT64_C(0));
 	platform_timecounter = &pic_timecounter;
 }
 
 static void
 xlr_mem_init(void)
 {
 	struct xlr_boot1_mem_map *boot_map;
 	vm_size_t physsz = 0;
 	int i, j;
 
 	/* get physical memory info from boot loader */
 	boot_map = (struct xlr_boot1_mem_map *)
 	    (unsigned long)xlr_boot1_info.psb_mem_map;
 	for (i = 0, j = 0; i < boot_map->num_entries; i++, j += 2) {
 		if (boot_map->physmem_map[i].type != BOOT1_MEM_RAM)
 			continue;
 		if (j == 14) {
 			printf("*** ERROR *** memory map too large ***\n");
 			break;
 		}
 		if (j == 0) {
 			/* start after kernel end */
 			phys_avail[0] = (vm_paddr_t)
 			    MIPS_KSEG0_TO_PHYS(&_end) + 0x20000;
 			/* boot loader start */
 			/* HACK to Use bootloaders memory region */
 			if (boot_map->physmem_map[0].size == 0x0c000000) {
 				boot_map->physmem_map[0].size = 0x0ff00000;
 			}
 			phys_avail[1] = boot_map->physmem_map[0].addr +
 			    boot_map->physmem_map[0].size;
 			printf("First segment: addr:%#jx -> %#jx \n",
 			       (uintmax_t)phys_avail[0], 
 			       (uintmax_t)phys_avail[1]);
 
 			dump_avail[0] = phys_avail[0];
 			dump_avail[1] = phys_avail[1];
 		} else {
 #if !defined(__mips_n64) && !defined(__mips_n32) /* !PHYSADDR_64_BIT */
 			/*
 			 * In 32 bit physical address mode we cannot use 
 			 * mem > 0xffffffff
 			 */
 			if (boot_map->physmem_map[i].addr > 0xfffff000U) {
 				printf("Memory: start %#jx size %#jx ignored"
 				    "(>4GB)\n",
 				    (intmax_t)boot_map->physmem_map[i].addr,
 				    (intmax_t)boot_map->physmem_map[i].size);
 				continue;
 			}
 			if (boot_map->physmem_map[i].addr +
 			    boot_map->physmem_map[i].size > 0xfffff000U) {
 				boot_map->physmem_map[i].size = 0xfffff000U - 
 				    boot_map->physmem_map[i].addr;
 				printf("Memory: start %#jx limited to 4GB\n",
 				    (intmax_t)boot_map->physmem_map[i].addr);
 			}
 #endif /* !PHYSADDR_64_BIT */
 			phys_avail[j] = (vm_paddr_t)
 			    boot_map->physmem_map[i].addr;
 			phys_avail[j + 1] = phys_avail[j] +
 			    boot_map->physmem_map[i].size;
 			printf("Next segment : addr:%#jx -> %#jx\n",
 			       (uintmax_t)phys_avail[j], 
 			       (uintmax_t)phys_avail[j+1]);
 		}
 
 		dump_avail[j] = phys_avail[j];
 		dump_avail[j+1] = phys_avail[j+1];
 
 		physsz += boot_map->physmem_map[i].size;
 	}
 
 	phys_avail[j] = phys_avail[j + 1] = 0;
 	realmem = physmem = btoc(physsz);
 }
 
 void
 platform_start(__register_t a0 __unused,
     __register_t a1 __unused,
     __register_t a2 __unused,
     __register_t a3 __unused)
 {
 	int i;
 #ifdef SMP
 	uint32_t tmp;
 	void (*wakeup) (void *, void *, unsigned int);
 #endif
 
 	/* Save boot loader and other stuff from scratch regs */
 	xlr_boot1_info = *(struct boot1_info *)(intptr_t)(int)read_c0_register32(MIPS_COP_0_OSSCRATCH, 0);
 	cpu_mask_info = read_c0_register64(MIPS_COP_0_OSSCRATCH, 1);
 	xlr_online_cpumask = read_c0_register32(MIPS_COP_0_OSSCRATCH, 2);
 	xlr_run_mode = read_c0_register32(MIPS_COP_0_OSSCRATCH, 3);
 	xlr_argc = read_c0_register32(MIPS_COP_0_OSSCRATCH, 4);
 	/*
 	 * argv and envp are passed in array of 32bit pointers
 	 */
 	xlr_argv = (int32_t *)(intptr_t)(int)read_c0_register32(MIPS_COP_0_OSSCRATCH, 5);
 	xlr_envp = (int32_t *)(intptr_t)(int)read_c0_register32(MIPS_COP_0_OSSCRATCH, 6);
 
 	/* Initialize pcpu stuff */
 	mips_pcpu0_init();
 
 	/* initialize console so that we have printf */
 	boothowto |= (RB_SERIAL | RB_MULTIPLE);	/* Use multiple consoles */
 
 	/* clockrate used by delay, so initialize it here */
 	cpu_clock = xlr_boot1_info.cpu_frequency / 1000000;
 
 	/*
 	 * Note the time counter on CPU0 runs not at system clock speed, but
 	 * at PIC time counter speed (which is returned by
 	 * platform_get_frequency()).  Thus we do not use
 	 * xlr_boot1_info.cpu_frequency here.
 	 */
 	mips_timer_early_init(xlr_boot1_info.cpu_frequency);
 
 	/* Init console please */
 	cninit();
 	init_static_kenv(boot1_env, sizeof(boot1_env));
 	printf("Environment (from %d args):\n", xlr_argc - 1);
 	if (xlr_argc == 1)
 		printf("\tNone\n");
 	for (i = 1; i < xlr_argc; i++) {
 		char *n, *arg;
 
 		arg = (char *)(intptr_t)xlr_argv[i];
 		printf("\t%s\n", arg);
 		n = strsep(&arg, "=");
 		if (arg == NULL)
 			setenv(n, "1");
 		else
 			setenv(n, arg);
 	}
 
 	xlr_set_boot_flags();
 	xlr_parse_mmu_options();
 
 	xlr_mem_init();
 	/* Set up hz, among others. */
 	mips_init();
 
 #ifdef SMP
 	/*
 	 * If thread 0 of any core is not available then mark whole core as
 	 * not available
 	 */
 	tmp = xlr_boot1_info.cpu_online_map;
 	for (i = 4; i < MAXCPU; i += 4) {
 		if ((tmp & (0xf << i)) && !(tmp & (0x1 << i))) {
 			/*
 			 * Oops.. thread 0 is not available. Disable whole
 			 * core
 			 */
 			tmp = tmp & ~(0xf << i);
 			printf("WARNING: Core %d is disabled because thread 0"
 			    " of this core is not enabled.\n", i / 4);
 		}
 	}
 	xlr_boot1_info.cpu_online_map = tmp;
 
 	/* Wakeup Other cpus, and put them in bsd park code. */
 	wakeup = ((void (*) (void *, void *, unsigned int))
 	    (unsigned long)(xlr_boot1_info.wakeup));
 	printf("Waking up CPUs 0x%jx.\n", 
 	    (intmax_t)xlr_boot1_info.cpu_online_map & ~(0x1U));
 	if (xlr_boot1_info.cpu_online_map & ~(0x1U))
 		wakeup(mpwait, 0,
 		    (unsigned int)xlr_boot1_info.cpu_online_map);
 #endif
 
 	/* xlr specific post initialization */
 	/* initialize other on chip stuff */
 	xlr_board_info_setup();
 	xlr_msgring_config();
 	xlr_pic_init();
 	xlr_msgring_cpu_init();
 
 	mips_timer_init_params(xlr_boot1_info.cpu_frequency, 0);
 
 	printf("Platform specific startup now completes\n");
 }
 
 void 
 platform_cpu_init()
 {
 }
 
 void
 platform_identify(void)
 {
 
 	printf("Board [%d:%d], processor 0x%08x\n", (int)xlr_boot1_info.board_major_version,
 	    (int)xlr_boot1_info.board_minor_version, mips_rd_prid());
 }
 
 void
 platform_trap_enter(void)
 {
 }
 
 void
 platform_reset(void)
 {
 	xlr_reg_t *mmio = xlr_io_mmio(XLR_IO_GPIO_OFFSET);
 
 	/* write 1 to GPIO software reset register */
 	xlr_write_reg(mmio, 8, 1);
 }
 
 void
 platform_trap_exit(void)
 {
 }
 
 #ifdef SMP
 int xlr_ap_release[MAXCPU];
 
 int
 platform_start_ap(int cpuid)
 {
 	int hwid = xlr_cpuid_to_hwtid[cpuid];
 
 	if (xlr_boot1_info.cpu_online_map & (1<<hwid)) {
 		/*
 		 * other cpus are enabled by the boot loader and they will be 
 		 * already looping in mpwait, release them
 		 */
 		atomic_store_rel_int(&xlr_ap_release[hwid], 1);
 		return (0);
 	} else
 		return (-1);
 }
 
 void
 platform_init_ap(int cpuid)
 {
 	uint32_t stat;
 
 	/* The first thread has to setup the core MMU split  */
 	if (xlr_thr_id() == 0)
 		xlr_setup_mmu_split();
 
 	/* Setup interrupts for secondary CPUs here */
 	stat = mips_rd_status();
 	KASSERT((stat & MIPS_SR_INT_IE) == 0,
 	    ("Interrupts enabled in %s!", __func__));
 	stat |= MIPS_SR_COP_2_BIT | MIPS_SR_COP_0_BIT;
 	mips_wr_status(stat);
 
 	write_c0_eimr64(0ULL);
 	xlr_enable_irq(IRQ_IPI);
 	xlr_enable_irq(IRQ_TIMER);
 	if (xlr_thr_id() == 0)
 		xlr_msgring_cpu_init(); 
 	 xlr_enable_irq(IRQ_MSGRING);
 
 	return;
 }
 
 int
 platform_ipi_intrnum(void) 
 {
 
 	return (IRQ_IPI);
 }
 
 void
 platform_ipi_send(int cpuid)
 {
 
 	pic_send_ipi(xlr_cpuid_to_hwtid[cpuid], platform_ipi_intrnum());
 }
 
 void
 platform_ipi_clear(void)
 {
 }
 
 int
 platform_processor_id(void)
 {
 
 	return (xlr_hwtid_to_cpuid[xlr_cpu_id()]);
 }
 
-cpumask_t
-platform_cpu_mask(void)
+void
+platform_cpu_mask(cpuset_t *mask)
 {
+	int i, s;
 
-	return (~0U >> (32 - (xlr_ncores * xlr_threads_per_core)));
+	CPU_ZERO(mask);
+	s = xlr_ncores * xlr_threads_per_core;
+	for (i = 0; i < s; i++)
+		CPU_SET(i, mask);
 }
 
 struct cpu_group *
 platform_smp_topo()
 {
 
 	return (smp_topo_2level(CG_SHARE_L2, xlr_ncores, CG_SHARE_L1,
 		xlr_threads_per_core, CG_FLAG_THREAD));
 }
 #endif
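Note on the platform_cpu_mask() hunk above: the old version returned a single 32-bit word computed as ~0U >> (32 - n), while the new version zeroes a caller-supplied set and sets one bit per CPU id. The sketch below compares the two forms under stated assumptions: struct sketch_cpuset, sketch_legacy_mask() and sketch_cpu_mask() are hypothetical names invented for illustration and are not part of the FreeBSD sources.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical fixed-size set of 32-bit words; NOT FreeBSD's cpuset_t. */
#define SKETCH_MAXCPU	64
#define SKETCH_NWORDS	(SKETCH_MAXCPU / 32)

struct sketch_cpuset {
	uint32_t bits[SKETCH_NWORDS];
};

/*
 * Old style: one word with the low "ncpus" bits set.  Only meaningful
 * for 1 <= ncpus <= 32, because shifting a 32-bit value by 32 is
 * undefined in C.
 */
static uint32_t
sketch_legacy_mask(int ncpus)
{
	assert(ncpus >= 1 && ncpus <= 32);
	return (UINT32_MAX >> (32 - ncpus));
}

/* New style: clear the whole set, then set one bit per CPU id. */
static void
sketch_cpu_mask(struct sketch_cpuset *mask, int ncpus)
{
	int i;

	assert(ncpus >= 0 && ncpus <= SKETCH_MAXCPU);
	memset(mask, 0, sizeof(*mask));
	for (i = 0; i < ncpus; i++)
		mask->bits[i / 32] |= UINT32_C(1) << (i % 32);
}
```

For counts up to 32 the first word of the loop-built set matches the legacy expression. The loop form keeps working past 32 CPUs, where the single-word shift cannot represent the result.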
Index: head/sys/mips/sibyte/sb_scd.c
===================================================================
--- head/sys/mips/sibyte/sb_scd.c	(revision 222812)
+++ head/sys/mips/sibyte/sb_scd.c	(revision 222813)
@@ -1,301 +1,306 @@
 /*-
  * Copyright (c) 2009 Neelkanth Natu
  * All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
  * 1. Redistributions of source code must retain the above copyright
  *    notice, this list of conditions and the following disclaimer.
  * 2. Redistributions in binary form must reproduce the above copyright
  *    notice, this list of conditions and the following disclaimer in the
  *    documentation and/or other materials provided with the distribution.
  *
  * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
  * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
  * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
  * ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
  * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
  * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
  * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
  * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
  * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
  * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
  * SUCH DAMAGE.
  */
 
 #include <sys/cdefs.h>
 __FBSDID("$FreeBSD$");
 
 #include <sys/param.h>
 #include <sys/kernel.h>
 #include <sys/systm.h>
 #include <sys/module.h>
 #include <sys/bus.h>
+#include <sys/cpuset.h>
 
 #include <machine/resource.h>
 #include <machine/hwfunc.h>
 
 #include "sb_scd.h"
 
 /*
  * We compile a 32-bit kernel to run on the SB-1 processor which is a 64-bit
  * processor. It has some registers that must be accessed using 64-bit load
  * and store instructions.
  *
  * We use the mips_ld() and mips_sd() functions to do this for us.
  */
 #define	sb_store64(addr, val)	mips3_sd((uint64_t *)(uintptr_t)(addr), (val))
 #define	sb_load64(addr)		mips3_ld((uint64_t *)(uintptr_t)(addr))
 
 /*
  * System Control and Debug (SCD) unit on the Sibyte ZBbus.
  */
 
 /*
  * Extract the value starting at bit position 'b' for 'n' bits from 'x'.
  */
 #define	GET_VAL_64(x, b, n)	(((x) >> (b)) & ((1ULL << (n)) - 1))
 
 #define	SYSREV_ADDR		MIPS_PHYS_TO_KSEG1(0x10020000)
 #define	SYSREV_NUM_PROCESSORS(x) GET_VAL_64((x), 24, 4)
 
 #define	SYSCFG_ADDR		MIPS_PHYS_TO_KSEG1(0x10020008)
 #define SYSCFG_PLLDIV(x)	GET_VAL_64((x), 7, 5)
 
 #define	ZBBUS_CYCLE_COUNT_ADDR	MIPS_PHYS_TO_KSEG1(0x10030000)
 
 #define	INTSRC_MASK_ADDR(cpu)	\
 	(MIPS_PHYS_TO_KSEG1(0x10020028) | ((cpu) << 13))
 
 #define	INTSRC_MAP_ADDR(cpu, intsrc)	\
 	(MIPS_PHYS_TO_KSEG1(0x10020200) | ((cpu) << 13)) + (intsrc * 8)
 
 #define	MAILBOX_SET_ADDR(cpu)	\
 	(MIPS_PHYS_TO_KSEG1(0x100200C8) | ((cpu) << 13))
 
 #define	MAILBOX_CLEAR_ADDR(cpu)	\
 	(MIPS_PHYS_TO_KSEG1(0x100200D0) | ((cpu) << 13))
 
 static uint64_t
 sb_read_syscfg(void)
 {
 
 	return (sb_load64(SYSCFG_ADDR));
 }
 
 static void
 sb_write_syscfg(uint64_t val)
 {
 	
 	sb_store64(SYSCFG_ADDR, val);
 }
 
 uint64_t
 sb_zbbus_cycle_count(void)
 {
 
 	return (sb_load64(ZBBUS_CYCLE_COUNT_ADDR));
 }
 
 uint64_t
 sb_cpu_speed(void)
 {
 	int plldiv;
 	const uint64_t MHZ = 1000000;
 	
 	plldiv = SYSCFG_PLLDIV(sb_read_syscfg());
 	if (plldiv == 0) {
 		printf("PLL_DIV is 0 - assuming 6 (300MHz).\n");
 		plldiv = 6;
 	}
 
 	return (plldiv * 50 * MHZ);
 }
 
 void
 sb_system_reset(void)
 {
 	uint64_t syscfg;
 
 	const uint64_t SYSTEM_RESET = 1ULL << 60;
 	const uint64_t EXT_RESET = 1ULL << 59;
 	const uint64_t SOFT_RESET = 1ULL << 58;
 
 	syscfg = sb_read_syscfg();
 	syscfg &= ~SOFT_RESET;
 	syscfg |= SYSTEM_RESET | EXT_RESET;
 	sb_write_syscfg(syscfg);
 }
 
 void
 sb_disable_intsrc(int cpu, int src)
 {
 	int regaddr;
 	uint64_t val;
 
 	regaddr = INTSRC_MASK_ADDR(cpu);
 
 	val = sb_load64(regaddr);
 	val |= 1ULL << src;
 	sb_store64(regaddr, val);
 }
 
 void
 sb_enable_intsrc(int cpu, int src)
 {
 	int regaddr;
 	uint64_t val;
 
 	regaddr = INTSRC_MASK_ADDR(cpu);
 
 	val = sb_load64(regaddr);
 	val &= ~(1ULL << src);
 	sb_store64(regaddr, val);
 }
 
 void
 sb_write_intsrc_mask(int cpu, uint64_t val)
 {
 	int regaddr;
 
 	regaddr = INTSRC_MASK_ADDR(cpu);
 	sb_store64(regaddr, val);
 }
 
 uint64_t
 sb_read_intsrc_mask(int cpu)
 {
 	int regaddr;
 	uint64_t val;
 
 	regaddr = INTSRC_MASK_ADDR(cpu);
 	val = sb_load64(regaddr);
 
 	return (val);
 }
 
 void
 sb_write_intmap(int cpu, int intsrc, int intrnum)
 {
 	int regaddr;
 
 	regaddr = INTSRC_MAP_ADDR(cpu, intsrc);
 	sb_store64(regaddr, intrnum);
 }
 
 int
 sb_read_intmap(int cpu, int intsrc)
 {
 	int regaddr;
 
 	regaddr = INTSRC_MAP_ADDR(cpu, intsrc);
 	return (sb_load64(regaddr) & 0x7);
 }
 
 int
 sb_route_intsrc(int intsrc)
 {
 	int intrnum;
 
 	KASSERT(intsrc >= 0 && intsrc < NUM_INTSRC,
 		("Invalid interrupt source number (%d)", intsrc));
 
 	/*
 	 * Interrupt 5 is used by sources internal to the CPU (e.g. timer).
 	 * Use a deterministic mapping for the remaining sources.
 	 */
 #ifdef SMP
 	KASSERT(platform_ipi_intrnum() == 4,
 		("Unexpected interrupt number used for IPI"));
 	intrnum = intsrc % 4;
 #else
 	intrnum = intsrc % 5;
 #endif
 
 	return (intrnum);
 }
 
 #ifdef SMP
 static uint64_t
 sb_read_sysrev(void)
 {
 
 	return (sb_load64(SYSREV_ADDR));
 }
 
 void
 sb_set_mailbox(int cpu, uint64_t val)
 {
 	int regaddr;
 
 	regaddr = MAILBOX_SET_ADDR(cpu);
 	sb_store64(regaddr, val);
 }
 
 void
 sb_clear_mailbox(int cpu, uint64_t val)
 {
 	int regaddr;
 
 	regaddr = MAILBOX_CLEAR_ADDR(cpu);
 	sb_store64(regaddr, val);
 }
 
-cpumask_t
-platform_cpu_mask(void)
+void
+platform_cpu_mask(cpuset_t *mask)
 {
+	int i, s;
 
-	return (~0U >> (32 - SYSREV_NUM_PROCESSORS(sb_read_sysrev())));
+	CPU_ZERO(mask);
+	s = SYSREV_NUM_PROCESSORS(sb_read_sysrev());
+	for (i = 0; i < s; i++)
+		CPU_SET(i, mask);
 }
 #endif	/* SMP */
 
 #define	SCD_PHYSADDR	0x10000000
 #define	SCD_SIZE	0x00060000
 
 static int
 scd_probe(device_t dev)
 {
 
 	device_set_desc(dev, "Broadcom/Sibyte System Control and Debug");
 	return (0);
 }
 
 static int
 scd_attach(device_t dev)
 {
 	int rid;
 	struct resource *res;
 
 	if (bootverbose)
 		device_printf(dev, "attached.\n");
 
 	rid = 0;
 	res = bus_alloc_resource(dev, SYS_RES_MEMORY, &rid, SCD_PHYSADDR,
 				 SCD_PHYSADDR + SCD_SIZE - 1, SCD_SIZE, 0);
 	if (res == NULL)
 		panic("Cannot allocate resource for system control and debug.");
 	
 	return (0);
 }
 
 static device_method_t scd_methods[] ={
 	/* Device interface */
 	DEVMETHOD(device_probe,		scd_probe),
 	DEVMETHOD(device_attach,	scd_attach),
 	DEVMETHOD(device_detach,	bus_generic_detach),
 	DEVMETHOD(device_shutdown,	bus_generic_shutdown),
 	DEVMETHOD(device_suspend,	bus_generic_suspend),
 	DEVMETHOD(device_resume,	bus_generic_resume),
 
 	{ 0, 0 }
 };
 
 static driver_t scd_driver = {
 	"scd",
 	scd_methods
 };
 
 static devclass_t scd_devclass;
 
 DRIVER_MODULE(scd, zbbus, scd_driver, scd_devclass, 0, 0);
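
Reviewer note on the Sibyte platform_cpu_mask() above: the loop bound comes from SYSREV_NUM_PROCESSORS(), a 4-bit field starting at bit 24 of the system revision register, so the count can never exceed 15. The sketch below copies the two macros from the diff context and exercises them on made-up register values. `demo_sysrev_ncpus` and the sample values are hypothetical and are not read from real hardware.

```c
#include <stdint.h>

/* Copied from sb_scd.c: extract 'n' bits of 'x' starting at bit 'b'. */
#define	GET_VAL_64(x, b, n)	(((x) >> (b)) & ((1ULL << (n)) - 1))
#define	SYSREV_NUM_PROCESSORS(x) GET_VAL_64((x), 24, 4)

/* Hypothetical helper: CPU count encoded in a SYSREV register value. */
static int
demo_sysrev_ncpus(uint64_t sysrev)
{

	return ((int)SYSREV_NUM_PROCESSORS(sysrev));
}
```

With a value of 0x02000000 the helper yields 2, which is the number of iterations the CPU_SET loop would run on a two-processor part.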
Index: head/sys/ofed/include/linux/list.h
===================================================================
--- head/sys/ofed/include/linux/list.h	(revision 222812)
+++ head/sys/ofed/include/linux/list.h	(revision 222813)
@@ -1,331 +1,332 @@
 /*-
  * Copyright (c) 2010 Isilon Systems, Inc.
  * Copyright (c) 2010 iX Systems, Inc.
  * Copyright (c) 2010 Panasas, Inc.
  * All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
  * 1. Redistributions of source code must retain the above copyright
  *    notice unmodified, this list of conditions, and the following
  *    disclaimer.
  * 2. Redistributions in binary form must reproduce the above copyright
  *    notice, this list of conditions and the following disclaimer in the
  *    documentation and/or other materials provided with the distribution.
  *
  * THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR
  * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
  * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
  * IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT,
  * INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT
  * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
  * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
  * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
  * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF
  * THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
  */
 #ifndef _LINUX_LIST_H_
 #define _LINUX_LIST_H_
 
 /*
  * Since LIST_HEAD conflicts with the linux definition we must include any
  * FreeBSD header which requires it here so it is resolved with the correct
  * definition prior to the undef.
  */
 #include <linux/types.h>
 
 #include <sys/param.h>
 #include <sys/kernel.h>
 #include <sys/queue.h>
+#include <sys/cpuset.h>
 #include <sys/lock.h>
 #include <sys/mutex.h>
 #include <sys/proc.h>
 #include <sys/vnode.h>
 #include <sys/conf.h>
 #include <sys/socket.h>
 #include <sys/mbuf.h>
 
 #include <net/bpf.h>
 #include <net/if.h>
 #include <net/if_types.h>
 #include <net/if_media.h>
 
 #include <netinet/in.h>
 #include <netinet/in_pcb.h>
 
 #include <netinet6/in6_var.h>
 #include <netinet6/nd6.h>
 
 #include <vm/vm.h>
 #include <vm/vm_object.h>
 
 #define	prefetch(x)
 
 struct list_head {
 	struct list_head *next;
 	struct list_head *prev;
 };
 
 static inline void
 INIT_LIST_HEAD(struct list_head *list)
 {
 
 	list->next = list->prev = list;
 }
  
 static inline int
 list_empty(const struct list_head *head)
 {
 
 	return (head->next == head);
 }
 
 static inline void
 list_del(struct list_head *entry)
 {
 
 	entry->next->prev = entry->prev;
 	entry->prev->next = entry->next;
 }
 
 static inline void
 _list_add(struct list_head *new, struct list_head *prev,
     struct list_head *next)
 {
 
 	next->prev = new;
 	new->next = next;
 	new->prev = prev;
 	prev->next = new;
 }
 
 static inline void
 list_del_init(struct list_head *entry)
 {	
 
 	list_del(entry);
 	INIT_LIST_HEAD(entry);
 }
 
 #define	list_entry(ptr, type, field)	container_of(ptr, type, field)
 
 #define	list_for_each(p, head)						\
 	for (p = (head)->next; p != (head); p = p->next)
 
 #define	list_for_each_safe(p, n, head)					\
 	for (p = (head)->next, n = p->next; p != (head); p = n, n = p->next)
 
 #define list_for_each_entry(p, h, field)				\
 	for (p = list_entry((h)->next, typeof(*p), field); &p->field != (h); \
 	    p = list_entry(p->field.next, typeof(*p), field))
 
 #define list_for_each_entry_safe(p, n, h, field)			\
 	for (p = list_entry((h)->next, typeof(*p), field), 		\
 	    n = list_entry(p->field.next, typeof(*p), field); &p->field != (h);\
 	    p = n, n = list_entry(n->field.next, typeof(*n), field))
 
 #define	list_for_each_entry_reverse(p, h, field)			\
 	for (p = list_entry((h)->prev, typeof(*p), field); &p->field != (h); \
 	    p = list_entry(p->field.prev, typeof(*p), field))
 
 #define	list_for_each_prev(p, h) for (p = (h)->prev; p != (h); p = p->prev)
 
 static inline void
 list_add(struct list_head *new, struct list_head *head)
 {
 
 	_list_add(new, head, head->next);
 }
 
 static inline void
 list_add_tail(struct list_head *new, struct list_head *head)
 {
 
 	_list_add(new, head->prev, head);
 }
 
 static inline void
 list_move(struct list_head *list, struct list_head *head)
 {
 
 	list_del(list);
 	list_add(list, head);
 }
 
 static inline void
 list_move_tail(struct list_head *entry, struct list_head *head)
 {
 
 	list_del(entry);
 	list_add_tail(entry, head);
 }
 
 static inline void
 _list_splice(const struct list_head *list, struct list_head *prev,  
     struct list_head *next)
 {
 	struct list_head *first;
 	struct list_head *last;
 
 	if (list_empty(list))
 		return;
 	first = list->next;
 	last = list->prev;
 	first->prev = prev;
 	prev->next = first;
 	last->next = next;
 	next->prev = last;
 }
 
 static inline void
 list_splice(const struct list_head *list, struct list_head *head)
 {
 
 	_list_splice(list, head, head->next);
 } 
 
 static inline void
 list_splice_tail(struct list_head *list, struct list_head *head)
 {
 
 	_list_splice(list, head->prev, head);
 }
  
 static inline void
 list_splice_init(struct list_head *list, struct list_head *head)
 {
 
 	_list_splice(list, head, head->next);
 	INIT_LIST_HEAD(list);   
 }
  
 static inline void
 list_splice_tail_init(struct list_head *list, struct list_head *head)
 {
 
 	_list_splice(list, head->prev, head);
 	INIT_LIST_HEAD(list);
 }
 
 #undef LIST_HEAD
 #define LIST_HEAD(name)	struct list_head name = { &(name), &(name) }
 
 
 struct hlist_head {
 	struct hlist_node *first;
 };
 
 struct hlist_node {
 	struct hlist_node *next, **pprev;
 };
 
 #define	HLIST_HEAD_INIT { }
 #define	HLIST_HEAD(name) struct hlist_head name = HLIST_HEAD_INIT
 #define	INIT_HLIST_HEAD(head) (head)->first = NULL
 #define	INIT_HLIST_NODE(node)						\
 do {									\
 	(node)->next = NULL;						\
 	(node)->pprev = NULL;						\
 } while (0)
 
 static inline int
 hlist_unhashed(const struct hlist_node *h)
 {
 
 	return !h->pprev;
 }
 
 static inline int
 hlist_empty(const struct hlist_head *h)
 {
 
 	return !h->first;
 }
 
 static inline void
 hlist_del(struct hlist_node *n)
 {
 
         if (n->next)
                 n->next->pprev = n->pprev;
         *n->pprev = n->next;
 }
 
 static inline void
 hlist_del_init(struct hlist_node *n)
 {
 
 	if (hlist_unhashed(n))
 		return;
 	hlist_del(n);
 	INIT_HLIST_NODE(n);
 }
 
 static inline void
 hlist_add_head(struct hlist_node *n, struct hlist_head *h)
 {
 
 	n->next = h->first;
 	if (h->first)
 		h->first->pprev = &n->next;
 	h->first = n;
 	n->pprev = &h->first;
 }
 
 static inline void
 hlist_add_before(struct hlist_node *n, struct hlist_node *next)
 {
 
 	n->pprev = next->pprev;
 	n->next = next;
 	next->pprev = &n->next;
 	*(n->pprev) = n;
 }
  
 static inline void
 hlist_add_after(struct hlist_node *n, struct hlist_node *next)
 {
 
 	next->next = n->next;
 	n->next = next;
 	next->pprev = &n->next;
 	if (next->next)
 		next->next->pprev = &next->next;
 }
  
 static inline void
 hlist_move_list(struct hlist_head *old, struct hlist_head *new)
 {
 
 	new->first = old->first;
 	if (new->first)
 		new->first->pprev = &new->first;
 	old->first = NULL;
 }
  
 #define	hlist_entry(ptr, type, field)	container_of(ptr, type, field)
 
 #define	hlist_for_each(p, head)						\
 	for (p = (head)->first; p; p = p->next)
 
 #define	hlist_for_each_safe(p, n, head)					\
 	for (p = (head)->first; p && ({ n = p->next; 1; }); p = n)
 
 #define	hlist_for_each_entry(tp, p, head, field)			\
 	for (p = (head)->first;						\
 	    p ? (tp = hlist_entry(p, typeof(*tp), field)): NULL; p = p->next)
  
 #define hlist_for_each_entry_continue(tp, p, field)			\
 	for (p = (p)->next;						\
 	    p ? (tp = hlist_entry(p, typeof(*tp), field)): NULL; p = p->next)
 
 #define	hlist_for_each_entry_from(tp, p, field)				\
 	for (; p ? (tp = hlist_entry(p, typeof(*tp), field)): NULL; p = p->next)
 
 #define	hlist_for_each_entry_safe(tp, p, n, head, field)		\
 	for (p = (head)->first;	p ?					\
 	    (n = p->next) | (tp = hlist_entry(p, typeof(*tp), field)) :	\
 	    NULL; p = n)
 
 #endif /* _LINUX_LIST_H_ */
Index: head/sys/powerpc/aim/mmu_oea.c
===================================================================
--- head/sys/powerpc/aim/mmu_oea.c	(revision 222812)
+++ head/sys/powerpc/aim/mmu_oea.c	(revision 222813)
@@ -1,2505 +1,2512 @@
 /*-
  * Copyright (c) 2001 The NetBSD Foundation, Inc.
  * All rights reserved.
  *
  * This code is derived from software contributed to The NetBSD Foundation
  * by Matt Thomas <matt@3am-software.com> of Allegro Networks, Inc.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
  * 1. Redistributions of source code must retain the above copyright
  *    notice, this list of conditions and the following disclaimer.
  * 2. Redistributions in binary form must reproduce the above copyright
  *    notice, this list of conditions and the following disclaimer in the
  *    documentation and/or other materials provided with the distribution.
  * 3. All advertising materials mentioning features or use of this software
  *    must display the following acknowledgement:
  *        This product includes software developed by the NetBSD
  *        Foundation, Inc. and its contributors.
  * 4. Neither the name of The NetBSD Foundation nor the names of its
  *    contributors may be used to endorse or promote products derived
  *    from this software without specific prior written permission.
  *
  * THIS SOFTWARE IS PROVIDED BY THE NETBSD FOUNDATION, INC. AND CONTRIBUTORS
  * ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
  * TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
  * PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE FOUNDATION OR CONTRIBUTORS
  * BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
  * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
  * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
  * INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
  * CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
  * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
  * POSSIBILITY OF SUCH DAMAGE.
  */
 /*-
  * Copyright (C) 1995, 1996 Wolfgang Solfrank.
  * Copyright (C) 1995, 1996 TooLs GmbH.
  * All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
  * 1. Redistributions of source code must retain the above copyright
  *    notice, this list of conditions and the following disclaimer.
  * 2. Redistributions in binary form must reproduce the above copyright
  *    notice, this list of conditions and the following disclaimer in the
  *    documentation and/or other materials provided with the distribution.
  * 3. All advertising materials mentioning features or use of this software
  *    must display the following acknowledgement:
  *	This product includes software developed by TooLs GmbH.
  * 4. The name of TooLs GmbH may not be used to endorse or promote products
  *    derived from this software without specific prior written permission.
  *
  * THIS SOFTWARE IS PROVIDED BY TOOLS GMBH ``AS IS'' AND ANY EXPRESS OR
  * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
  * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
  * IN NO EVENT SHALL TOOLS GMBH BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
  * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
  * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
  * OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY,
  * WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR
  * OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF
  * ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
  *
  * $NetBSD: pmap.c,v 1.28 2000/03/26 20:42:36 kleink Exp $
  */
 /*-
  * Copyright (C) 2001 Benno Rice.
  * All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
  * 1. Redistributions of source code must retain the above copyright
  *    notice, this list of conditions and the following disclaimer.
  * 2. Redistributions in binary form must reproduce the above copyright
  *    notice, this list of conditions and the following disclaimer in the
  *    documentation and/or other materials provided with the distribution.
  *
  * THIS SOFTWARE IS PROVIDED BY Benno Rice ``AS IS'' AND ANY EXPRESS OR
  * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
  * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
  * IN NO EVENT SHALL TOOLS GMBH BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
  * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
  * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
  * OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY,
  * WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR
  * OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF
  * ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
  */
 
 #include <sys/cdefs.h>
 __FBSDID("$FreeBSD$");
 
 /*
  * Manages physical address maps.
  *
  * In addition to hardware address maps, this module is called upon to
  * provide software-use-only maps which may or may not be stored in the
  * same form as hardware maps.  These pseudo-maps are used to store
  * intermediate results from copy operations to and from address spaces.
  *
  * Since the information managed by this module is also stored by the
  * logical address mapping module, this module may throw away valid virtual
  * to physical mappings at almost any time.  However, invalidations of
  * mappings must be done as requested.
  *
  * In order to cope with hardware architectures which make virtual to
  * physical map invalidates expensive, this module may delay invalidate
  * reduced protection operations until such time as they are actually
  * necessary.  This module is given full information as to which processors
  * are currently using which maps, and to when physical maps must be made
  * correct.
  */
 
 #include "opt_kstack_pages.h"
 
 #include <sys/param.h>
 #include <sys/kernel.h>
+#include <sys/queue.h>
+#include <sys/cpuset.h>
 #include <sys/ktr.h>
 #include <sys/lock.h>
 #include <sys/msgbuf.h>
 #include <sys/mutex.h>
 #include <sys/proc.h>
+#include <sys/sched.h>
 #include <sys/sysctl.h>
 #include <sys/systm.h>
 #include <sys/vmmeter.h>
 
 #include <dev/ofw/openfirm.h>
 
 #include <vm/vm.h>
 #include <vm/vm_param.h>
 #include <vm/vm_kern.h>
 #include <vm/vm_page.h>
 #include <vm/vm_map.h>
 #include <vm/vm_object.h>
 #include <vm/vm_extern.h>
 #include <vm/vm_pageout.h>
 #include <vm/vm_pager.h>
 #include <vm/uma.h>
 
 #include <machine/cpu.h>
 #include <machine/platform.h>
 #include <machine/bat.h>
 #include <machine/frame.h>
 #include <machine/md_var.h>
 #include <machine/psl.h>
 #include <machine/pte.h>
 #include <machine/smp.h>
 #include <machine/sr.h>
 #include <machine/mmuvar.h>
 
 #include "mmu_if.h"
 
 #define	MOEA_DEBUG
 
 #define TODO	panic("%s: not implemented", __func__);
 
 #define	VSID_MAKE(sr, hash)	((sr) | (((hash) & 0xfffff) << 4))
 #define	VSID_TO_SR(vsid)	((vsid) & 0xf)
 #define	VSID_TO_HASH(vsid)	(((vsid) >> 4) & 0xfffff)
 
 struct ofw_map {
 	vm_offset_t	om_va;
 	vm_size_t	om_len;
 	vm_offset_t	om_pa;
 	u_int		om_mode;
 };
 
 /*
  * Map of physical memory regions.
  */
 static struct	mem_region *regions;
 static struct	mem_region *pregions;
 static u_int    phys_avail_count;
 static int	regions_sz, pregions_sz;
 static struct	ofw_map *translations;
 
 /*
  * Lock for the pteg and pvo tables.
  */
 struct mtx	moea_table_mutex;
 struct mtx	moea_vsid_mutex;
 
 /* tlbie instruction synchronization */
 static struct mtx tlbie_mtx;
 
 /*
  * PTEG data.
  */
 static struct	pteg *moea_pteg_table;
 u_int		moea_pteg_count;
 u_int		moea_pteg_mask;
 
 /*
  * PVO data.
  */
 struct	pvo_head *moea_pvo_table;		/* pvo entries by pteg index */
 struct	pvo_head moea_pvo_kunmanaged =
     LIST_HEAD_INITIALIZER(moea_pvo_kunmanaged);	/* list of unmanaged pages */
 
 uma_zone_t	moea_upvo_zone;	/* zone for pvo entries for unmanaged pages */
 uma_zone_t	moea_mpvo_zone;	/* zone for pvo entries for managed pages */
 
 #define	BPVO_POOL_SIZE	32768
 static struct	pvo_entry *moea_bpvo_pool;
 static int	moea_bpvo_pool_index = 0;
 
 #define	VSID_NBPW	(sizeof(u_int32_t) * 8)
 static u_int	moea_vsid_bitmap[NPMAPS / VSID_NBPW];
 
 static boolean_t moea_initialized = FALSE;
 
 /*
  * Statistics.
  */
 u_int	moea_pte_valid = 0;
 u_int	moea_pte_overflow = 0;
 u_int	moea_pte_replacements = 0;
 u_int	moea_pvo_entries = 0;
 u_int	moea_pvo_enter_calls = 0;
 u_int	moea_pvo_remove_calls = 0;
 u_int	moea_pte_spills = 0;
 SYSCTL_INT(_machdep, OID_AUTO, moea_pte_valid, CTLFLAG_RD, &moea_pte_valid,
     0, "");
 SYSCTL_INT(_machdep, OID_AUTO, moea_pte_overflow, CTLFLAG_RD,
     &moea_pte_overflow, 0, "");
 SYSCTL_INT(_machdep, OID_AUTO, moea_pte_replacements, CTLFLAG_RD,
     &moea_pte_replacements, 0, "");
 SYSCTL_INT(_machdep, OID_AUTO, moea_pvo_entries, CTLFLAG_RD, &moea_pvo_entries,
     0, "");
 SYSCTL_INT(_machdep, OID_AUTO, moea_pvo_enter_calls, CTLFLAG_RD,
     &moea_pvo_enter_calls, 0, "");
 SYSCTL_INT(_machdep, OID_AUTO, moea_pvo_remove_calls, CTLFLAG_RD,
     &moea_pvo_remove_calls, 0, "");
 SYSCTL_INT(_machdep, OID_AUTO, moea_pte_spills, CTLFLAG_RD,
     &moea_pte_spills, 0, "");
 
 /*
  * Allocate physical memory for use in moea_bootstrap.
  */
 static vm_offset_t	moea_bootstrap_alloc(vm_size_t, u_int);
 
 /*
  * PTE calls.
  */
 static int		moea_pte_insert(u_int, struct pte *);
 
 /*
  * PVO calls.
  */
 static int	moea_pvo_enter(pmap_t, uma_zone_t, struct pvo_head *,
 		    vm_offset_t, vm_offset_t, u_int, int);
 static void	moea_pvo_remove(struct pvo_entry *, int);
 static struct	pvo_entry *moea_pvo_find_va(pmap_t, vm_offset_t, int *);
 static struct	pte *moea_pvo_to_pte(const struct pvo_entry *, int);
 
 /*
  * Utility routines.
  */
 static void		moea_enter_locked(pmap_t, vm_offset_t, vm_page_t,
 			    vm_prot_t, boolean_t);
 static void		moea_syncicache(vm_offset_t, vm_size_t);
 static boolean_t	moea_query_bit(vm_page_t, int);
 static u_int		moea_clear_bit(vm_page_t, int);
 static void		moea_kremove(mmu_t, vm_offset_t);
 int		moea_pte_spill(vm_offset_t);
 
 /*
  * Kernel MMU interface
  */
 void moea_change_wiring(mmu_t, pmap_t, vm_offset_t, boolean_t);
 void moea_clear_modify(mmu_t, vm_page_t);
 void moea_clear_reference(mmu_t, vm_page_t);
 void moea_copy_page(mmu_t, vm_page_t, vm_page_t);
 void moea_enter(mmu_t, pmap_t, vm_offset_t, vm_page_t, vm_prot_t, boolean_t);
 void moea_enter_object(mmu_t, pmap_t, vm_offset_t, vm_offset_t, vm_page_t,
     vm_prot_t);
 void moea_enter_quick(mmu_t, pmap_t, vm_offset_t, vm_page_t, vm_prot_t);
 vm_paddr_t moea_extract(mmu_t, pmap_t, vm_offset_t);
 vm_page_t moea_extract_and_hold(mmu_t, pmap_t, vm_offset_t, vm_prot_t);
 void moea_init(mmu_t);
 boolean_t moea_is_modified(mmu_t, vm_page_t);
 boolean_t moea_is_prefaultable(mmu_t, pmap_t, vm_offset_t);
 boolean_t moea_is_referenced(mmu_t, vm_page_t);
 boolean_t moea_ts_referenced(mmu_t, vm_page_t);
 vm_offset_t moea_map(mmu_t, vm_offset_t *, vm_offset_t, vm_offset_t, int);
 boolean_t moea_page_exists_quick(mmu_t, pmap_t, vm_page_t);
 int moea_page_wired_mappings(mmu_t, vm_page_t);
 void moea_pinit(mmu_t, pmap_t);
 void moea_pinit0(mmu_t, pmap_t);
 void moea_protect(mmu_t, pmap_t, vm_offset_t, vm_offset_t, vm_prot_t);
 void moea_qenter(mmu_t, vm_offset_t, vm_page_t *, int);
 void moea_qremove(mmu_t, vm_offset_t, int);
 void moea_release(mmu_t, pmap_t);
 void moea_remove(mmu_t, pmap_t, vm_offset_t, vm_offset_t);
 void moea_remove_all(mmu_t, vm_page_t);
 void moea_remove_write(mmu_t, vm_page_t);
 void moea_zero_page(mmu_t, vm_page_t);
 void moea_zero_page_area(mmu_t, vm_page_t, int, int);
 void moea_zero_page_idle(mmu_t, vm_page_t);
 void moea_activate(mmu_t, struct thread *);
 void moea_deactivate(mmu_t, struct thread *);
 void moea_cpu_bootstrap(mmu_t, int);
 void moea_bootstrap(mmu_t, vm_offset_t, vm_offset_t);
 void *moea_mapdev(mmu_t, vm_offset_t, vm_size_t);
 void *moea_mapdev_attr(mmu_t, vm_offset_t, vm_size_t, vm_memattr_t);
 void moea_unmapdev(mmu_t, vm_offset_t, vm_size_t);
 vm_offset_t moea_kextract(mmu_t, vm_offset_t);
 void moea_kenter_attr(mmu_t, vm_offset_t, vm_offset_t, vm_memattr_t);
 void moea_kenter(mmu_t, vm_offset_t, vm_offset_t);
 void moea_page_set_memattr(mmu_t mmu, vm_page_t m, vm_memattr_t ma);
 boolean_t moea_dev_direct_mapped(mmu_t, vm_offset_t, vm_size_t);
 static void moea_sync_icache(mmu_t, pmap_t, vm_offset_t, vm_size_t);
 
 static mmu_method_t moea_methods[] = {
 	MMUMETHOD(mmu_change_wiring,	moea_change_wiring),
 	MMUMETHOD(mmu_clear_modify,	moea_clear_modify),
 	MMUMETHOD(mmu_clear_reference,	moea_clear_reference),
 	MMUMETHOD(mmu_copy_page,	moea_copy_page),
 	MMUMETHOD(mmu_enter,		moea_enter),
 	MMUMETHOD(mmu_enter_object,	moea_enter_object),
 	MMUMETHOD(mmu_enter_quick,	moea_enter_quick),
 	MMUMETHOD(mmu_extract,		moea_extract),
 	MMUMETHOD(mmu_extract_and_hold,	moea_extract_and_hold),
 	MMUMETHOD(mmu_init,		moea_init),
 	MMUMETHOD(mmu_is_modified,	moea_is_modified),
 	MMUMETHOD(mmu_is_prefaultable,	moea_is_prefaultable),
 	MMUMETHOD(mmu_is_referenced,	moea_is_referenced),
 	MMUMETHOD(mmu_ts_referenced,	moea_ts_referenced),
 	MMUMETHOD(mmu_map,     		moea_map),
 	MMUMETHOD(mmu_page_exists_quick,moea_page_exists_quick),
 	MMUMETHOD(mmu_page_wired_mappings,moea_page_wired_mappings),
 	MMUMETHOD(mmu_pinit,		moea_pinit),
 	MMUMETHOD(mmu_pinit0,		moea_pinit0),
 	MMUMETHOD(mmu_protect,		moea_protect),
 	MMUMETHOD(mmu_qenter,		moea_qenter),
 	MMUMETHOD(mmu_qremove,		moea_qremove),
 	MMUMETHOD(mmu_release,		moea_release),
 	MMUMETHOD(mmu_remove,		moea_remove),
 	MMUMETHOD(mmu_remove_all,      	moea_remove_all),
 	MMUMETHOD(mmu_remove_write,	moea_remove_write),
 	MMUMETHOD(mmu_sync_icache,	moea_sync_icache),
 	MMUMETHOD(mmu_zero_page,       	moea_zero_page),
 	MMUMETHOD(mmu_zero_page_area,	moea_zero_page_area),
 	MMUMETHOD(mmu_zero_page_idle,	moea_zero_page_idle),
 	MMUMETHOD(mmu_activate,		moea_activate),
 	MMUMETHOD(mmu_deactivate,      	moea_deactivate),
 	MMUMETHOD(mmu_page_set_memattr,	moea_page_set_memattr),
 
 	/* Internal interfaces */
 	MMUMETHOD(mmu_bootstrap,       	moea_bootstrap),
 	MMUMETHOD(mmu_cpu_bootstrap,   	moea_cpu_bootstrap),
 	MMUMETHOD(mmu_mapdev_attr,	moea_mapdev_attr),
 	MMUMETHOD(mmu_mapdev,		moea_mapdev),
 	MMUMETHOD(mmu_unmapdev,		moea_unmapdev),
 	MMUMETHOD(mmu_kextract,		moea_kextract),
 	MMUMETHOD(mmu_kenter,		moea_kenter),
 	MMUMETHOD(mmu_kenter_attr,	moea_kenter_attr),
 	MMUMETHOD(mmu_dev_direct_mapped,moea_dev_direct_mapped),
 
 	{ 0, 0 }
 };
 
 MMU_DEF(oea_mmu, MMU_TYPE_OEA, moea_methods, 0);
 
 static __inline uint32_t
 moea_calc_wimg(vm_offset_t pa, vm_memattr_t ma)
 {
 	uint32_t pte_lo;
 	int i;
 
 	if (ma != VM_MEMATTR_DEFAULT) {
 		switch (ma) {
 		case VM_MEMATTR_UNCACHEABLE:
 			return (PTE_I | PTE_G);
 		case VM_MEMATTR_WRITE_COMBINING:
 		case VM_MEMATTR_WRITE_BACK:
 		case VM_MEMATTR_PREFETCHABLE:
 			return (PTE_I);
 		case VM_MEMATTR_WRITE_THROUGH:
 			return (PTE_W | PTE_M);
 		}
 	}
 
 	/*
 	 * Assume the page is cache inhibited and access is guarded unless
 	 * it's in our available memory array.
 	 */
 	pte_lo = PTE_I | PTE_G;
 	for (i = 0; i < pregions_sz; i++) {
 		if ((pa >= pregions[i].mr_start) &&
 		    (pa < (pregions[i].mr_start + pregions[i].mr_size))) {
 			pte_lo = PTE_M;
 			break;
 		}
 	}
 
 	return pte_lo;
 }
 
 static void
 tlbie(vm_offset_t va)
 {
 
 	mtx_lock_spin(&tlbie_mtx);
 	__asm __volatile("ptesync");
 	__asm __volatile("tlbie %0" :: "r"(va));
 	__asm __volatile("eieio; tlbsync; ptesync");
 	mtx_unlock_spin(&tlbie_mtx);
 }
 
 static void
 tlbia(void)
 {
 	vm_offset_t va;
  
 	for (va = 0; va < 0x00040000; va += 0x00001000) {
 		__asm __volatile("tlbie %0" :: "r"(va));
 		powerpc_sync();
 	}
 	__asm __volatile("tlbsync");
 	powerpc_sync();
 }
 
 static __inline int
 va_to_sr(u_int *sr, vm_offset_t va)
 {
 	return (sr[(uintptr_t)va >> ADDR_SR_SHFT]);
 }
 
 static __inline u_int
 va_to_pteg(u_int sr, vm_offset_t addr)
 {
 	u_int hash;
 
 	hash = (sr & SR_VSID_MASK) ^ (((u_int)addr & ADDR_PIDX) >>
 	    ADDR_PIDX_SHFT);
 	return (hash & moea_pteg_mask);
 }
 
 static __inline struct pvo_head *
 vm_page_to_pvoh(vm_page_t m)
 {
 
 	return (&m->md.mdpg_pvoh);
 }
 
 static __inline void
 moea_attr_clear(vm_page_t m, int ptebit)
 {
 
 	mtx_assert(&vm_page_queue_mtx, MA_OWNED);
 	m->md.mdpg_attrs &= ~ptebit;
 }
 
 static __inline int
 moea_attr_fetch(vm_page_t m)
 {
 
 	return (m->md.mdpg_attrs);
 }
 
 static __inline void
 moea_attr_save(vm_page_t m, int ptebit)
 {
 
 	mtx_assert(&vm_page_queue_mtx, MA_OWNED);
 	m->md.mdpg_attrs |= ptebit;
 }
 
 static __inline int
 moea_pte_compare(const struct pte *pt, const struct pte *pvo_pt)
 {
 	if (pt->pte_hi == pvo_pt->pte_hi)
 		return (1);
 
 	return (0);
 }
 
 static __inline int
 moea_pte_match(struct pte *pt, u_int sr, vm_offset_t va, int which)
 {
 	return (pt->pte_hi & ~PTE_VALID) ==
 	    (((sr & SR_VSID_MASK) << PTE_VSID_SHFT) |
 	    ((va >> ADDR_API_SHFT) & PTE_API) | which);
 }
 
 static __inline void
 moea_pte_create(struct pte *pt, u_int sr, vm_offset_t va, u_int pte_lo)
 {
 
 	mtx_assert(&moea_table_mutex, MA_OWNED);
 
 	/*
 	 * Construct a PTE.  Default to IMB initially.  Valid bit only gets
 	 * set when the real pte is set in memory.
 	 *
 	 * Note: Don't set the valid bit for correct operation of tlb update.
 	 */
 	pt->pte_hi = ((sr & SR_VSID_MASK) << PTE_VSID_SHFT) |
 	    (((va & ADDR_PIDX) >> ADDR_API_SHFT) & PTE_API);
 	pt->pte_lo = pte_lo;
 }
 
 static __inline void
 moea_pte_synch(struct pte *pt, struct pte *pvo_pt)
 {
 
 	mtx_assert(&moea_table_mutex, MA_OWNED);
 	pvo_pt->pte_lo |= pt->pte_lo & (PTE_REF | PTE_CHG);
 }
 
 static __inline void
 moea_pte_clear(struct pte *pt, vm_offset_t va, int ptebit)
 {
 
 	mtx_assert(&moea_table_mutex, MA_OWNED);
 
 	/*
 	 * As shown in Section 7.6.3.2.3
 	 */
 	pt->pte_lo &= ~ptebit;
 	tlbie(va);
 }
 
 static __inline void
 moea_pte_set(struct pte *pt, struct pte *pvo_pt)
 {
 
 	mtx_assert(&moea_table_mutex, MA_OWNED);
 	pvo_pt->pte_hi |= PTE_VALID;
 
 	/*
 	 * Update the PTE as defined in section 7.6.3.1.
 	 * Note that the REF/CHG bits are from pvo_pt and thus should have
 	 * been saved so this routine can restore them (if desired).
 	 */
 	pt->pte_lo = pvo_pt->pte_lo;
 	powerpc_sync();
 	pt->pte_hi = pvo_pt->pte_hi;
 	powerpc_sync();
 	moea_pte_valid++;
 }
 
 static __inline void
 moea_pte_unset(struct pte *pt, struct pte *pvo_pt, vm_offset_t va)
 {
 
 	mtx_assert(&moea_table_mutex, MA_OWNED);
 	pvo_pt->pte_hi &= ~PTE_VALID;
 
 	/*
 	 * Force the ref & chg bits back into the PTEs.
 	 */
 	powerpc_sync();
 
 	/*
 	 * Invalidate the pte.
 	 */
 	pt->pte_hi &= ~PTE_VALID;
 
 	tlbie(va);
 
 	/*
 	 * Save the ref & chg bits.
 	 */
 	moea_pte_synch(pt, pvo_pt);
 	moea_pte_valid--;
 }
 
 static __inline void
 moea_pte_change(struct pte *pt, struct pte *pvo_pt, vm_offset_t va)
 {
 
 	/*
 	 * Invalidate the PTE
 	 */
 	moea_pte_unset(pt, pvo_pt, va);
 	moea_pte_set(pt, pvo_pt);
 }
 
 /*
  * Quick sort callout for comparing memory regions.
  */
 static int	om_cmp(const void *a, const void *b);
 
 static int
 om_cmp(const void *a, const void *b)
 {
 	const struct	ofw_map *mapa;
 	const struct	ofw_map *mapb;
 
 	mapa = a;
 	mapb = b;
 	if (mapa->om_pa < mapb->om_pa)
 		return (-1);
 	else if (mapa->om_pa > mapb->om_pa)
 		return (1);
 	else
 		return (0);
 }
 
 void
 moea_cpu_bootstrap(mmu_t mmup, int ap)
 {
 	u_int sdr;
 	int i;
 
 	if (ap) {
 		powerpc_sync();
 		__asm __volatile("mtdbatu 0,%0" :: "r"(battable[0].batu));
 		__asm __volatile("mtdbatl 0,%0" :: "r"(battable[0].batl));
 		isync();
 		__asm __volatile("mtibatu 0,%0" :: "r"(battable[0].batu));
 		__asm __volatile("mtibatl 0,%0" :: "r"(battable[0].batl));
 		isync();
 	}
 
 	__asm __volatile("mtdbatu 1,%0" :: "r"(battable[8].batu));
 	__asm __volatile("mtdbatl 1,%0" :: "r"(battable[8].batl));
 	isync();
 
 	__asm __volatile("mtibatu 1,%0" :: "r"(0));
 	__asm __volatile("mtdbatu 2,%0" :: "r"(0));
 	__asm __volatile("mtibatu 2,%0" :: "r"(0));
 	__asm __volatile("mtdbatu 3,%0" :: "r"(0));
 	__asm __volatile("mtibatu 3,%0" :: "r"(0));
 	isync();
 
 	for (i = 0; i < 16; i++)
 		mtsrin(i << ADDR_SR_SHFT, kernel_pmap->pm_sr[i]);
 	powerpc_sync();
 
 	sdr = (u_int)moea_pteg_table | (moea_pteg_mask >> 10);
 	__asm __volatile("mtsdr1 %0" :: "r"(sdr));
 	isync();
 
 	tlbia();
 }
 
 void
 moea_bootstrap(mmu_t mmup, vm_offset_t kernelstart, vm_offset_t kernelend)
 {
 	ihandle_t	mmui;
 	phandle_t	chosen, mmu;
 	int		sz;
 	int		i, j;
 	vm_size_t	size, physsz, hwphyssz;
 	vm_offset_t	pa, va, off;
 	void		*dpcpu;
 	register_t	msr;
 
         /*
          * Set up BAT0 to map the lowest 256 MB area
          */
         battable[0x0].batl = BATL(0x00000000, BAT_M, BAT_PP_RW);
         battable[0x0].batu = BATU(0x00000000, BAT_BL_256M, BAT_Vs);
 
         /*
          * Map PCI memory space.
          */
         battable[0x8].batl = BATL(0x80000000, BAT_I|BAT_G, BAT_PP_RW);
         battable[0x8].batu = BATU(0x80000000, BAT_BL_256M, BAT_Vs);
 
         battable[0x9].batl = BATL(0x90000000, BAT_I|BAT_G, BAT_PP_RW);
         battable[0x9].batu = BATU(0x90000000, BAT_BL_256M, BAT_Vs);
 
         battable[0xa].batl = BATL(0xa0000000, BAT_I|BAT_G, BAT_PP_RW);
         battable[0xa].batu = BATU(0xa0000000, BAT_BL_256M, BAT_Vs);
 
         battable[0xb].batl = BATL(0xb0000000, BAT_I|BAT_G, BAT_PP_RW);
         battable[0xb].batu = BATU(0xb0000000, BAT_BL_256M, BAT_Vs);
 
         /*
          * Map obio devices.
          */
         battable[0xf].batl = BATL(0xf0000000, BAT_I|BAT_G, BAT_PP_RW);
         battable[0xf].batu = BATU(0xf0000000, BAT_BL_256M, BAT_Vs);
 
 	/*
 	 * Use an IBAT and a DBAT to map the bottom segment of memory
 	 * where we are. Turn off instruction relocation temporarily
 	 * to prevent faults while reprogramming the IBAT.
 	 */
 	msr = mfmsr();
 	mtmsr(msr & ~PSL_IR);
 	__asm (".balign 32; \n"
 	       "mtibatu 0,%0; mtibatl 0,%1; isync; \n"
 	       "mtdbatu 0,%0; mtdbatl 0,%1; isync"
 	    :: "r"(battable[0].batu), "r"(battable[0].batl));
 	mtmsr(msr);
 
 	/* map pci space */
 	__asm __volatile("mtdbatu 1,%0" :: "r"(battable[8].batu));
 	__asm __volatile("mtdbatl 1,%0" :: "r"(battable[8].batl));
 	isync();
 
 	/* set global direct map flag */
 	hw_direct_map = 1;
 
 	mem_regions(&pregions, &pregions_sz, &regions, &regions_sz);
 	CTR0(KTR_PMAP, "moea_bootstrap: physical memory");
 
 	for (i = 0; i < pregions_sz; i++) {
 		vm_offset_t pa;
 		vm_offset_t end;
 
 		CTR3(KTR_PMAP, "physregion: %#x - %#x (%#x)",
 			pregions[i].mr_start,
 			pregions[i].mr_start + pregions[i].mr_size,
 			pregions[i].mr_size);
 		/*
 		 * Install entries into the BAT table to allow all
 		 * of physmem to be covered by on-demand BAT entries.
 		 * The loop will sometimes set the same battable element
 		 * twice, but that's fine since they won't be used for
 		 * a while yet.
 		 */
 		pa = pregions[i].mr_start & 0xf0000000;
 		end = pregions[i].mr_start + pregions[i].mr_size;
 		do {
                         u_int n = pa >> ADDR_SR_SHFT;
 
 			battable[n].batl = BATL(pa, BAT_M, BAT_PP_RW);
 			battable[n].batu = BATU(pa, BAT_BL_256M, BAT_Vs);
 			pa += SEGMENT_LENGTH;
 		} while (pa < end);
 	}
 
 	if (sizeof(phys_avail)/sizeof(phys_avail[0]) < regions_sz)
 		panic("moea_bootstrap: phys_avail too small");
 
 	phys_avail_count = 0;
 	physsz = 0;
 	hwphyssz = 0;
 	TUNABLE_ULONG_FETCH("hw.physmem", (u_long *) &hwphyssz);
 	for (i = 0, j = 0; i < regions_sz; i++, j += 2) {
 		CTR3(KTR_PMAP, "region: %#x - %#x (%#x)", regions[i].mr_start,
 		    regions[i].mr_start + regions[i].mr_size,
 		    regions[i].mr_size);
 		if (hwphyssz != 0 &&
 		    (physsz + regions[i].mr_size) >= hwphyssz) {
 			if (physsz < hwphyssz) {
 				phys_avail[j] = regions[i].mr_start;
 				phys_avail[j + 1] = regions[i].mr_start +
 				    hwphyssz - physsz;
 				physsz = hwphyssz;
 				phys_avail_count++;
 			}
 			break;
 		}
 		phys_avail[j] = regions[i].mr_start;
 		phys_avail[j + 1] = regions[i].mr_start + regions[i].mr_size;
 		phys_avail_count++;
 		physsz += regions[i].mr_size;
 	}
 	physmem = btoc(physsz);
 
 	/*
 	 * Allocate PTEG table.
 	 */
 #ifdef PTEGCOUNT
 	moea_pteg_count = PTEGCOUNT;
 #else
 	moea_pteg_count = 0x1000;
 
 	while (moea_pteg_count < physmem)
 		moea_pteg_count <<= 1;
 
 	moea_pteg_count >>= 1;
 #endif /* PTEGCOUNT */
 
 	size = moea_pteg_count * sizeof(struct pteg);
 	CTR2(KTR_PMAP, "moea_bootstrap: %d PTEGs, %d bytes", moea_pteg_count,
 	    size);
 	moea_pteg_table = (struct pteg *)moea_bootstrap_alloc(size, size);
 	CTR1(KTR_PMAP, "moea_bootstrap: PTEG table at %p", moea_pteg_table);
 	bzero((void *)moea_pteg_table, moea_pteg_count * sizeof(struct pteg));
 	moea_pteg_mask = moea_pteg_count - 1;
 
 	/*
 	 * Allocate pv/overflow lists.
 	 */
 	size = sizeof(struct pvo_head) * moea_pteg_count;
 	moea_pvo_table = (struct pvo_head *)moea_bootstrap_alloc(size,
 	    PAGE_SIZE);
 	CTR1(KTR_PMAP, "moea_bootstrap: PVO table at %p", moea_pvo_table);
 	for (i = 0; i < moea_pteg_count; i++)
 		LIST_INIT(&moea_pvo_table[i]);
 
 	/*
 	 * Initialize the lock that synchronizes access to the pteg and pvo
 	 * tables.
 	 */
 	mtx_init(&moea_table_mutex, "pmap table", NULL, MTX_DEF |
 	    MTX_RECURSE);
 	mtx_init(&moea_vsid_mutex, "VSID table", NULL, MTX_DEF);
 
 	mtx_init(&tlbie_mtx, "tlbie", NULL, MTX_SPIN);
 
 	/*
 	 * Initialise the unmanaged pvo pool.
 	 */
 	moea_bpvo_pool = (struct pvo_entry *)moea_bootstrap_alloc(
 		BPVO_POOL_SIZE*sizeof(struct pvo_entry), 0);
 	moea_bpvo_pool_index = 0;
 
 	/*
 	 * Make sure kernel vsid is allocated as well as VSID 0.
 	 */
 	moea_vsid_bitmap[(KERNEL_VSIDBITS & (NPMAPS - 1)) / VSID_NBPW]
 		|= 1 << (KERNEL_VSIDBITS % VSID_NBPW);
 	moea_vsid_bitmap[0] |= 1;
 
 	/*
 	 * Initialize the kernel pmap (which is statically allocated).
 	 */
 	PMAP_LOCK_INIT(kernel_pmap);
 	for (i = 0; i < 16; i++)
 		kernel_pmap->pm_sr[i] = EMPTY_SEGMENT + i;
-	kernel_pmap->pm_active = ~0;
+	CPU_FILL(&kernel_pmap->pm_active);
 
 	/*
 	 * Set up the Open Firmware mappings
 	 */
 	if ((chosen = OF_finddevice("/chosen")) == -1)
 		panic("moea_bootstrap: can't find /chosen");
 	OF_getprop(chosen, "mmu", &mmui, 4);
 	if ((mmu = OF_instance_to_package(mmui)) == -1)
 		panic("moea_bootstrap: can't get mmu package");
 	if ((sz = OF_getproplen(mmu, "translations")) == -1)
 		panic("moea_bootstrap: can't get ofw translation count");
 	translations = NULL;
 	for (i = 0; phys_avail[i] != 0; i += 2) {
 		if (phys_avail[i + 1] >= sz) {
 			translations = (struct ofw_map *)phys_avail[i];
 			break;
 		}
 	}
 	if (translations == NULL)
 		panic("moea_bootstrap: no space to copy translations");
 	bzero(translations, sz);
 	if (OF_getprop(mmu, "translations", translations, sz) == -1)
 		panic("moea_bootstrap: can't get ofw translations");
 	CTR0(KTR_PMAP, "moea_bootstrap: translations");
 	sz /= sizeof(*translations);
 	qsort(translations, sz, sizeof (*translations), om_cmp);
 	for (i = 0; i < sz; i++) {
 		CTR3(KTR_PMAP, "translation: pa=%#x va=%#x len=%#x",
 		    translations[i].om_pa, translations[i].om_va,
 		    translations[i].om_len);
 
 		/*
 		 * If the mapping is 1:1, let the RAM and device on-demand
 		 * BAT tables take care of the translation.
 		 */
 		if (translations[i].om_va == translations[i].om_pa)
 			continue;
 
 		/* Enter the pages */
 		for (off = 0; off < translations[i].om_len; off += PAGE_SIZE)
 			moea_kenter(mmup, translations[i].om_va + off, 
 				    translations[i].om_pa + off);
 	}
 
 	/*
 	 * Calculate the last available physical address.
 	 */
 	for (i = 0; phys_avail[i + 2] != 0; i += 2)
 		;
 	Maxmem = powerpc_btop(phys_avail[i + 1]);
 
 	moea_cpu_bootstrap(mmup,0);
 
 	pmap_bootstrapped++;
 
 	/*
 	 * Set the start and end of kva.
 	 */
 	virtual_avail = VM_MIN_KERNEL_ADDRESS;
 	virtual_end = VM_MAX_SAFE_KERNEL_ADDRESS;
 
 	/*
 	 * Allocate a kernel stack with a guard page for thread0 and map it
 	 * into the kernel page map.
 	 */
 	pa = moea_bootstrap_alloc(KSTACK_PAGES * PAGE_SIZE, PAGE_SIZE);
 	va = virtual_avail + KSTACK_GUARD_PAGES * PAGE_SIZE;
 	virtual_avail = va + KSTACK_PAGES * PAGE_SIZE;
 	CTR2(KTR_PMAP, "moea_bootstrap: kstack0 at %#x (%#x)", pa, va);
 	thread0.td_kstack = va;
 	thread0.td_kstack_pages = KSTACK_PAGES;
 	for (i = 0; i < KSTACK_PAGES; i++) {
 		moea_kenter(mmup, va, pa);
 		pa += PAGE_SIZE;
 		va += PAGE_SIZE;
 	}
 
 	/*
 	 * Allocate virtual address space for the message buffer.
 	 */
 	pa = msgbuf_phys = moea_bootstrap_alloc(msgbufsize, PAGE_SIZE);
 	msgbufp = (struct msgbuf *)virtual_avail;
 	va = virtual_avail;
 	virtual_avail += round_page(msgbufsize);
 	while (va < virtual_avail) {
 		moea_kenter(mmup, va, pa);
 		pa += PAGE_SIZE;
 		va += PAGE_SIZE;
 	}
 
 	/*
 	 * Allocate virtual address space for the dynamic percpu area.
 	 */
 	pa = moea_bootstrap_alloc(DPCPU_SIZE, PAGE_SIZE);
 	dpcpu = (void *)virtual_avail;
 	va = virtual_avail;
 	virtual_avail += DPCPU_SIZE;
 	while (va < virtual_avail) {
 		moea_kenter(mmup, va, pa);
 		pa += PAGE_SIZE;
 		va += PAGE_SIZE;
 	}
 	dpcpu_init(dpcpu, 0);
 }
 
 /*
  * Activate a user pmap.  The pmap must be activated before its address
  * space can be accessed in any way.
  */
 void
 moea_activate(mmu_t mmu, struct thread *td)
 {
 	pmap_t	pm, pmr;
 
 	/*
 	 * Load all the data we need up front to encourage the compiler to
 	 * not issue any loads while we have interrupts disabled below.
 	 */
 	pm = &td->td_proc->p_vmspace->vm_pmap;
 	pmr = pm->pmap_phys;
 
-	pm->pm_active |= PCPU_GET(cpumask);
+	sched_pin();
+	CPU_OR(&pm->pm_active, PCPU_PTR(cpumask));
+	sched_unpin();
 	PCPU_SET(curpmap, pmr);
 }
 
 void
 moea_deactivate(mmu_t mmu, struct thread *td)
 {
 	pmap_t	pm;
 
 	pm = &td->td_proc->p_vmspace->vm_pmap;
-	pm->pm_active &= ~PCPU_GET(cpumask);
+	sched_pin();
+	CPU_NAND(&pm->pm_active, PCPU_PTR(cpumask));
+	sched_unpin();
 	PCPU_SET(curpmap, NULL);
 }
 
 void
 moea_change_wiring(mmu_t mmu, pmap_t pm, vm_offset_t va, boolean_t wired)
 {
 	struct	pvo_entry *pvo;
 
 	PMAP_LOCK(pm);
 	pvo = moea_pvo_find_va(pm, va & ~ADDR_POFF, NULL);
 
 	if (pvo != NULL) {
 		if (wired) {
 			if ((pvo->pvo_vaddr & PVO_WIRED) == 0)
 				pm->pm_stats.wired_count++;
 			pvo->pvo_vaddr |= PVO_WIRED;
 		} else {
 			if ((pvo->pvo_vaddr & PVO_WIRED) != 0)
 				pm->pm_stats.wired_count--;
 			pvo->pvo_vaddr &= ~PVO_WIRED;
 		}
 	}
 	PMAP_UNLOCK(pm);
 }
 
 void
 moea_copy_page(mmu_t mmu, vm_page_t msrc, vm_page_t mdst)
 {
 	vm_offset_t	dst;
 	vm_offset_t	src;
 
 	dst = VM_PAGE_TO_PHYS(mdst);
 	src = VM_PAGE_TO_PHYS(msrc);
 
 	kcopy((void *)src, (void *)dst, PAGE_SIZE);
 }
 
 /*
  * Zero a page of physical memory by temporarily mapping it into the tlb.
  */
 void
 moea_zero_page(mmu_t mmu, vm_page_t m)
 {
 	vm_offset_t pa = VM_PAGE_TO_PHYS(m);
 	void *va = (void *)pa;
 
 	bzero(va, PAGE_SIZE);
 }
 
 void
 moea_zero_page_area(mmu_t mmu, vm_page_t m, int off, int size)
 {
 	vm_offset_t pa = VM_PAGE_TO_PHYS(m);
 	void *va = (void *)(pa + off);
 
 	bzero(va, size);
 }
 
 void
 moea_zero_page_idle(mmu_t mmu, vm_page_t m)
 {
 	vm_offset_t pa = VM_PAGE_TO_PHYS(m);
 	void *va = (void *)pa;
 
 	bzero(va, PAGE_SIZE);
 }
 
 /*
  * Map the given physical page at the specified virtual address in the
  * target pmap with the protection requested.  If specified the page
  * will be wired down.
  */
 void
 moea_enter(mmu_t mmu, pmap_t pmap, vm_offset_t va, vm_page_t m, vm_prot_t prot,
 	   boolean_t wired)
 {
 
 	vm_page_lock_queues();
 	PMAP_LOCK(pmap);
 	moea_enter_locked(pmap, va, m, prot, wired);
 	vm_page_unlock_queues();
 	PMAP_UNLOCK(pmap);
 }
 
 /*
  * Map the given physical page at the specified virtual address in the
  * target pmap with the protection requested.  If specified the page
  * will be wired down.
  *
  * The page queues and pmap must be locked.
  */
 static void
 moea_enter_locked(pmap_t pmap, vm_offset_t va, vm_page_t m, vm_prot_t prot,
     boolean_t wired)
 {
 	struct		pvo_head *pvo_head;
 	uma_zone_t	zone;
 	vm_page_t	pg;
 	u_int		pte_lo, pvo_flags, was_exec;
 	int		error;
 
 	if (!moea_initialized) {
 		pvo_head = &moea_pvo_kunmanaged;
 		zone = moea_upvo_zone;
 		pvo_flags = 0;
 		pg = NULL;
 		was_exec = PTE_EXEC;
 	} else {
 		pvo_head = vm_page_to_pvoh(m);
 		pg = m;
 		zone = moea_mpvo_zone;
 		pvo_flags = PVO_MANAGED;
 		was_exec = 0;
 	}
 	if (pmap_bootstrapped)
 		mtx_assert(&vm_page_queue_mtx, MA_OWNED);
 	PMAP_LOCK_ASSERT(pmap, MA_OWNED);
 	KASSERT((m->flags & (PG_FICTITIOUS | PG_UNMANAGED)) != 0 ||
 	    (m->oflags & VPO_BUSY) != 0 || VM_OBJECT_LOCKED(m->object),
 	    ("moea_enter_locked: page %p is not busy", m));
 
 	/* XXX change the pvo head for fake pages */
 	if ((m->flags & PG_FICTITIOUS) == PG_FICTITIOUS) {
 		pvo_flags &= ~PVO_MANAGED;
 		pvo_head = &moea_pvo_kunmanaged;
 		zone = moea_upvo_zone;
 	}
 
 	/*
 	 * If this is a managed page, and it's the first reference to the page,
 	 * clear the execness of the page.  Otherwise fetch the execness.
 	 */
 	if ((pg != NULL) && ((m->flags & PG_FICTITIOUS) == 0)) {
 		if (LIST_EMPTY(pvo_head)) {
 			moea_attr_clear(pg, PTE_EXEC);
 		} else {
 			was_exec = moea_attr_fetch(pg) & PTE_EXEC;
 		}
 	}
 
 	pte_lo = moea_calc_wimg(VM_PAGE_TO_PHYS(m), pmap_page_get_memattr(m));
 
 	if (prot & VM_PROT_WRITE) {
 		pte_lo |= PTE_BW;
 		if (pmap_bootstrapped &&
 		    (m->flags & (PG_FICTITIOUS | PG_UNMANAGED)) == 0)
 			vm_page_flag_set(m, PG_WRITEABLE);
 	} else
 		pte_lo |= PTE_BR;
 
 	if (prot & VM_PROT_EXECUTE)
 		pvo_flags |= PVO_EXECUTABLE;
 
 	if (wired)
 		pvo_flags |= PVO_WIRED;
 
 	if ((m->flags & PG_FICTITIOUS) != 0)
 		pvo_flags |= PVO_FAKE;
 
 	error = moea_pvo_enter(pmap, zone, pvo_head, va, VM_PAGE_TO_PHYS(m),
 	    pte_lo, pvo_flags);
 
 	/*
 	 * Flush the real page from the instruction cache if this page is
 	 * mapped executable and cacheable and was not previously mapped (or
 	 * was not mapped executable).
 	 */
 	if (error == 0 && (pvo_flags & PVO_EXECUTABLE) &&
 	    (pte_lo & PTE_I) == 0 && was_exec == 0) {
 		/*
 		 * Flush the real memory from the cache.
 		 */
 		moea_syncicache(VM_PAGE_TO_PHYS(m), PAGE_SIZE);
 		if (pg != NULL)
 			moea_attr_save(pg, PTE_EXEC);
 	}
 
 	/* XXX syncicache always until problems are sorted */
 	moea_syncicache(VM_PAGE_TO_PHYS(m), PAGE_SIZE);
 }
 
 /*
  * Maps a sequence of resident pages belonging to the same object.
  * The sequence begins with the given page m_start.  This page is
  * mapped at the given virtual address start.  Each subsequent page is
  * mapped at a virtual address that is offset from start by the same
  * amount as the page is offset from m_start within the object.  The
  * last page in the sequence is the page with the largest offset from
  * m_start that can be mapped at a virtual address less than the given
  * virtual address end.  Not every virtual page between start and end
  * is mapped; only those for which a resident page exists with the
  * corresponding offset from m_start are mapped.
  */
 void
 moea_enter_object(mmu_t mmu, pmap_t pm, vm_offset_t start, vm_offset_t end,
     vm_page_t m_start, vm_prot_t prot)
 {
 	vm_page_t m;
 	vm_pindex_t diff, psize;
 
 	psize = atop(end - start);
 	m = m_start;
 	vm_page_lock_queues();
 	PMAP_LOCK(pm);
 	while (m != NULL && (diff = m->pindex - m_start->pindex) < psize) {
 		moea_enter_locked(pm, start + ptoa(diff), m, prot &
 		    (VM_PROT_READ | VM_PROT_EXECUTE), FALSE);
 		m = TAILQ_NEXT(m, listq);
 	}
 	vm_page_unlock_queues();
 	PMAP_UNLOCK(pm);
 }
 
 void
 moea_enter_quick(mmu_t mmu, pmap_t pm, vm_offset_t va, vm_page_t m,
     vm_prot_t prot)
 {
 
 	vm_page_lock_queues();
 	PMAP_LOCK(pm);
 	moea_enter_locked(pm, va, m, prot & (VM_PROT_READ | VM_PROT_EXECUTE),
 	    FALSE);
 	vm_page_unlock_queues();
 	PMAP_UNLOCK(pm);
 }
 
 vm_paddr_t
 moea_extract(mmu_t mmu, pmap_t pm, vm_offset_t va)
 {
 	struct	pvo_entry *pvo;
 	vm_paddr_t pa;
 
 	PMAP_LOCK(pm);
 	pvo = moea_pvo_find_va(pm, va & ~ADDR_POFF, NULL);
 	if (pvo == NULL)
 		pa = 0;
 	else
 		pa = (pvo->pvo_pte.pte.pte_lo & PTE_RPGN) | (va & ADDR_POFF);
 	PMAP_UNLOCK(pm);
 	return (pa);
 }
 
 /*
  * Atomically extract and hold the physical page with the given
  * pmap and virtual address pair if that mapping permits the given
  * protection.
  */
 vm_page_t
 moea_extract_and_hold(mmu_t mmu, pmap_t pmap, vm_offset_t va, vm_prot_t prot)
 {
 	struct	pvo_entry *pvo;
 	vm_page_t m;
         vm_paddr_t pa;
 
 	m = NULL;
 	pa = 0;
 	PMAP_LOCK(pmap);
 retry:
 	pvo = moea_pvo_find_va(pmap, va & ~ADDR_POFF, NULL);
 	if (pvo != NULL && (pvo->pvo_pte.pte.pte_hi & PTE_VALID) &&
 	    ((pvo->pvo_pte.pte.pte_lo & PTE_PP) == PTE_RW ||
 	     (prot & VM_PROT_WRITE) == 0)) {
 		if (vm_page_pa_tryrelock(pmap, pvo->pvo_pte.pte.pte_lo & PTE_RPGN, &pa))
 			goto retry;
 		m = PHYS_TO_VM_PAGE(pvo->pvo_pte.pte.pte_lo & PTE_RPGN);
 		vm_page_hold(m);
 	}
 	PA_UNLOCK_COND(pa);
 	PMAP_UNLOCK(pmap);
 	return (m);
 }
 
 void
 moea_init(mmu_t mmu)
 {
 
 	moea_upvo_zone = uma_zcreate("UPVO entry", sizeof (struct pvo_entry),
 	    NULL, NULL, NULL, NULL, UMA_ALIGN_PTR,
 	    UMA_ZONE_VM | UMA_ZONE_NOFREE);
 	moea_mpvo_zone = uma_zcreate("MPVO entry", sizeof(struct pvo_entry),
 	    NULL, NULL, NULL, NULL, UMA_ALIGN_PTR,
 	    UMA_ZONE_VM | UMA_ZONE_NOFREE);
 	moea_initialized = TRUE;
 }
 
 boolean_t
 moea_is_referenced(mmu_t mmu, vm_page_t m)
 {
 
 	KASSERT((m->flags & (PG_FICTITIOUS | PG_UNMANAGED)) == 0,
 	    ("moea_is_referenced: page %p is not managed", m));
 	return (moea_query_bit(m, PTE_REF));
 }
 
 boolean_t
 moea_is_modified(mmu_t mmu, vm_page_t m)
 {
 
 	KASSERT((m->flags & (PG_FICTITIOUS | PG_UNMANAGED)) == 0,
 	    ("moea_is_modified: page %p is not managed", m));
 
 	/*
 	 * If the page is not VPO_BUSY, then PG_WRITEABLE cannot be
 	 * concurrently set while the object is locked.  Thus, if PG_WRITEABLE
 	 * is clear, no PTEs can have PTE_CHG set.
 	 */
 	VM_OBJECT_LOCK_ASSERT(m->object, MA_OWNED);
 	if ((m->oflags & VPO_BUSY) == 0 &&
 	    (m->flags & PG_WRITEABLE) == 0)
 		return (FALSE);
 	return (moea_query_bit(m, PTE_CHG));
 }
 
 boolean_t
 moea_is_prefaultable(mmu_t mmu, pmap_t pmap, vm_offset_t va)
 {
 	struct pvo_entry *pvo;
 	boolean_t rv;
 
 	PMAP_LOCK(pmap);
 	pvo = moea_pvo_find_va(pmap, va & ~ADDR_POFF, NULL);
 	rv = pvo == NULL || (pvo->pvo_pte.pte.pte_hi & PTE_VALID) == 0;
 	PMAP_UNLOCK(pmap);
 	return (rv);
 }
 
 void
 moea_clear_reference(mmu_t mmu, vm_page_t m)
 {
 
 	KASSERT((m->flags & (PG_FICTITIOUS | PG_UNMANAGED)) == 0,
 	    ("moea_clear_reference: page %p is not managed", m));
 	moea_clear_bit(m, PTE_REF);
 }
 
 void
 moea_clear_modify(mmu_t mmu, vm_page_t m)
 {
 
 	KASSERT((m->flags & (PG_FICTITIOUS | PG_UNMANAGED)) == 0,
 	    ("moea_clear_modify: page %p is not managed", m));
 	VM_OBJECT_LOCK_ASSERT(m->object, MA_OWNED);
 	KASSERT((m->oflags & VPO_BUSY) == 0,
 	    ("moea_clear_modify: page %p is busy", m));
 
 	/*
 	 * If the page is not PG_WRITEABLE, then no PTEs can have PTE_CHG
 	 * set.  If the object containing the page is locked and the page is
 	 * not VPO_BUSY, then PG_WRITEABLE cannot be concurrently set.
 	 */
 	if ((m->flags & PG_WRITEABLE) == 0)
 		return;
 	moea_clear_bit(m, PTE_CHG);
 }
 
 /*
  * Clear the write and modified bits in each of the given page's mappings.
  */
 void
 moea_remove_write(mmu_t mmu, vm_page_t m)
 {
 	struct	pvo_entry *pvo;
 	struct	pte *pt;
 	pmap_t	pmap;
 	u_int	lo;
 
 	KASSERT((m->flags & (PG_FICTITIOUS | PG_UNMANAGED)) == 0,
 	    ("moea_remove_write: page %p is not managed", m));
 
 	/*
 	 * If the page is not VPO_BUSY, then PG_WRITEABLE cannot be set by
 	 * another thread while the object is locked.  Thus, if PG_WRITEABLE
 	 * is clear, no page table entries need updating.
 	 */
 	VM_OBJECT_LOCK_ASSERT(m->object, MA_OWNED);
 	if ((m->oflags & VPO_BUSY) == 0 &&
 	    (m->flags & PG_WRITEABLE) == 0)
 		return;
 	vm_page_lock_queues();
 	lo = moea_attr_fetch(m);
 	powerpc_sync();
 	LIST_FOREACH(pvo, vm_page_to_pvoh(m), pvo_vlink) {
 		pmap = pvo->pvo_pmap;
 		PMAP_LOCK(pmap);
 		if ((pvo->pvo_pte.pte.pte_lo & PTE_PP) != PTE_BR) {
 			pt = moea_pvo_to_pte(pvo, -1);
 			pvo->pvo_pte.pte.pte_lo &= ~PTE_PP;
 			pvo->pvo_pte.pte.pte_lo |= PTE_BR;
 			if (pt != NULL) {
 				moea_pte_synch(pt, &pvo->pvo_pte.pte);
 				lo |= pvo->pvo_pte.pte.pte_lo;
 				pvo->pvo_pte.pte.pte_lo &= ~PTE_CHG;
 				moea_pte_change(pt, &pvo->pvo_pte.pte,
 				    pvo->pvo_vaddr);
 				mtx_unlock(&moea_table_mutex);
 			}
 		}
 		PMAP_UNLOCK(pmap);
 	}
 	if ((lo & PTE_CHG) != 0) {
 		moea_attr_clear(m, PTE_CHG);
 		vm_page_dirty(m);
 	}
 	vm_page_flag_clear(m, PG_WRITEABLE);
 	vm_page_unlock_queues();
 }
 
 /*
  *	moea_ts_referenced:
  *
  *	Return a count of reference bits for a page, clearing those bits.
  *	It is not necessary for every reference bit to be cleared, but it
  *	is necessary that 0 only be returned when there are truly no
  *	reference bits set.
  *
  *	XXX: The exact number of bits to check and clear is a matter that
  *	should be tested and standardized at some point in the future for
  *	optimal aging of shared pages.
  */
 boolean_t
 moea_ts_referenced(mmu_t mmu, vm_page_t m)
 {
 
 	KASSERT((m->flags & (PG_FICTITIOUS | PG_UNMANAGED)) == 0,
 	    ("moea_ts_referenced: page %p is not managed", m));
 	return (moea_clear_bit(m, PTE_REF));
 }
 
 /*
  * Modify the WIMG settings of all mappings for a page.
  */
 void
 moea_page_set_memattr(mmu_t mmu, vm_page_t m, vm_memattr_t ma)
 {
 	struct	pvo_entry *pvo;
 	struct	pvo_head *pvo_head;
 	struct	pte *pt;
 	pmap_t	pmap;
 	u_int	lo;
 
 	if (m->flags & PG_FICTITIOUS) {
 		m->md.mdpg_cache_attrs = ma;
 		return;
 	}
 
 	vm_page_lock_queues();
 	pvo_head = vm_page_to_pvoh(m);
 	lo = moea_calc_wimg(VM_PAGE_TO_PHYS(m), ma);
 
 	LIST_FOREACH(pvo, pvo_head, pvo_vlink) {
 		pmap = pvo->pvo_pmap;
 		PMAP_LOCK(pmap);
 		pt = moea_pvo_to_pte(pvo, -1);
 		pvo->pvo_pte.pte.pte_lo &= ~PTE_WIMG;
 		pvo->pvo_pte.pte.pte_lo |= lo;
 		if (pt != NULL) {
 			moea_pte_change(pt, &pvo->pvo_pte.pte,
 			    pvo->pvo_vaddr);
 			if (pvo->pvo_pmap == kernel_pmap)
 				isync();
 		}
 		mtx_unlock(&moea_table_mutex);
 		PMAP_UNLOCK(pmap);
 	}
 	m->md.mdpg_cache_attrs = ma;
 	vm_page_unlock_queues();
 }
 
 /*
  * Map a wired page into kernel virtual address space.
  */
 void
 moea_kenter(mmu_t mmu, vm_offset_t va, vm_offset_t pa)
 {
 
 	moea_kenter_attr(mmu, va, pa, VM_MEMATTR_DEFAULT);
 }
 
 void
 moea_kenter_attr(mmu_t mmu, vm_offset_t va, vm_offset_t pa, vm_memattr_t ma)
 {
 	u_int		pte_lo;
 	int		error;	
 
 #if 0
 	if (va < VM_MIN_KERNEL_ADDRESS)
 		panic("moea_kenter: attempt to enter non-kernel address %#x",
 		    va);
 #endif
 
 	pte_lo = moea_calc_wimg(pa, ma);
 
 	PMAP_LOCK(kernel_pmap);
 	error = moea_pvo_enter(kernel_pmap, moea_upvo_zone,
 	    &moea_pvo_kunmanaged, va, pa, pte_lo, PVO_WIRED);
 
 	if (error != 0 && error != ENOENT)
 		panic("moea_kenter: failed to enter va %#x pa %#x: %d", va,
 		    pa, error);
 
 	/*
 	 * Flush the real memory from the instruction cache.
 	 */
 	if ((pte_lo & (PTE_I | PTE_G)) == 0) {
 		moea_syncicache(pa, PAGE_SIZE);
 	}
 	PMAP_UNLOCK(kernel_pmap);
 }
 
 /*
  * Extract the physical page address associated with the given kernel virtual
  * address.
  */
 vm_offset_t
 moea_kextract(mmu_t mmu, vm_offset_t va)
 {
 	struct		pvo_entry *pvo;
 	vm_paddr_t pa;
 
 	/*
 	 * Allow direct mappings on 32-bit OEA
 	 */
 	if (va < VM_MIN_KERNEL_ADDRESS) {
 		return (va);
 	}
 
 	PMAP_LOCK(kernel_pmap);
 	pvo = moea_pvo_find_va(kernel_pmap, va & ~ADDR_POFF, NULL);
 	KASSERT(pvo != NULL, ("moea_kextract: no addr found"));
 	pa = (pvo->pvo_pte.pte.pte_lo & PTE_RPGN) | (va & ADDR_POFF);
 	PMAP_UNLOCK(kernel_pmap);
 	return (pa);
 }
 
 /*
  * Remove a wired page from kernel virtual address space.
  */
 void
 moea_kremove(mmu_t mmu, vm_offset_t va)
 {
 
 	moea_remove(mmu, kernel_pmap, va, va + PAGE_SIZE);
 }
 
 /*
  * Map a range of physical addresses into kernel virtual address space.
  *
  * The value passed in *virt is a suggested virtual address for the mapping.
  * Architectures which can support a direct-mapped physical to virtual region
  * can return the appropriate address within that region, leaving '*virt'
  * unchanged.  We cannot and therefore do not; *virt is updated with the
  * first usable address after the mapped region.
  */
 vm_offset_t
 moea_map(mmu_t mmu, vm_offset_t *virt, vm_offset_t pa_start,
     vm_offset_t pa_end, int prot)
 {
 	vm_offset_t	sva, va;
 
 	sva = *virt;
 	va = sva;
 	for (; pa_start < pa_end; pa_start += PAGE_SIZE, va += PAGE_SIZE)
 		moea_kenter(mmu, va, pa_start);
 	*virt = va;
 	return (sva);
 }
 
 /*
  * Returns true if the pmap's pv is one of the first
  * 16 pvs linked to from this page.  This count may
  * be changed upwards or downwards in the future; it
  * is only necessary that true be returned for a small
  * subset of pmaps for proper page aging.
  */
 boolean_t
 moea_page_exists_quick(mmu_t mmu, pmap_t pmap, vm_page_t m)
 {
         int loops;
 	struct pvo_entry *pvo;
 	boolean_t rv;
 
 	KASSERT((m->flags & (PG_FICTITIOUS | PG_UNMANAGED)) == 0,
 	    ("moea_page_exists_quick: page %p is not managed", m));
 	loops = 0;
 	rv = FALSE;
 	vm_page_lock_queues();
 	LIST_FOREACH(pvo, vm_page_to_pvoh(m), pvo_vlink) {
 		if (pvo->pvo_pmap == pmap) {
 			rv = TRUE;
 			break;
 		}
 		if (++loops >= 16)
 			break;
 	}
 	vm_page_unlock_queues();
 	return (rv);
 }
 
 /*
  * Return the number of managed mappings to the given physical page
  * that are wired.
  */
 int
 moea_page_wired_mappings(mmu_t mmu, vm_page_t m)
 {
 	struct pvo_entry *pvo;
 	int count;
 
 	count = 0;
 	if ((m->flags & PG_FICTITIOUS) != 0)
 		return (count);
 	vm_page_lock_queues();
 	LIST_FOREACH(pvo, vm_page_to_pvoh(m), pvo_vlink)
 		if ((pvo->pvo_vaddr & PVO_WIRED) != 0)
 			count++;
 	vm_page_unlock_queues();
 	return (count);
 }
 
 static u_int	moea_vsidcontext;
 
 void
 moea_pinit(mmu_t mmu, pmap_t pmap)
 {
 	int	i, mask;
 	u_int	entropy;
 
 	KASSERT((int)pmap < VM_MIN_KERNEL_ADDRESS, ("moea_pinit: virt pmap"));
 	PMAP_LOCK_INIT(pmap);
 
 	entropy = 0;
 	__asm __volatile("mftb %0" : "=r"(entropy));
 
 	if ((pmap->pmap_phys = (pmap_t)moea_kextract(mmu, (vm_offset_t)pmap))
 	    == NULL) {
 		pmap->pmap_phys = pmap;
 	}
 	
 
 	mtx_lock(&moea_vsid_mutex);
 	/*
 	 * Allocate some segment registers for this pmap.
 	 */
 	for (i = 0; i < NPMAPS; i += VSID_NBPW) {
 		u_int	hash, n;
 
 		/*
 		 * Create a new value by multiplying by a prime and adding in
 		 * entropy from the timebase register.  This is to make the
 		 * VSID more random so that the PT hash function collides
 		 * less often.  (Note that the prime causes gcc to do shifts
 		 * instead of a multiply.)
 		 */
 		moea_vsidcontext = (moea_vsidcontext * 0x1105) + entropy;
 		hash = moea_vsidcontext & (NPMAPS - 1);
 		if (hash == 0)		/* 0 is special, avoid it */
 			continue;
 		n = hash >> 5;
 		mask = 1 << (hash & (VSID_NBPW - 1));
 		hash = (moea_vsidcontext & 0xfffff);
 		if (moea_vsid_bitmap[n] & mask) {	/* collision? */
 			/* anything free in this bucket? */
 			if (moea_vsid_bitmap[n] == 0xffffffff) {
 				entropy = (moea_vsidcontext >> 20);
 				continue;
 			}
 			i = ffs(~moea_vsid_bitmap[n]) - 1;
 			mask = 1 << i;
 			hash &= 0xfffff & ~(VSID_NBPW - 1);
 			hash |= i;
 		}
 		moea_vsid_bitmap[n] |= mask;
 		for (i = 0; i < 16; i++)
 			pmap->pm_sr[i] = VSID_MAKE(i, hash);
 		mtx_unlock(&moea_vsid_mutex);
 		return;
 	}
 
 	mtx_unlock(&moea_vsid_mutex);
 	panic("moea_pinit: out of segments");
 }
 
 /*
  * Initialize the pmap associated with process 0.
  */
 void
 moea_pinit0(mmu_t mmu, pmap_t pm)
 {
 
 	moea_pinit(mmu, pm);
 	bzero(&pm->pm_stats, sizeof(pm->pm_stats));
 }
 
 /*
  * Set the physical protection on the specified range of this map as requested.
  */
 void
 moea_protect(mmu_t mmu, pmap_t pm, vm_offset_t sva, vm_offset_t eva,
     vm_prot_t prot)
 {
 	struct	pvo_entry *pvo;
 	struct	pte *pt;
 	int	pteidx;
 
 	KASSERT(pm == &curproc->p_vmspace->vm_pmap || pm == kernel_pmap,
 	    ("moea_protect: non current pmap"));
 
 	if ((prot & VM_PROT_READ) == VM_PROT_NONE) {
 		moea_remove(mmu, pm, sva, eva);
 		return;
 	}
 
 	vm_page_lock_queues();
 	PMAP_LOCK(pm);
 	for (; sva < eva; sva += PAGE_SIZE) {
 		pvo = moea_pvo_find_va(pm, sva, &pteidx);
 		if (pvo == NULL)
 			continue;
 
 		if ((prot & VM_PROT_EXECUTE) == 0)
 			pvo->pvo_vaddr &= ~PVO_EXECUTABLE;
 
 		/*
 		 * Grab the PTE pointer before we diddle with the cached PTE
 		 * copy.
 		 */
 		pt = moea_pvo_to_pte(pvo, pteidx);
 		/*
 		 * Change the protection of the page.
 		 */
 		pvo->pvo_pte.pte.pte_lo &= ~PTE_PP;
 		pvo->pvo_pte.pte.pte_lo |= PTE_BR;
 
 		/*
 		 * If the PVO is in the page table, update that pte as well.
 		 */
 		if (pt != NULL) {
 			moea_pte_change(pt, &pvo->pvo_pte.pte, pvo->pvo_vaddr);
 			mtx_unlock(&moea_table_mutex);
 		}
 	}
 	vm_page_unlock_queues();
 	PMAP_UNLOCK(pm);
 }
 
 /*
  * Map a list of wired pages into kernel virtual address space.  This is
  * intended for temporary mappings which do not need page modification or
  * references recorded.  Existing mappings in the region are overwritten.
  */
 void
 moea_qenter(mmu_t mmu, vm_offset_t sva, vm_page_t *m, int count)
 {
 	vm_offset_t va;
 
 	va = sva;
 	while (count-- > 0) {
 		moea_kenter(mmu, va, VM_PAGE_TO_PHYS(*m));
 		va += PAGE_SIZE;
 		m++;
 	}
 }
 
 /*
  * Remove page mappings from kernel virtual address space.  Intended for
  * temporary mappings entered by moea_qenter.
  */
 void
 moea_qremove(mmu_t mmu, vm_offset_t sva, int count)
 {
 	vm_offset_t va;
 
 	va = sva;
 	while (count-- > 0) {
 		moea_kremove(mmu, va);
 		va += PAGE_SIZE;
 	}
 }
 
 void
 moea_release(mmu_t mmu, pmap_t pmap)
 {
         int idx, mask;
         
 	/*
 	 * Free segment register's VSID
 	 */
         if (pmap->pm_sr[0] == 0)
                 panic("moea_release");
 
 	mtx_lock(&moea_vsid_mutex);
         idx = VSID_TO_HASH(pmap->pm_sr[0]) & (NPMAPS-1);
         mask = 1 << (idx % VSID_NBPW);
         idx /= VSID_NBPW;
         moea_vsid_bitmap[idx] &= ~mask;
 	mtx_unlock(&moea_vsid_mutex);
 	PMAP_LOCK_DESTROY(pmap);
 }
 
 /*
  * Remove the given range of addresses from the specified map.
  */
 void
 moea_remove(mmu_t mmu, pmap_t pm, vm_offset_t sva, vm_offset_t eva)
 {
 	struct	pvo_entry *pvo;
 	int	pteidx;
 
 	vm_page_lock_queues();
 	PMAP_LOCK(pm);
 	for (; sva < eva; sva += PAGE_SIZE) {
 		pvo = moea_pvo_find_va(pm, sva, &pteidx);
 		if (pvo != NULL) {
 			moea_pvo_remove(pvo, pteidx);
 		}
 	}
 	PMAP_UNLOCK(pm);
 	vm_page_unlock_queues();
 }
 
 /*
  * Remove physical page from all pmaps in which it resides. moea_pvo_remove()
  * will reflect changes in pte's back to the vm_page.
  */
 void
 moea_remove_all(mmu_t mmu, vm_page_t m)
 {
 	struct  pvo_head *pvo_head;
 	struct	pvo_entry *pvo, *next_pvo;
 	pmap_t	pmap;
 
 	vm_page_lock_queues();
 	pvo_head = vm_page_to_pvoh(m);
 	for (pvo = LIST_FIRST(pvo_head); pvo != NULL; pvo = next_pvo) {
 		next_pvo = LIST_NEXT(pvo, pvo_vlink);
 
 		pmap = pvo->pvo_pmap;
 		PMAP_LOCK(pmap);
 		moea_pvo_remove(pvo, -1);
 		PMAP_UNLOCK(pmap);
 	}
 	if ((m->flags & PG_WRITEABLE) && moea_is_modified(mmu, m)) {
 		moea_attr_clear(m, PTE_CHG);
 		vm_page_dirty(m);
 	}
 	vm_page_flag_clear(m, PG_WRITEABLE);
 	vm_page_unlock_queues();
 }
 
 /*
  * Allocate a physical page of memory directly from the phys_avail map.
  * Can only be called from moea_bootstrap before avail start and end are
  * calculated.
  */
 static vm_offset_t
 moea_bootstrap_alloc(vm_size_t size, u_int align)
 {
 	vm_offset_t	s, e;
 	int		i, j;
 
 	size = round_page(size);
 	for (i = 0; phys_avail[i + 1] != 0; i += 2) {
 		if (align != 0)
 			s = (phys_avail[i] + align - 1) & ~(align - 1);
 		else
 			s = phys_avail[i];
 		e = s + size;
 
 		if (s < phys_avail[i] || e > phys_avail[i + 1])
 			continue;
 
 		if (s == phys_avail[i]) {
 			phys_avail[i] += size;
 		} else if (e == phys_avail[i + 1]) {
 			phys_avail[i + 1] -= size;
 		} else {
 			for (j = phys_avail_count * 2; j > i; j -= 2) {
 				phys_avail[j] = phys_avail[j - 2];
 				phys_avail[j + 1] = phys_avail[j - 1];
 			}
 
 			phys_avail[i + 3] = phys_avail[i + 1];
 			phys_avail[i + 1] = s;
 			phys_avail[i + 2] = e;
 			phys_avail_count++;
 		}
 
 		return (s);
 	}
 	panic("moea_bootstrap_alloc: could not allocate memory");
 }
 
 static void
 moea_syncicache(vm_offset_t pa, vm_size_t len)
 {
 	__syncicache((void *)pa, len);
 }
 
 static int
 moea_pvo_enter(pmap_t pm, uma_zone_t zone, struct pvo_head *pvo_head,
     vm_offset_t va, vm_offset_t pa, u_int pte_lo, int flags)
 {
 	struct	pvo_entry *pvo;
 	u_int	sr;
 	int	first;
 	u_int	ptegidx;
 	int	i;
 	int     bootstrap;
 
 	moea_pvo_enter_calls++;
 	first = 0;
 	bootstrap = 0;
 
 	/*
 	 * Compute the PTE Group index.
 	 */
 	va &= ~ADDR_POFF;
 	sr = va_to_sr(pm->pm_sr, va);
 	ptegidx = va_to_pteg(sr, va);
 
 	/*
 	 * Remove any existing mapping for this page.  Reuse the pvo entry if
 	 * there is a mapping.
 	 */
 	mtx_lock(&moea_table_mutex);
 	LIST_FOREACH(pvo, &moea_pvo_table[ptegidx], pvo_olink) {
 		if (pvo->pvo_pmap == pm && PVO_VADDR(pvo) == va) {
 			if ((pvo->pvo_pte.pte.pte_lo & PTE_RPGN) == pa &&
 			    (pvo->pvo_pte.pte.pte_lo & PTE_PP) ==
 			    (pte_lo & PTE_PP)) {
 				mtx_unlock(&moea_table_mutex);
 				return (0);
 			}
 			moea_pvo_remove(pvo, -1);
 			break;
 		}
 	}
 
 	/*
 	 * If we aren't overwriting a mapping, try to allocate.
 	 */
 	if (moea_initialized) {
 		pvo = uma_zalloc(zone, M_NOWAIT);
 	} else {
 		if (moea_bpvo_pool_index >= BPVO_POOL_SIZE) {
 			panic("moea_enter: bpvo pool exhausted, %d, %d, %d",
 			      moea_bpvo_pool_index, BPVO_POOL_SIZE, 
 			      BPVO_POOL_SIZE * sizeof(struct pvo_entry));
 		}
 		pvo = &moea_bpvo_pool[moea_bpvo_pool_index];
 		moea_bpvo_pool_index++;
 		bootstrap = 1;
 	}
 
 	if (pvo == NULL) {
 		mtx_unlock(&moea_table_mutex);
 		return (ENOMEM);
 	}
 
 	moea_pvo_entries++;
 	pvo->pvo_vaddr = va;
 	pvo->pvo_pmap = pm;
 	LIST_INSERT_HEAD(&moea_pvo_table[ptegidx], pvo, pvo_olink);
 	pvo->pvo_vaddr &= ~ADDR_POFF;
 	if (flags & VM_PROT_EXECUTE)
 		pvo->pvo_vaddr |= PVO_EXECUTABLE;
 	if (flags & PVO_WIRED)
 		pvo->pvo_vaddr |= PVO_WIRED;
 	if (pvo_head != &moea_pvo_kunmanaged)
 		pvo->pvo_vaddr |= PVO_MANAGED;
 	if (bootstrap)
 		pvo->pvo_vaddr |= PVO_BOOTSTRAP;
 	if (flags & PVO_FAKE)
 		pvo->pvo_vaddr |= PVO_FAKE;
 
 	moea_pte_create(&pvo->pvo_pte.pte, sr, va, pa | pte_lo);
 
 	/*
 	 * Remember if the list was empty and therefore will be the first
 	 * item.
 	 */
 	if (LIST_FIRST(pvo_head) == NULL)
 		first = 1;
 	LIST_INSERT_HEAD(pvo_head, pvo, pvo_vlink);
 
 	if (pvo->pvo_pte.pte.pte_lo & PVO_WIRED)
 		pm->pm_stats.wired_count++;
 	pm->pm_stats.resident_count++;
 
 	/*
 	 * We hope this succeeds but it isn't required.
 	 */
 	i = moea_pte_insert(ptegidx, &pvo->pvo_pte.pte);
 	if (i >= 0) {
 		PVO_PTEGIDX_SET(pvo, i);
 	} else {
 		panic("moea_pvo_enter: overflow");
 		moea_pte_overflow++;
 	}
 	mtx_unlock(&moea_table_mutex);
 
 	return (first ? ENOENT : 0);
 }
 
 static void
 moea_pvo_remove(struct pvo_entry *pvo, int pteidx)
 {
 	struct	pte *pt;
 
 	/*
 	 * If there is an active pte entry, we need to deactivate it (and
 	 * save the ref & chg bits).
 	 */
 	pt = moea_pvo_to_pte(pvo, pteidx);
 	if (pt != NULL) {
 		moea_pte_unset(pt, &pvo->pvo_pte.pte, pvo->pvo_vaddr);
 		mtx_unlock(&moea_table_mutex);
 		PVO_PTEGIDX_CLR(pvo);
 	} else {
 		moea_pte_overflow--;
 	}
 
 	/*
 	 * Update our statistics.
 	 */
 	pvo->pvo_pmap->pm_stats.resident_count--;
 	if (pvo->pvo_pte.pte.pte_lo & PVO_WIRED)
 		pvo->pvo_pmap->pm_stats.wired_count--;
 
 	/*
 	 * Save the REF/CHG bits into their cache if the page is managed.
 	 */
 	if ((pvo->pvo_vaddr & (PVO_MANAGED|PVO_FAKE)) == PVO_MANAGED) {
 		struct	vm_page *pg;
 
 		pg = PHYS_TO_VM_PAGE(pvo->pvo_pte.pte.pte_lo & PTE_RPGN);
 		if (pg != NULL) {
 			moea_attr_save(pg, pvo->pvo_pte.pte.pte_lo &
 			    (PTE_REF | PTE_CHG));
 		}
 	}
 
 	/*
 	 * Remove this PVO from the PV list.
 	 */
 	LIST_REMOVE(pvo, pvo_vlink);
 
 	/*
 	 * Remove this from the overflow list and return it to the pool
 	 * if we aren't going to reuse it.
 	 */
 	LIST_REMOVE(pvo, pvo_olink);
 	if (!(pvo->pvo_vaddr & PVO_BOOTSTRAP))
 		uma_zfree(pvo->pvo_vaddr & PVO_MANAGED ? moea_mpvo_zone :
 		    moea_upvo_zone, pvo);
 	moea_pvo_entries--;
 	moea_pvo_remove_calls++;
 }
 
 static __inline int
 moea_pvo_pte_index(const struct pvo_entry *pvo, int ptegidx)
 {
 	int	pteidx;
 
 	/*
 	 * We can find the actual pte entry without searching by grabbing
 	 * the PTEG index from 3 unused bits in pte_lo[11:9] and by
 	 * noticing the HID bit.
 	 */
 	pteidx = ptegidx * 8 + PVO_PTEGIDX_GET(pvo);
 	if (pvo->pvo_pte.pte.pte_hi & PTE_HID)
 		pteidx ^= moea_pteg_mask * 8;
 
 	return (pteidx);
 }
 
 static struct pvo_entry *
 moea_pvo_find_va(pmap_t pm, vm_offset_t va, int *pteidx_p)
 {
 	struct	pvo_entry *pvo;
 	int	ptegidx;
 	u_int	sr;
 
 	va &= ~ADDR_POFF;
 	sr = va_to_sr(pm->pm_sr, va);
 	ptegidx = va_to_pteg(sr, va);
 
 	mtx_lock(&moea_table_mutex);
 	LIST_FOREACH(pvo, &moea_pvo_table[ptegidx], pvo_olink) {
 		if (pvo->pvo_pmap == pm && PVO_VADDR(pvo) == va) {
 			if (pteidx_p)
 				*pteidx_p = moea_pvo_pte_index(pvo, ptegidx);
 			break;
 		}
 	}
 	mtx_unlock(&moea_table_mutex);
 
 	return (pvo);
 }
 
 static struct pte *
 moea_pvo_to_pte(const struct pvo_entry *pvo, int pteidx)
 {
 	struct	pte *pt;
 
 	/*
 	 * If we haven't been supplied the pteidx, calculate it.
 	 */
 	if (pteidx == -1) {
 		int	ptegidx;
 		u_int	sr;
 
 		sr = va_to_sr(pvo->pvo_pmap->pm_sr, pvo->pvo_vaddr);
 		ptegidx = va_to_pteg(sr, pvo->pvo_vaddr);
 		pteidx = moea_pvo_pte_index(pvo, ptegidx);
 	}
 
 	pt = &moea_pteg_table[pteidx >> 3].pt[pteidx & 7];
 	mtx_lock(&moea_table_mutex);
 
 	if ((pvo->pvo_pte.pte.pte_hi & PTE_VALID) && !PVO_PTEGIDX_ISSET(pvo)) {
 		panic("moea_pvo_to_pte: pvo %p has valid pte in pvo but no "
 		    "valid pte index", pvo);
 	}
 
 	if ((pvo->pvo_pte.pte.pte_hi & PTE_VALID) == 0 && PVO_PTEGIDX_ISSET(pvo)) {
 		panic("moea_pvo_to_pte: pvo %p has valid pte index in pvo "
 		    "but no valid pte", pvo);
 	}
 
 	if ((pt->pte_hi ^ (pvo->pvo_pte.pte.pte_hi & ~PTE_VALID)) == PTE_VALID) {
 		if ((pvo->pvo_pte.pte.pte_hi & PTE_VALID) == 0) {
 			panic("moea_pvo_to_pte: pvo %p has valid pte in "
 			    "moea_pteg_table %p but invalid in pvo", pvo, pt);
 		}
 
 		if (((pt->pte_lo ^ pvo->pvo_pte.pte.pte_lo) & ~(PTE_CHG|PTE_REF))
 		    != 0) {
 			panic("moea_pvo_to_pte: pvo %p pte does not match "
 			    "pte %p in moea_pteg_table", pvo, pt);
 		}
 
 		mtx_assert(&moea_table_mutex, MA_OWNED);
 		return (pt);
 	}
 
 	if (pvo->pvo_pte.pte.pte_hi & PTE_VALID) {
 		panic("moea_pvo_to_pte: pvo %p has invalid pte %p in "
 		    "moea_pteg_table but valid in pvo", pvo, pt);
 	}
 
 	mtx_unlock(&moea_table_mutex);
 	return (NULL);
 }
 
 /*
  * XXX: THIS STUFF SHOULD BE IN pte.c?
  */
 int
 moea_pte_spill(vm_offset_t addr)
 {
 	struct	pvo_entry *source_pvo, *victim_pvo;
 	struct	pvo_entry *pvo;
 	int	ptegidx, i, j;
 	u_int	sr;
 	struct	pteg *pteg;
 	struct	pte *pt;
 
 	moea_pte_spills++;
 
 	sr = mfsrin(addr);
 	ptegidx = va_to_pteg(sr, addr);
 
 	/*
 	 * Have to substitute some entry.  Use the primary hash for this.
 	 * Use low bits of timebase as random generator.
 	 */
 	pteg = &moea_pteg_table[ptegidx];
 	mtx_lock(&moea_table_mutex);
 	__asm __volatile("mftb %0" : "=r"(i));
 	i &= 7;
 	pt = &pteg->pt[i];
 
 	source_pvo = NULL;
 	victim_pvo = NULL;
 	LIST_FOREACH(pvo, &moea_pvo_table[ptegidx], pvo_olink) {
 		/*
 		 * We need to find a pvo entry for this address.
 		 */
 		if (source_pvo == NULL &&
 		    moea_pte_match(&pvo->pvo_pte.pte, sr, addr,
 		    pvo->pvo_pte.pte.pte_hi & PTE_HID)) {
 			/*
 			 * Now found an entry to be spilled into the pteg.
 			 * The PTE is now valid, so we know it's active.
 			 */
 			j = moea_pte_insert(ptegidx, &pvo->pvo_pte.pte);
 
 			if (j >= 0) {
 				PVO_PTEGIDX_SET(pvo, j);
 				moea_pte_overflow--;
 				mtx_unlock(&moea_table_mutex);
 				return (1);
 			}
 
 			source_pvo = pvo;
 
 			if (victim_pvo != NULL)
 				break;
 		}
 
 		/*
 		 * We also need the pvo entry of the victim we are replacing
 		 * so save the R & C bits of the PTE.
 		 */
 		if ((pt->pte_hi & PTE_HID) == 0 && victim_pvo == NULL &&
 		    moea_pte_compare(pt, &pvo->pvo_pte.pte)) {
 			victim_pvo = pvo;
 			if (source_pvo != NULL)
 				break;
 		}
 	}
 
 	if (source_pvo == NULL) {
 		mtx_unlock(&moea_table_mutex);
 		return (0);
 	}
 
 	if (victim_pvo == NULL) {
 		if ((pt->pte_hi & PTE_HID) == 0)
 			panic("moea_pte_spill: victim p-pte (%p) has no pvo "
 			    "entry", pt);
 
 		/*
 		 * If this is a secondary PTE, we need to search its primary
 		 * pvo bucket for the matching PVO.
 		 */
 		LIST_FOREACH(pvo, &moea_pvo_table[ptegidx ^ moea_pteg_mask],
 		    pvo_olink) {
 			/*
 			 * We also need the pvo entry of the victim we are
 			 * replacing so save the R & C bits of the PTE.
 			 */
 			if (moea_pte_compare(pt, &pvo->pvo_pte.pte)) {
 				victim_pvo = pvo;
 				break;
 			}
 		}
 
 		if (victim_pvo == NULL)
 			panic("moea_pte_spill: victim s-pte (%p) has no pvo "
 			    "entry", pt);
 	}
 
 	/*
 	 * We are invalidating the TLB entry for the EA we are replacing even
 	 * though it's valid.  If we don't, we lose any ref/chg bit changes
 	 * contained in the TLB entry.
 	 */
 	source_pvo->pvo_pte.pte.pte_hi &= ~PTE_HID;
 
 	moea_pte_unset(pt, &victim_pvo->pvo_pte.pte, victim_pvo->pvo_vaddr);
 	moea_pte_set(pt, &source_pvo->pvo_pte.pte);
 
 	PVO_PTEGIDX_CLR(victim_pvo);
 	PVO_PTEGIDX_SET(source_pvo, i);
 	moea_pte_replacements++;
 
 	mtx_unlock(&moea_table_mutex);
 	return (1);
 }
 
 static int
 moea_pte_insert(u_int ptegidx, struct pte *pvo_pt)
 {
 	struct	pte *pt;
 	int	i;
 
 	mtx_assert(&moea_table_mutex, MA_OWNED);
 
 	/*
 	 * First try primary hash.
 	 */
 	for (pt = moea_pteg_table[ptegidx].pt, i = 0; i < 8; i++, pt++) {
 		if ((pt->pte_hi & PTE_VALID) == 0) {
 			pvo_pt->pte_hi &= ~PTE_HID;
 			moea_pte_set(pt, pvo_pt);
 			return (i);
 		}
 	}
 
 	/*
 	 * Now try secondary hash.
 	 */
 	ptegidx ^= moea_pteg_mask;
 
 	for (pt = moea_pteg_table[ptegidx].pt, i = 0; i < 8; i++, pt++) {
 		if ((pt->pte_hi & PTE_VALID) == 0) {
 			pvo_pt->pte_hi |= PTE_HID;
 			moea_pte_set(pt, pvo_pt);
 			return (i);
 		}
 	}
 
 	panic("moea_pte_insert: overflow");
 	return (-1);
 }
 
 static boolean_t
 moea_query_bit(vm_page_t m, int ptebit)
 {
 	struct	pvo_entry *pvo;
 	struct	pte *pt;
 
 	if (moea_attr_fetch(m) & ptebit)
 		return (TRUE);
 
 	vm_page_lock_queues();
 	LIST_FOREACH(pvo, vm_page_to_pvoh(m), pvo_vlink) {
 
 		/*
 		 * See if we saved the bit off.  If so, cache it and return
 		 * success.
 		 */
 		if (pvo->pvo_pte.pte.pte_lo & ptebit) {
 			moea_attr_save(m, ptebit);
 			vm_page_unlock_queues();
 			return (TRUE);
 		}
 	}
 
 	/*
 	 * No luck, now go through the hard part of looking at the PTEs
 	 * themselves.  Sync so that any pending REF/CHG bits are flushed to
 	 * the PTEs.
 	 */
 	powerpc_sync();
 	LIST_FOREACH(pvo, vm_page_to_pvoh(m), pvo_vlink) {
 
 		/*
 		 * See if this pvo has a valid PTE.  If so, fetch the
 		 * REF/CHG bits from the valid PTE.  If the appropriate
 		 * ptebit is set, cache it and return success.
 		 */
 		pt = moea_pvo_to_pte(pvo, -1);
 		if (pt != NULL) {
 			moea_pte_synch(pt, &pvo->pvo_pte.pte);
 			mtx_unlock(&moea_table_mutex);
 			if (pvo->pvo_pte.pte.pte_lo & ptebit) {
 				moea_attr_save(m, ptebit);
 				vm_page_unlock_queues();
 				return (TRUE);
 			}
 		}
 	}
 
 	vm_page_unlock_queues();
 	return (FALSE);
 }
 
 static u_int
 moea_clear_bit(vm_page_t m, int ptebit)
 {
 	u_int	count;
 	struct	pvo_entry *pvo;
 	struct	pte *pt;
 
 	vm_page_lock_queues();
 
 	/*
 	 * Clear the cached value.
 	 */
 	moea_attr_clear(m, ptebit);
 
 	/*
 	 * Sync so that any pending REF/CHG bits are flushed to the PTEs (so
 	 * we can reset the right ones).  Note that since the pvo entries and
 	 * list heads are accessed via BAT0 and are never placed in the page
 	 * table, we don't have to worry about further accesses setting the
 	 * REF/CHG bits.
 	 */
 	powerpc_sync();
 
 	/*
 	 * For each pvo entry, clear the pvo's ptebit.  If this pvo has a
 	 * valid pte clear the ptebit from the valid pte.
 	 */
 	count = 0;
 	LIST_FOREACH(pvo, vm_page_to_pvoh(m), pvo_vlink) {
 		pt = moea_pvo_to_pte(pvo, -1);
 		if (pt != NULL) {
 			moea_pte_synch(pt, &pvo->pvo_pte.pte);
 			if (pvo->pvo_pte.pte.pte_lo & ptebit) {
 				count++;
 				moea_pte_clear(pt, PVO_VADDR(pvo), ptebit);
 			}
 			mtx_unlock(&moea_table_mutex);
 		}
 		pvo->pvo_pte.pte.pte_lo &= ~ptebit;
 	}
 
 	vm_page_unlock_queues();
 	return (count);
 }
 
 /*
  * Return true if the physical range is encompassed by the battable[idx]
  */
 static int
 moea_bat_mapped(int idx, vm_offset_t pa, vm_size_t size)
 {
 	u_int prot;
 	u_int32_t start;
 	u_int32_t end;
 	u_int32_t bat_ble;
 
 	/*
 	 * Return immediately if not a valid mapping
 	 */
 	if (!(battable[idx].batu & BAT_Vs))
 		return (EINVAL);
 
 	/*
 	 * The BAT entry must be cache-inhibited, guarded, and r/w
 	 * so it can function as an i/o page
 	 */
 	prot = battable[idx].batl & (BAT_I|BAT_G|BAT_PP_RW);
 	if (prot != (BAT_I|BAT_G|BAT_PP_RW))
 		return (EPERM);	
 
 	/*
 	 * The address should be within the BAT range. Assume that the
 	 * start address in the BAT has the correct alignment (thus
 	 * not requiring masking)
 	 */
 	start = battable[idx].batl & BAT_PBS;
 	bat_ble = (battable[idx].batu & ~(BAT_EBS)) | 0x03;
 	end = start | (bat_ble << 15) | 0x7fff;
 
 	if ((pa < start) || ((pa + size) > end))
 		return (ERANGE);
 
 	return (0);
 }
 
 boolean_t
 moea_dev_direct_mapped(mmu_t mmu, vm_offset_t pa, vm_size_t size)
 {
 	int i;
 
 	/*
 	 * This currently does not work for entries that 
 	 * overlap 256M BAT segments.
 	 */
 
 	for(i = 0; i < 16; i++)
 		if (moea_bat_mapped(i, pa, size) == 0)
 			return (0);
 
 	return (EFAULT);
 }
 
 /*
  * Map a set of physical memory pages into the kernel virtual
  * address space. Return a pointer to where it is mapped. This
  * routine is intended to be used for mapping device memory,
  * NOT real memory.
  */
 void *
 moea_mapdev(mmu_t mmu, vm_offset_t pa, vm_size_t size)
 {
 
 	return (moea_mapdev_attr(mmu, pa, size, VM_MEMATTR_DEFAULT));
 }
 
 void *
 moea_mapdev_attr(mmu_t mmu, vm_offset_t pa, vm_size_t size, vm_memattr_t ma)
 {
 	vm_offset_t va, tmpva, ppa, offset;
 	int i;
 
 	ppa = trunc_page(pa);
 	offset = pa & PAGE_MASK;
 	size = roundup(offset + size, PAGE_SIZE);
 	
 	/*
 	 * If the physical address lies within a valid BAT table entry,
 	 * return the 1:1 mapping. This currently doesn't work
 	 * for regions that overlap 256M BAT segments.
 	 */
 	for (i = 0; i < 16; i++) {
 		if (moea_bat_mapped(i, pa, size) == 0)
 			return ((void *) pa);
 	}
 
 	va = kmem_alloc_nofault(kernel_map, size);
 	if (!va)
 		panic("moea_mapdev: Couldn't alloc kernel virtual memory");
 
 	for (tmpva = va; size > 0;) {
 		moea_kenter_attr(mmu, tmpva, ppa, ma);
 		tlbie(tmpva);
 		size -= PAGE_SIZE;
 		tmpva += PAGE_SIZE;
 		ppa += PAGE_SIZE;
 	}
 
 	return ((void *)(va + offset));
 }
 
 void
 moea_unmapdev(mmu_t mmu, vm_offset_t va, vm_size_t size)
 {
 	vm_offset_t base, offset;
 
 	/*
 	 * If this is outside kernel virtual space, then it's a
 	 * battable entry and doesn't require unmapping
 	 */
 	if ((va >= VM_MIN_KERNEL_ADDRESS) && (va <= virtual_end)) {
 		base = trunc_page(va);
 		offset = va & PAGE_MASK;
 		size = roundup(offset + size, PAGE_SIZE);
 		kmem_free(kernel_map, base, size);
 	}
 }
 
 static void
 moea_sync_icache(mmu_t mmu, pmap_t pm, vm_offset_t va, vm_size_t sz)
 {
 	struct pvo_entry *pvo;
 	vm_offset_t lim;
 	vm_paddr_t pa;
 	vm_size_t len;
 
 	PMAP_LOCK(pm);
 	while (sz > 0) {
 		lim = round_page(va);
 		len = MIN(lim - va, sz);
 		pvo = moea_pvo_find_va(pm, va & ~ADDR_POFF, NULL);
 		if (pvo != NULL) {
 			pa = (pvo->pvo_pte.pte.pte_lo & PTE_RPGN) |
 			    (va & ADDR_POFF);
 			moea_syncicache(pa, len);
 		}
 		va += len;
 		sz -= len;
 	}
 	PMAP_UNLOCK(pm);
 }
Index: head/sys/powerpc/aim/mmu_oea64.c
===================================================================
--- head/sys/powerpc/aim/mmu_oea64.c	(revision 222812)
+++ head/sys/powerpc/aim/mmu_oea64.c	(revision 222813)
@@ -1,2574 +1,2581 @@
 /*-
  * Copyright (c) 2001 The NetBSD Foundation, Inc.
  * All rights reserved.
  *
  * This code is derived from software contributed to The NetBSD Foundation
  * by Matt Thomas <matt@3am-software.com> of Allegro Networks, Inc.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
  * 1. Redistributions of source code must retain the above copyright
  *    notice, this list of conditions and the following disclaimer.
  * 2. Redistributions in binary form must reproduce the above copyright
  *    notice, this list of conditions and the following disclaimer in the
  *    documentation and/or other materials provided with the distribution.
  * 3. All advertising materials mentioning features or use of this software
  *    must display the following acknowledgement:
  *        This product includes software developed by the NetBSD
  *        Foundation, Inc. and its contributors.
  * 4. Neither the name of The NetBSD Foundation nor the names of its
  *    contributors may be used to endorse or promote products derived
  *    from this software without specific prior written permission.
  *
  * THIS SOFTWARE IS PROVIDED BY THE NETBSD FOUNDATION, INC. AND CONTRIBUTORS
  * ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
  * TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
  * PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE FOUNDATION OR CONTRIBUTORS
  * BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
  * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
  * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
  * INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
  * CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
  * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
  * POSSIBILITY OF SUCH DAMAGE.
  */
 /*-
  * Copyright (C) 1995, 1996 Wolfgang Solfrank.
  * Copyright (C) 1995, 1996 TooLs GmbH.
  * All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
  * 1. Redistributions of source code must retain the above copyright
  *    notice, this list of conditions and the following disclaimer.
  * 2. Redistributions in binary form must reproduce the above copyright
  *    notice, this list of conditions and the following disclaimer in the
  *    documentation and/or other materials provided with the distribution.
  * 3. All advertising materials mentioning features or use of this software
  *    must display the following acknowledgement:
  *	This product includes software developed by TooLs GmbH.
  * 4. The name of TooLs GmbH may not be used to endorse or promote products
  *    derived from this software without specific prior written permission.
  *
  * THIS SOFTWARE IS PROVIDED BY TOOLS GMBH ``AS IS'' AND ANY EXPRESS OR
  * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
  * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
  * IN NO EVENT SHALL TOOLS GMBH BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
  * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
  * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
  * OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY,
  * WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR
  * OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF
  * ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
  *
  * $NetBSD: pmap.c,v 1.28 2000/03/26 20:42:36 kleink Exp $
  */
 /*-
  * Copyright (C) 2001 Benno Rice.
  * All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
  * 1. Redistributions of source code must retain the above copyright
  *    notice, this list of conditions and the following disclaimer.
  * 2. Redistributions in binary form must reproduce the above copyright
  *    notice, this list of conditions and the following disclaimer in the
  *    documentation and/or other materials provided with the distribution.
  *
  * THIS SOFTWARE IS PROVIDED BY Benno Rice ``AS IS'' AND ANY EXPRESS OR
  * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
  * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
  * IN NO EVENT SHALL TOOLS GMBH BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
  * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
  * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
  * OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY,
  * WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR
  * OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF
  * ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
  */
 
 #include <sys/cdefs.h>
 __FBSDID("$FreeBSD$");
 
 /*
  * Manages physical address maps.
  *
  * In addition to hardware address maps, this module is called upon to
  * provide software-use-only maps which may or may not be stored in the
  * same form as hardware maps.  These pseudo-maps are used to store
  * intermediate results from copy operations to and from address spaces.
  *
  * Since the information managed by this module is also stored by the
  * logical address mapping module, this module may throw away valid virtual
  * to physical mappings at almost any time.  However, invalidations of
  * mappings must be done as requested.
  *
  * In order to cope with hardware architectures which make virtual to
  * physical map invalidates expensive, this module may delay invalidate
  * reduced protection operations until such time as they are actually
  * necessary.  This module is given full information as to which processors
  * are currently using which maps, and to when physical maps must be made
  * correct.
  */
 
 #include "opt_kstack_pages.h"
 
 #include <sys/param.h>
 #include <sys/kernel.h>
+#include <sys/queue.h>
+#include <sys/cpuset.h>
 #include <sys/ktr.h>
 #include <sys/lock.h>
 #include <sys/msgbuf.h>
 #include <sys/mutex.h>
 #include <sys/proc.h>
+#include <sys/sched.h>
 #include <sys/sysctl.h>
 #include <sys/systm.h>
 #include <sys/vmmeter.h>
 
 #include <sys/kdb.h>
 
 #include <dev/ofw/openfirm.h>
 
 #include <vm/vm.h>
 #include <vm/vm_param.h>
 #include <vm/vm_kern.h>
 #include <vm/vm_page.h>
 #include <vm/vm_map.h>
 #include <vm/vm_object.h>
 #include <vm/vm_extern.h>
 #include <vm/vm_pageout.h>
 #include <vm/vm_pager.h>
 #include <vm/uma.h>
 
 #include <machine/_inttypes.h>
 #include <machine/cpu.h>
 #include <machine/platform.h>
 #include <machine/frame.h>
 #include <machine/md_var.h>
 #include <machine/psl.h>
 #include <machine/bat.h>
 #include <machine/hid.h>
 #include <machine/pte.h>
 #include <machine/sr.h>
 #include <machine/trap.h>
 #include <machine/mmuvar.h>
 
 #include "mmu_oea64.h"
 #include "mmu_if.h"
 #include "moea64_if.h"
 
 void moea64_release_vsid(uint64_t vsid);
 uintptr_t moea64_get_unique_vsid(void); 
 
 #define DISABLE_TRANS(msr)	msr = mfmsr(); mtmsr(msr & ~PSL_DR)
 #define ENABLE_TRANS(msr)	mtmsr(msr)
 
 #define	VSID_MAKE(sr, hash)	((sr) | (((hash) & 0xfffff) << 4))
 #define	VSID_TO_HASH(vsid)	(((vsid) >> 4) & 0xfffff)
 #define	VSID_HASH_MASK		0x0000007fffffffffULL
 
 #define LOCK_TABLE() mtx_lock(&moea64_table_mutex)
 #define UNLOCK_TABLE() mtx_unlock(&moea64_table_mutex);
 #define ASSERT_TABLE_LOCK() mtx_assert(&moea64_table_mutex, MA_OWNED)
 
 struct ofw_map {
 	cell_t	om_va;
 	cell_t	om_len;
 	cell_t	om_pa_hi;
 	cell_t	om_pa_lo;
 	cell_t	om_mode;
 };
 
 /*
  * Map of physical memory regions.
  */
 static struct	mem_region *regions;
 static struct	mem_region *pregions;
 static u_int	phys_avail_count;
 static int	regions_sz, pregions_sz;
 
 extern void bs_remap_earlyboot(void);
 
 /*
  * Lock for the pteg and pvo tables.
  */
 struct mtx	moea64_table_mutex;
 struct mtx	moea64_slb_mutex;
 
 /*
  * PTEG data.
  */
 u_int		moea64_pteg_count;
 u_int		moea64_pteg_mask;
 
 /*
  * PVO data.
  */
 struct	pvo_head *moea64_pvo_table;		/* pvo entries by pteg index */
 struct	pvo_head moea64_pvo_kunmanaged =	/* list of unmanaged pages */
     LIST_HEAD_INITIALIZER(moea64_pvo_kunmanaged);
 
 uma_zone_t	moea64_upvo_zone; /* zone for pvo entries for unmanaged pages */
 uma_zone_t	moea64_mpvo_zone; /* zone for pvo entries for managed pages */
 
 #define	BPVO_POOL_SIZE	327680
 static struct	pvo_entry *moea64_bpvo_pool;
 static int	moea64_bpvo_pool_index = 0;
 
 #define	VSID_NBPW	(sizeof(u_int32_t) * 8)
 #ifdef __powerpc64__
 #define	NVSIDS		(NPMAPS * 16)
 #define VSID_HASHMASK	0xffffffffUL
 #else
 #define NVSIDS		NPMAPS
 #define VSID_HASHMASK	0xfffffUL
 #endif
 static u_int	moea64_vsid_bitmap[NVSIDS / VSID_NBPW];
 
 static boolean_t moea64_initialized = FALSE;
 
 /*
  * Statistics.
  */
 u_int	moea64_pte_valid = 0;
 u_int	moea64_pte_overflow = 0;
 u_int	moea64_pvo_entries = 0;
 u_int	moea64_pvo_enter_calls = 0;
 u_int	moea64_pvo_remove_calls = 0;
 SYSCTL_INT(_machdep, OID_AUTO, moea64_pte_valid, CTLFLAG_RD, 
     &moea64_pte_valid, 0, "");
 SYSCTL_INT(_machdep, OID_AUTO, moea64_pte_overflow, CTLFLAG_RD,
     &moea64_pte_overflow, 0, "");
 SYSCTL_INT(_machdep, OID_AUTO, moea64_pvo_entries, CTLFLAG_RD, 
     &moea64_pvo_entries, 0, "");
 SYSCTL_INT(_machdep, OID_AUTO, moea64_pvo_enter_calls, CTLFLAG_RD,
     &moea64_pvo_enter_calls, 0, "");
 SYSCTL_INT(_machdep, OID_AUTO, moea64_pvo_remove_calls, CTLFLAG_RD,
     &moea64_pvo_remove_calls, 0, "");
 
 vm_offset_t	moea64_scratchpage_va[2];
 struct pvo_entry *moea64_scratchpage_pvo[2];
 uintptr_t	moea64_scratchpage_pte[2];
 struct	mtx	moea64_scratchpage_mtx;
 
 uint64_t 	moea64_large_page_mask = 0;
 int		moea64_large_page_size = 0;
 int		moea64_large_page_shift = 0;
 
 /*
  * PVO calls.
  */
 static int	moea64_pvo_enter(mmu_t, pmap_t, uma_zone_t, struct pvo_head *,
 		    vm_offset_t, vm_offset_t, uint64_t, int);
 static void	moea64_pvo_remove(mmu_t, struct pvo_entry *);
 static struct	pvo_entry *moea64_pvo_find_va(pmap_t, vm_offset_t);
 
 /*
  * Utility routines.
  */
 static void		moea64_enter_locked(mmu_t, pmap_t, vm_offset_t,
 			    vm_page_t, vm_prot_t, boolean_t);
 static boolean_t	moea64_query_bit(mmu_t, vm_page_t, u_int64_t);
 static u_int		moea64_clear_bit(mmu_t, vm_page_t, u_int64_t);
 static void		moea64_kremove(mmu_t, vm_offset_t);
 static void		moea64_syncicache(mmu_t, pmap_t pmap, vm_offset_t va, 
 			    vm_offset_t pa, vm_size_t sz);
 
 /*
  * Kernel MMU interface
  */
 void moea64_change_wiring(mmu_t, pmap_t, vm_offset_t, boolean_t);
 void moea64_clear_modify(mmu_t, vm_page_t);
 void moea64_clear_reference(mmu_t, vm_page_t);
 void moea64_copy_page(mmu_t, vm_page_t, vm_page_t);
 void moea64_enter(mmu_t, pmap_t, vm_offset_t, vm_page_t, vm_prot_t, boolean_t);
 void moea64_enter_object(mmu_t, pmap_t, vm_offset_t, vm_offset_t, vm_page_t,
     vm_prot_t);
 void moea64_enter_quick(mmu_t, pmap_t, vm_offset_t, vm_page_t, vm_prot_t);
 vm_paddr_t moea64_extract(mmu_t, pmap_t, vm_offset_t);
 vm_page_t moea64_extract_and_hold(mmu_t, pmap_t, vm_offset_t, vm_prot_t);
 void moea64_init(mmu_t);
 boolean_t moea64_is_modified(mmu_t, vm_page_t);
 boolean_t moea64_is_prefaultable(mmu_t, pmap_t, vm_offset_t);
 boolean_t moea64_is_referenced(mmu_t, vm_page_t);
 boolean_t moea64_ts_referenced(mmu_t, vm_page_t);
 vm_offset_t moea64_map(mmu_t, vm_offset_t *, vm_offset_t, vm_offset_t, int);
 boolean_t moea64_page_exists_quick(mmu_t, pmap_t, vm_page_t);
 int moea64_page_wired_mappings(mmu_t, vm_page_t);
 void moea64_pinit(mmu_t, pmap_t);
 void moea64_pinit0(mmu_t, pmap_t);
 void moea64_protect(mmu_t, pmap_t, vm_offset_t, vm_offset_t, vm_prot_t);
 void moea64_qenter(mmu_t, vm_offset_t, vm_page_t *, int);
 void moea64_qremove(mmu_t, vm_offset_t, int);
 void moea64_release(mmu_t, pmap_t);
 void moea64_remove(mmu_t, pmap_t, vm_offset_t, vm_offset_t);
 void moea64_remove_all(mmu_t, vm_page_t);
 void moea64_remove_write(mmu_t, vm_page_t);
 void moea64_zero_page(mmu_t, vm_page_t);
 void moea64_zero_page_area(mmu_t, vm_page_t, int, int);
 void moea64_zero_page_idle(mmu_t, vm_page_t);
 void moea64_activate(mmu_t, struct thread *);
 void moea64_deactivate(mmu_t, struct thread *);
 void *moea64_mapdev(mmu_t, vm_offset_t, vm_size_t);
 void *moea64_mapdev_attr(mmu_t, vm_offset_t, vm_size_t, vm_memattr_t);
 void moea64_unmapdev(mmu_t, vm_offset_t, vm_size_t);
 vm_offset_t moea64_kextract(mmu_t, vm_offset_t);
 void moea64_page_set_memattr(mmu_t, vm_page_t m, vm_memattr_t ma);
 void moea64_kenter_attr(mmu_t, vm_offset_t, vm_offset_t, vm_memattr_t ma);
 void moea64_kenter(mmu_t, vm_offset_t, vm_offset_t);
 boolean_t moea64_dev_direct_mapped(mmu_t, vm_offset_t, vm_size_t);
 static void moea64_sync_icache(mmu_t, pmap_t, vm_offset_t, vm_size_t);
 
 static mmu_method_t moea64_methods[] = {
 	MMUMETHOD(mmu_change_wiring,	moea64_change_wiring),
 	MMUMETHOD(mmu_clear_modify,	moea64_clear_modify),
 	MMUMETHOD(mmu_clear_reference,	moea64_clear_reference),
 	MMUMETHOD(mmu_copy_page,	moea64_copy_page),
 	MMUMETHOD(mmu_enter,		moea64_enter),
 	MMUMETHOD(mmu_enter_object,	moea64_enter_object),
 	MMUMETHOD(mmu_enter_quick,	moea64_enter_quick),
 	MMUMETHOD(mmu_extract,		moea64_extract),
 	MMUMETHOD(mmu_extract_and_hold,	moea64_extract_and_hold),
 	MMUMETHOD(mmu_init,		moea64_init),
 	MMUMETHOD(mmu_is_modified,	moea64_is_modified),
 	MMUMETHOD(mmu_is_prefaultable,	moea64_is_prefaultable),
 	MMUMETHOD(mmu_is_referenced,	moea64_is_referenced),
 	MMUMETHOD(mmu_ts_referenced,	moea64_ts_referenced),
 	MMUMETHOD(mmu_map,     		moea64_map),
 	MMUMETHOD(mmu_page_exists_quick,moea64_page_exists_quick),
 	MMUMETHOD(mmu_page_wired_mappings,moea64_page_wired_mappings),
 	MMUMETHOD(mmu_pinit,		moea64_pinit),
 	MMUMETHOD(mmu_pinit0,		moea64_pinit0),
 	MMUMETHOD(mmu_protect,		moea64_protect),
 	MMUMETHOD(mmu_qenter,		moea64_qenter),
 	MMUMETHOD(mmu_qremove,		moea64_qremove),
 	MMUMETHOD(mmu_release,		moea64_release),
 	MMUMETHOD(mmu_remove,		moea64_remove),
 	MMUMETHOD(mmu_remove_all,      	moea64_remove_all),
 	MMUMETHOD(mmu_remove_write,	moea64_remove_write),
 	MMUMETHOD(mmu_sync_icache,	moea64_sync_icache),
 	MMUMETHOD(mmu_zero_page,       	moea64_zero_page),
 	MMUMETHOD(mmu_zero_page_area,	moea64_zero_page_area),
 	MMUMETHOD(mmu_zero_page_idle,	moea64_zero_page_idle),
 	MMUMETHOD(mmu_activate,		moea64_activate),
 	MMUMETHOD(mmu_deactivate,      	moea64_deactivate),
 	MMUMETHOD(mmu_page_set_memattr,	moea64_page_set_memattr),
 
 	/* Internal interfaces */
 	MMUMETHOD(mmu_mapdev,		moea64_mapdev),
 	MMUMETHOD(mmu_mapdev_attr,	moea64_mapdev_attr),
 	MMUMETHOD(mmu_unmapdev,		moea64_unmapdev),
 	MMUMETHOD(mmu_kextract,		moea64_kextract),
 	MMUMETHOD(mmu_kenter,		moea64_kenter),
 	MMUMETHOD(mmu_kenter_attr,	moea64_kenter_attr),
 	MMUMETHOD(mmu_dev_direct_mapped,moea64_dev_direct_mapped),
 
 	{ 0, 0 }
 };
 
 MMU_DEF(oea64_mmu, "mmu_oea64_base", moea64_methods, 0);
 
 static __inline u_int
 va_to_pteg(uint64_t vsid, vm_offset_t addr, int large)
 {
 	uint64_t hash;
 	int shift;
 
 	shift = large ? moea64_large_page_shift : ADDR_PIDX_SHFT;
 	hash = (vsid & VSID_HASH_MASK) ^ (((uint64_t)addr & ADDR_PIDX) >>
 	    shift);
 	return (hash & moea64_pteg_mask);
 }
 
 static __inline struct pvo_head *
 vm_page_to_pvoh(vm_page_t m)
 {
 
 	return (&m->md.mdpg_pvoh);
 }
 
 static __inline void
 moea64_attr_clear(vm_page_t m, u_int64_t ptebit)
 {
 
 	mtx_assert(&vm_page_queue_mtx, MA_OWNED);
 	m->md.mdpg_attrs &= ~ptebit;
 }
 
 static __inline u_int64_t
 moea64_attr_fetch(vm_page_t m)
 {
 
 	return (m->md.mdpg_attrs);
 }
 
 static __inline void
 moea64_attr_save(vm_page_t m, u_int64_t ptebit)
 {
 
 	mtx_assert(&vm_page_queue_mtx, MA_OWNED);
 	m->md.mdpg_attrs |= ptebit;
 }
 
 static __inline void
 moea64_pte_create(struct lpte *pt, uint64_t vsid, vm_offset_t va, 
     uint64_t pte_lo, int flags)
 {
 
 	ASSERT_TABLE_LOCK();
 
 	/*
 	 * Construct a PTE.  Default to IMB initially.  Valid bit only gets
 	 * set when the real pte is set in memory.
 	 *
 	 * Note: Don't set the valid bit for correct operation of tlb update.
 	 */
 	pt->pte_hi = (vsid << LPTE_VSID_SHIFT) |
 	    (((uint64_t)(va & ADDR_PIDX) >> ADDR_API_SHFT64) & LPTE_API);
 
 	if (flags & PVO_LARGE)
 		pt->pte_hi |= LPTE_BIG;
 
 	pt->pte_lo = pte_lo;
 }
 
 static __inline uint64_t
 moea64_calc_wimg(vm_offset_t pa, vm_memattr_t ma)
 {
 	uint64_t pte_lo;
 	int i;
 
 	if (ma != VM_MEMATTR_DEFAULT) {
 		switch (ma) {
 		case VM_MEMATTR_UNCACHEABLE:
 			return (LPTE_I | LPTE_G);
 		case VM_MEMATTR_WRITE_COMBINING:
 		case VM_MEMATTR_WRITE_BACK:
 		case VM_MEMATTR_PREFETCHABLE:
 			return (LPTE_I);
 		case VM_MEMATTR_WRITE_THROUGH:
 			return (LPTE_W | LPTE_M);
 		}
 	}
 
 	/*
 	 * Assume the page is cache inhibited and access is guarded unless
 	 * it's in our available memory array.
 	 */
 	pte_lo = LPTE_I | LPTE_G;
 	for (i = 0; i < pregions_sz; i++) {
 		if ((pa >= pregions[i].mr_start) &&
 		    (pa < (pregions[i].mr_start + pregions[i].mr_size))) {
 			pte_lo &= ~(LPTE_I | LPTE_G);
 			pte_lo |= LPTE_M;
 			break;
 		}
 	}
 
 	return pte_lo;
 }
 
 /*
  * Quick sort callout for comparing memory regions.
  */
 static int	om_cmp(const void *a, const void *b);
 
 static int
 om_cmp(const void *a, const void *b)
 {
 	const struct	ofw_map *mapa;
 	const struct	ofw_map *mapb;
 
 	mapa = a;
 	mapb = b;
 	if (mapa->om_pa_hi < mapb->om_pa_hi)
 		return (-1);
 	else if (mapa->om_pa_hi > mapb->om_pa_hi)
 		return (1);
 	else if (mapa->om_pa_lo < mapb->om_pa_lo)
 		return (-1);
 	else if (mapa->om_pa_lo > mapb->om_pa_lo)
 		return (1);
 	else
 		return (0);
 }
 
 static void
 moea64_add_ofw_mappings(mmu_t mmup, phandle_t mmu, size_t sz)
 {
 	struct ofw_map	translations[sz/sizeof(struct ofw_map)];
 	register_t	msr;
 	vm_offset_t	off;
 	vm_paddr_t	pa_base;
 	int		i;
 
 	bzero(translations, sz);
 	if (OF_getprop(mmu, "translations", translations, sz) == -1)
 		panic("moea64_bootstrap: can't get ofw translations");
 
 	CTR0(KTR_PMAP, "moea64_add_ofw_mappings: translations");
 	sz /= sizeof(*translations);
 	qsort(translations, sz, sizeof (*translations), om_cmp);
 
 	for (i = 0; i < sz; i++) {
 		CTR3(KTR_PMAP, "translation: pa=%#x va=%#x len=%#x",
 		    (uint32_t)(translations[i].om_pa_lo), translations[i].om_va,
 		    translations[i].om_len);
 
 		if (translations[i].om_pa_lo % PAGE_SIZE)
 			panic("OFW translation not page-aligned!");
 
 		pa_base = translations[i].om_pa_lo;
 
 	      #ifdef __powerpc64__
 		pa_base += (vm_offset_t)translations[i].om_pa_hi << 32;
 	      #else
 		if (translations[i].om_pa_hi)
 			panic("OFW translations above 32-bit boundary!");
 	      #endif
 
 		/* Now enter the pages for this mapping */
 
 		DISABLE_TRANS(msr);
 		for (off = 0; off < translations[i].om_len; off += PAGE_SIZE) {
 			if (moea64_pvo_find_va(kernel_pmap,
 			    translations[i].om_va + off) != NULL)
 				continue;
 
 			moea64_kenter(mmup, translations[i].om_va + off,
 			    pa_base + off);
 		}
 		ENABLE_TRANS(msr);
 	}
 }
 
 #ifdef __powerpc64__
 static void
 moea64_probe_large_page(void)
 {
 	uint16_t pvr = mfpvr() >> 16;
 
 	switch (pvr) {
 	case IBM970:
 	case IBM970FX:
 	case IBM970MP:
 		powerpc_sync(); isync();
 		mtspr(SPR_HID4, mfspr(SPR_HID4) & ~HID4_970_DISABLE_LG_PG);
 		powerpc_sync(); isync();
 		
 		/* FALLTHROUGH */
 	case IBMCELLBE:
 		moea64_large_page_size = 0x1000000; /* 16 MB */
 		moea64_large_page_shift = 24;
 		break;
 	default:
 		moea64_large_page_size = 0;
 	}
 
 	moea64_large_page_mask = moea64_large_page_size - 1;
 }
 
 static void
 moea64_bootstrap_slb_prefault(vm_offset_t va, int large)
 {
 	struct slb *cache;
 	struct slb entry;
 	uint64_t esid, slbe;
 	uint64_t i;
 
 	cache = PCPU_GET(slb);
 	esid = va >> ADDR_SR_SHFT;
 	slbe = (esid << SLBE_ESID_SHIFT) | SLBE_VALID;
 
 	for (i = 0; i < 64; i++) {
 		if (cache[i].slbe == (slbe | i))
 			return;
 	}
 
 	entry.slbe = slbe;
 	entry.slbv = KERNEL_VSID(esid) << SLBV_VSID_SHIFT;
 	if (large)
 		entry.slbv |= SLBV_L;
 
 	slb_insert_kernel(entry.slbe, entry.slbv);
 }
 #endif
 
 static void
 moea64_setup_direct_map(mmu_t mmup, vm_offset_t kernelstart,
     vm_offset_t kernelend)
 {
 	register_t msr;
 	vm_paddr_t pa;
 	vm_offset_t size, off;
 	uint64_t pte_lo;
 	int i;
 
 	if (moea64_large_page_size == 0) 
 		hw_direct_map = 0;
 
 	DISABLE_TRANS(msr);
 	if (hw_direct_map) {
 		PMAP_LOCK(kernel_pmap);
 		for (i = 0; i < pregions_sz; i++) {
 		  for (pa = pregions[i].mr_start; pa < pregions[i].mr_start +
 		     pregions[i].mr_size; pa += moea64_large_page_size) {
 			pte_lo = LPTE_M;
 
 			/*
 			 * Set memory access as guarded if prefetch within
 			 * the page could exit the available physmem area.
 			 */
 			if (pa & moea64_large_page_mask) {
 				pa &= moea64_large_page_mask;
 				pte_lo |= LPTE_G;
 			}
 			if (pa + moea64_large_page_size >
 			    pregions[i].mr_start + pregions[i].mr_size)
 				pte_lo |= LPTE_G;
 
 			moea64_pvo_enter(mmup, kernel_pmap, moea64_upvo_zone,
 				    &moea64_pvo_kunmanaged, pa, pa,
 				    pte_lo, PVO_WIRED | PVO_LARGE);
 		  }
 		}
 		PMAP_UNLOCK(kernel_pmap);
 	} else {
 		size = sizeof(struct pvo_head) * moea64_pteg_count;
 		off = (vm_offset_t)(moea64_pvo_table);
 		for (pa = off; pa < off + size; pa += PAGE_SIZE) 
 			moea64_kenter(mmup, pa, pa);
 		size = BPVO_POOL_SIZE*sizeof(struct pvo_entry);
 		off = (vm_offset_t)(moea64_bpvo_pool);
 		for (pa = off; pa < off + size; pa += PAGE_SIZE) 
 			moea64_kenter(mmup, pa, pa);
 
 		/*
 		 * Map certain important things, like ourselves.
 		 *
 		 * NOTE: We do not map the exception vector space. That code is
 		 * used only in real mode, and leaving it unmapped allows us to
 		 * catch NULL pointer dereferences instead of making NULL a
 		 * valid address.
 		 */
 
 		for (pa = kernelstart & ~PAGE_MASK; pa < kernelend;
 		    pa += PAGE_SIZE) 
 			moea64_kenter(mmup, pa, pa);
 	}
 	ENABLE_TRANS(msr);
 }
 
 void
 moea64_early_bootstrap(mmu_t mmup, vm_offset_t kernelstart, vm_offset_t kernelend)
 {
 	int		i, j;
 	vm_size_t	physsz, hwphyssz;
 
 #ifndef __powerpc64__
 	/* We don't have a direct map since there is no BAT */
 	hw_direct_map = 0;
 
 	/* Make sure battable is zero, since we have no BAT */
 	for (i = 0; i < 16; i++) {
 		battable[i].batu = 0;
 		battable[i].batl = 0;
 	}
 #else
 	moea64_probe_large_page();
 
 	/* Use a direct map if we have large page support */
 	if (moea64_large_page_size > 0)
 		hw_direct_map = 1;
 	else
 		hw_direct_map = 0;
 #endif
 
 	/* Get physical memory regions from firmware */
 	mem_regions(&pregions, &pregions_sz, &regions, &regions_sz);
 	CTR0(KTR_PMAP, "moea64_bootstrap: physical memory");
 
 	if (sizeof(phys_avail)/sizeof(phys_avail[0]) < regions_sz)
 		panic("moea64_bootstrap: phys_avail too small");
 
 	phys_avail_count = 0;
 	physsz = 0;
 	hwphyssz = 0;
 	TUNABLE_ULONG_FETCH("hw.physmem", (u_long *) &hwphyssz);
 	for (i = 0, j = 0; i < regions_sz; i++, j += 2) {
 		CTR3(KTR_PMAP, "region: %#x - %#x (%#x)", regions[i].mr_start,
 		    regions[i].mr_start + regions[i].mr_size,
 		    regions[i].mr_size);
 		if (hwphyssz != 0 &&
 		    (physsz + regions[i].mr_size) >= hwphyssz) {
 			if (physsz < hwphyssz) {
 				phys_avail[j] = regions[i].mr_start;
 				phys_avail[j + 1] = regions[i].mr_start +
 				    hwphyssz - physsz;
 				physsz = hwphyssz;
 				phys_avail_count++;
 			}
 			break;
 		}
 		phys_avail[j] = regions[i].mr_start;
 		phys_avail[j + 1] = regions[i].mr_start + regions[i].mr_size;
 		phys_avail_count++;
 		physsz += regions[i].mr_size;
 	}
 
 	/* Check for overlap with the kernel and exception vectors */
 	for (j = 0; j < 2*phys_avail_count; j+=2) {
 		if (phys_avail[j] < EXC_LAST)
 			phys_avail[j] += EXC_LAST;
 
 		if (kernelstart >= phys_avail[j] &&
 		    kernelstart < phys_avail[j+1]) {
 			if (kernelend < phys_avail[j+1]) {
 				phys_avail[2*phys_avail_count] =
 				    (kernelend & ~PAGE_MASK) + PAGE_SIZE;
 				phys_avail[2*phys_avail_count + 1] =
 				    phys_avail[j+1];
 				phys_avail_count++;
 			}
 
 			phys_avail[j+1] = kernelstart & ~PAGE_MASK;
 		}
 
 		if (kernelend >= phys_avail[j] &&
 		    kernelend < phys_avail[j+1]) {
 			if (kernelstart > phys_avail[j]) {
 				phys_avail[2*phys_avail_count] = phys_avail[j];
 				phys_avail[2*phys_avail_count + 1] =
 				    kernelstart & ~PAGE_MASK;
 				phys_avail_count++;
 			}
 
 			phys_avail[j] = (kernelend & ~PAGE_MASK) + PAGE_SIZE;
 		}
 	}
 
 	physmem = btoc(physsz);
 
 #ifdef PTEGCOUNT
 	moea64_pteg_count = PTEGCOUNT;
 #else
 	moea64_pteg_count = 0x1000;
 
 	while (moea64_pteg_count < physmem)
 		moea64_pteg_count <<= 1;
 
 	moea64_pteg_count >>= 1;
 #endif /* PTEGCOUNT */
 }
 
 void
 moea64_mid_bootstrap(mmu_t mmup, vm_offset_t kernelstart, vm_offset_t kernelend)
 {
 	vm_size_t	size;
 	register_t	msr;
 	int		i;
 
 	/*
 	 * Set PTEG mask
 	 */
 	moea64_pteg_mask = moea64_pteg_count - 1;
 
 	/*
 	 * Allocate pv/overflow lists.
 	 */
 	size = sizeof(struct pvo_head) * moea64_pteg_count;
 
 	moea64_pvo_table = (struct pvo_head *)moea64_bootstrap_alloc(size,
 	    PAGE_SIZE);
 	CTR1(KTR_PMAP, "moea64_bootstrap: PVO table at %p", moea64_pvo_table);
 
 	DISABLE_TRANS(msr);
 	for (i = 0; i < moea64_pteg_count; i++)
 		LIST_INIT(&moea64_pvo_table[i]);
 	ENABLE_TRANS(msr);
 
 	/*
 	 * Initialize the lock that synchronizes access to the pteg and pvo
 	 * tables.
 	 */
 	mtx_init(&moea64_table_mutex, "pmap table", NULL, MTX_DEF |
 	    MTX_RECURSE);
 	mtx_init(&moea64_slb_mutex, "SLB table", NULL, MTX_DEF);
 
 	/*
 	 * Initialise the unmanaged pvo pool.
 	 */
 	moea64_bpvo_pool = (struct pvo_entry *)moea64_bootstrap_alloc(
 		BPVO_POOL_SIZE*sizeof(struct pvo_entry), 0);
 	moea64_bpvo_pool_index = 0;
 
 	/*
 	 * Make sure kernel vsid is allocated as well as VSID 0.
 	 */
 	#ifndef __powerpc64__
 	moea64_vsid_bitmap[(KERNEL_VSIDBITS & (NVSIDS - 1)) / VSID_NBPW]
 		|= 1 << (KERNEL_VSIDBITS % VSID_NBPW);
 	moea64_vsid_bitmap[0] |= 1;
 	#endif
 
 	/*
 	 * Initialize the kernel pmap (which is statically allocated).
 	 */
 	#ifdef __powerpc64__
 	for (i = 0; i < 64; i++) {
 		pcpup->pc_slb[i].slbv = 0;
 		pcpup->pc_slb[i].slbe = 0;
 	}
 	#else
 	for (i = 0; i < 16; i++) 
 		kernel_pmap->pm_sr[i] = EMPTY_SEGMENT + i;
 	#endif
 
 	kernel_pmap->pmap_phys = kernel_pmap;
-	kernel_pmap->pm_active = ~0;
+	CPU_FILL(&kernel_pmap->pm_active);
 
 	PMAP_LOCK_INIT(kernel_pmap);
 
 	/*
 	 * Now map in all the other buffers we allocated earlier
 	 */
 
 	moea64_setup_direct_map(mmup, kernelstart, kernelend);
 }
 
 void
 moea64_late_bootstrap(mmu_t mmup, vm_offset_t kernelstart, vm_offset_t kernelend)
 {
 	ihandle_t	mmui;
 	phandle_t	chosen;
 	phandle_t	mmu;
 	size_t		sz;
 	int		i;
 	vm_offset_t	pa, va;
 	void		*dpcpu;
 
 	/*
 	 * Set up the Open Firmware pmap and add its mappings if not in real
 	 * mode.
 	 */
 
 	chosen = OF_finddevice("/chosen");
 	if (chosen != -1 && OF_getprop(chosen, "mmu", &mmui, 4) != -1) {
 	    mmu = OF_instance_to_package(mmui);
 	    if (mmu == -1 || (sz = OF_getproplen(mmu, "translations")) == -1)
 		sz = 0;
 	    if (sz > 6144 /* tmpstksz - 2 KB headroom */)
 		panic("moea64_bootstrap: too many ofw translations");
 
 	    if (sz > 0)
 		moea64_add_ofw_mappings(mmup, mmu, sz);
 	}
 
 	/*
 	 * Calculate the last available physical address.
 	 */
 	for (i = 0; phys_avail[i + 2] != 0; i += 2)
 		;
 	Maxmem = powerpc_btop(phys_avail[i + 1]);
 
 	/*
 	 * Initialize MMU and remap early physical mappings
 	 */
 	MMU_CPU_BOOTSTRAP(mmup,0);
 	mtmsr(mfmsr() | PSL_DR | PSL_IR);
 	pmap_bootstrapped++;
 	bs_remap_earlyboot();
 
 	/*
 	 * Set the start and end of kva.
 	 */
 	virtual_avail = VM_MIN_KERNEL_ADDRESS;
 	virtual_end = VM_MAX_SAFE_KERNEL_ADDRESS; 
 
 	/*
 	 * Map the entire KVA range into the SLB. We must not fault there.
 	 */
 	#ifdef __powerpc64__
 	for (va = virtual_avail; va < virtual_end; va += SEGMENT_LENGTH)
 		moea64_bootstrap_slb_prefault(va, 0);
 	#endif
 
 	/*
 	 * Figure out how far we can extend virtual_end into segment 16
 	 * without running into existing mappings. Segment 16 is guaranteed
 	 * to contain neither RAM nor devices (at least on Apple hardware),
 	 * but will generally contain some OFW mappings we should not
 	 * step on.
 	 */
 
 	#ifndef __powerpc64__	/* KVA is in high memory on PPC64 */
 	PMAP_LOCK(kernel_pmap);
 	while (virtual_end < VM_MAX_KERNEL_ADDRESS &&
 	    moea64_pvo_find_va(kernel_pmap, virtual_end+1) == NULL)
 		virtual_end += PAGE_SIZE;
 	PMAP_UNLOCK(kernel_pmap);
 	#endif
 
 	/*
 	 * Allocate a kernel stack with a guard page for thread0 and map it
 	 * into the kernel page map.
 	 */
 	pa = moea64_bootstrap_alloc(KSTACK_PAGES * PAGE_SIZE, PAGE_SIZE);
 	va = virtual_avail + KSTACK_GUARD_PAGES * PAGE_SIZE;
 	virtual_avail = va + KSTACK_PAGES * PAGE_SIZE;
 	CTR2(KTR_PMAP, "moea64_bootstrap: kstack0 at %#x (%#x)", pa, va);
 	thread0.td_kstack = va;
 	thread0.td_kstack_pages = KSTACK_PAGES;
 	for (i = 0; i < KSTACK_PAGES; i++) {
 		moea64_kenter(mmup, va, pa);
 		pa += PAGE_SIZE;
 		va += PAGE_SIZE;
 	}
 
 	/*
 	 * Allocate virtual address space for the message buffer.
 	 */
 	pa = msgbuf_phys = moea64_bootstrap_alloc(msgbufsize, PAGE_SIZE);
 	msgbufp = (struct msgbuf *)virtual_avail;
 	va = virtual_avail;
 	virtual_avail += round_page(msgbufsize);
 	while (va < virtual_avail) {
 		moea64_kenter(mmup, va, pa);
 		pa += PAGE_SIZE;
 		va += PAGE_SIZE;
 	}
 
 	/*
 	 * Allocate virtual address space for the dynamic percpu area.
 	 */
 	pa = moea64_bootstrap_alloc(DPCPU_SIZE, PAGE_SIZE);
 	dpcpu = (void *)virtual_avail;
 	va = virtual_avail;
 	virtual_avail += DPCPU_SIZE;
 	while (va < virtual_avail) {
 		moea64_kenter(mmup, va, pa);
 		pa += PAGE_SIZE;
 		va += PAGE_SIZE;
 	}
 	dpcpu_init(dpcpu, 0);
 
 	/*
 	 * Allocate some things for page zeroing. We put this directly
 	 * in the page table, marked with LPTE_LOCKED, to keep the PVO
 	 * book-keeping and other parts of the VM system from even
 	 * knowing that this hack exists.
 	 */
 
 	if (!hw_direct_map) {
 		mtx_init(&moea64_scratchpage_mtx, "pvo zero page", NULL,
 		    MTX_DEF);
 		for (i = 0; i < 2; i++) {
 			moea64_scratchpage_va[i] = (virtual_end+1) - PAGE_SIZE;
 			virtual_end -= PAGE_SIZE;
 
 			moea64_kenter(mmup, moea64_scratchpage_va[i], 0);
 
 			moea64_scratchpage_pvo[i] = moea64_pvo_find_va(
 			    kernel_pmap, (vm_offset_t)moea64_scratchpage_va[i]);
 			LOCK_TABLE();
 			moea64_scratchpage_pte[i] = MOEA64_PVO_TO_PTE(
 			    mmup, moea64_scratchpage_pvo[i]);
 			moea64_scratchpage_pvo[i]->pvo_pte.lpte.pte_hi
 			    |= LPTE_LOCKED;
 			MOEA64_PTE_CHANGE(mmup, moea64_scratchpage_pte[i],
 			    &moea64_scratchpage_pvo[i]->pvo_pte.lpte,
 			    moea64_scratchpage_pvo[i]->pvo_vpn);
 			UNLOCK_TABLE();
 		}
 	}
 }
 
 /*
  * Activate a user pmap.  The pmap must be activated before its address
  * space can be accessed in any way.
  */
 void
 moea64_activate(mmu_t mmu, struct thread *td)
 {
 	pmap_t	pm;
 
 	pm = &td->td_proc->p_vmspace->vm_pmap;
-	pm->pm_active |= PCPU_GET(cpumask);
+	sched_pin();
+	CPU_OR(&pm->pm_active, PCPU_PTR(cpumask));
+	sched_unpin();
 
 	#ifdef __powerpc64__
 	PCPU_SET(userslb, pm->pm_slb);
 	#else
 	PCPU_SET(curpmap, pm->pmap_phys);
 	#endif
 }
 
 void
 moea64_deactivate(mmu_t mmu, struct thread *td)
 {
 	pmap_t	pm;
 
 	pm = &td->td_proc->p_vmspace->vm_pmap;
-	pm->pm_active &= ~(PCPU_GET(cpumask));
+	sched_pin();
+	CPU_NAND(&pm->pm_active, PCPU_PTR(cpumask));
+	sched_unpin();
 	#ifdef __powerpc64__
 	PCPU_SET(userslb, NULL);
 	#else
 	PCPU_SET(curpmap, NULL);
 	#endif
 }
 
 void
 moea64_change_wiring(mmu_t mmu, pmap_t pm, vm_offset_t va, boolean_t wired)
 {
 	struct	pvo_entry *pvo;
 	uintptr_t pt;
 	uint64_t vsid;
 	int	i, ptegidx;
 
 	PMAP_LOCK(pm);
 	pvo = moea64_pvo_find_va(pm, va & ~ADDR_POFF);
 
 	if (pvo != NULL) {
 		LOCK_TABLE();
 		pt = MOEA64_PVO_TO_PTE(mmu, pvo);
 
 		if (wired) {
 			if ((pvo->pvo_vaddr & PVO_WIRED) == 0)
 				pm->pm_stats.wired_count++;
 			pvo->pvo_vaddr |= PVO_WIRED;
 			pvo->pvo_pte.lpte.pte_hi |= LPTE_WIRED;
 		} else {
 			if ((pvo->pvo_vaddr & PVO_WIRED) != 0)
 				pm->pm_stats.wired_count--;
 			pvo->pvo_vaddr &= ~PVO_WIRED;
 			pvo->pvo_pte.lpte.pte_hi &= ~LPTE_WIRED;
 		}
 
 		if (pt != -1) {
 			/* Update wiring flag in page table. */
 			MOEA64_PTE_CHANGE(mmu, pt, &pvo->pvo_pte.lpte,
 			    pvo->pvo_vpn);
 		} else if (wired) {
 			/*
 			 * If we are wiring the page, and it wasn't in the
 			 * page table before, add it.
 			 */
 			vsid = PVO_VSID(pvo);
 			ptegidx = va_to_pteg(vsid, PVO_VADDR(pvo),
 			    pvo->pvo_vaddr & PVO_LARGE);
 
 			i = MOEA64_PTE_INSERT(mmu, ptegidx, &pvo->pvo_pte.lpte);
 			
 			if (i >= 0) {
 				PVO_PTEGIDX_CLR(pvo);
 				PVO_PTEGIDX_SET(pvo, i);
 			}
 		}
 			
 		UNLOCK_TABLE();
 	}
 	PMAP_UNLOCK(pm);
 }
 
 /*
  * This goes through and sets the physical address of our
  * special scratch PTE to the PA we want to zero or copy. Because
  * of locking issues (this can get called in pvo_enter() by
  * the UMA allocator), we can't use most other utility functions here
  */
 
 static __inline
 void moea64_set_scratchpage_pa(mmu_t mmup, int which, vm_offset_t pa) {
 
 	KASSERT(!hw_direct_map, ("Using OEA64 scratchpage with a direct map!"));
 	mtx_assert(&moea64_scratchpage_mtx, MA_OWNED);
 
 	moea64_scratchpage_pvo[which]->pvo_pte.lpte.pte_lo &=
 	    ~(LPTE_WIMG | LPTE_RPGN);
 	moea64_scratchpage_pvo[which]->pvo_pte.lpte.pte_lo |=
 	    moea64_calc_wimg(pa, VM_MEMATTR_DEFAULT) | (uint64_t)pa;
 	MOEA64_PTE_CHANGE(mmup, moea64_scratchpage_pte[which],
 	    &moea64_scratchpage_pvo[which]->pvo_pte.lpte,
 	    moea64_scratchpage_pvo[which]->pvo_vpn);
 	isync();
 }
 
 void
 moea64_copy_page(mmu_t mmu, vm_page_t msrc, vm_page_t mdst)
 {
 	vm_offset_t	dst;
 	vm_offset_t	src;
 
 	dst = VM_PAGE_TO_PHYS(mdst);
 	src = VM_PAGE_TO_PHYS(msrc);
 
 	if (hw_direct_map) {
 		kcopy((void *)src, (void *)dst, PAGE_SIZE);
 	} else {
 		mtx_lock(&moea64_scratchpage_mtx);
 
 		moea64_set_scratchpage_pa(mmu, 0, src);
 		moea64_set_scratchpage_pa(mmu, 1, dst);
 
 		kcopy((void *)moea64_scratchpage_va[0], 
 		    (void *)moea64_scratchpage_va[1], PAGE_SIZE);
 
 		mtx_unlock(&moea64_scratchpage_mtx);
 	}
 }
 
 void
 moea64_zero_page_area(mmu_t mmu, vm_page_t m, int off, int size)
 {
 	vm_offset_t pa = VM_PAGE_TO_PHYS(m);
 
 	if (size + off > PAGE_SIZE)
 		panic("moea64_zero_page_area: size + off > PAGE_SIZE");
 
 	if (hw_direct_map) {
 		bzero((caddr_t)pa + off, size);
 	} else {
 		mtx_lock(&moea64_scratchpage_mtx);
 		moea64_set_scratchpage_pa(mmu, 0, pa);
 		bzero((caddr_t)moea64_scratchpage_va[0] + off, size);
 		mtx_unlock(&moea64_scratchpage_mtx);
 	}
 }
 
 /*
  * Zero a page of physical memory by temporarily mapping it
  */
 void
 moea64_zero_page(mmu_t mmu, vm_page_t m)
 {
 	vm_offset_t pa = VM_PAGE_TO_PHYS(m);
 	vm_offset_t va, off;
 
 	if (!hw_direct_map) {
 		mtx_lock(&moea64_scratchpage_mtx);
 
 		moea64_set_scratchpage_pa(mmu, 0, pa);
 		va = moea64_scratchpage_va[0];
 	} else {
 		va = pa;
 	}
 
 	for (off = 0; off < PAGE_SIZE; off += cacheline_size)
 		__asm __volatile("dcbz 0,%0" :: "r"(va + off));
 
 	if (!hw_direct_map)
 		mtx_unlock(&moea64_scratchpage_mtx);
 }
 
 void
 moea64_zero_page_idle(mmu_t mmu, vm_page_t m)
 {
 
 	moea64_zero_page(mmu, m);
 }
 
 /*
  * Map the given physical page at the specified virtual address in the
  * target pmap with the protection requested.  If specified the page
  * will be wired down.
  */
 void
 moea64_enter(mmu_t mmu, pmap_t pmap, vm_offset_t va, vm_page_t m, 
     vm_prot_t prot, boolean_t wired)
 {
 
 	vm_page_lock_queues();
 	PMAP_LOCK(pmap);
 	moea64_enter_locked(mmu, pmap, va, m, prot, wired);
 	vm_page_unlock_queues();
 	PMAP_UNLOCK(pmap);
 }
 
 /*
  * Map the given physical page at the specified virtual address in the
  * target pmap with the protection requested.  If specified the page
  * will be wired down.
  *
  * The page queues and pmap must be locked.
  */
 
 static void
 moea64_enter_locked(mmu_t mmu, pmap_t pmap, vm_offset_t va, vm_page_t m,
     vm_prot_t prot, boolean_t wired)
 {
 	struct		pvo_head *pvo_head;
 	uma_zone_t	zone;
 	vm_page_t	pg;
 	uint64_t	pte_lo;
 	u_int		pvo_flags;
 	int		error;
 
 	if (!moea64_initialized) {
 		pvo_head = &moea64_pvo_kunmanaged;
 		pg = NULL;
 		zone = moea64_upvo_zone;
 		pvo_flags = 0;
 	} else {
 		pvo_head = vm_page_to_pvoh(m);
 		pg = m;
 		zone = moea64_mpvo_zone;
 		pvo_flags = PVO_MANAGED;
 	}
 
 	if (pmap_bootstrapped)
 		mtx_assert(&vm_page_queue_mtx, MA_OWNED);
 	PMAP_LOCK_ASSERT(pmap, MA_OWNED);
 	KASSERT((m->flags & (PG_FICTITIOUS | PG_UNMANAGED)) != 0 ||
 	    (m->oflags & VPO_BUSY) != 0 || VM_OBJECT_LOCKED(m->object),
 	    ("moea64_enter_locked: page %p is not busy", m));
 
 	/* XXX change the pvo head for fake pages */
 	if ((m->flags & PG_FICTITIOUS) == PG_FICTITIOUS) {
 		pvo_flags &= ~PVO_MANAGED;
 		pvo_head = &moea64_pvo_kunmanaged;
 		zone = moea64_upvo_zone;
 	}
 
 	pte_lo = moea64_calc_wimg(VM_PAGE_TO_PHYS(m), pmap_page_get_memattr(m));
 
 	if (prot & VM_PROT_WRITE) {
 		pte_lo |= LPTE_BW;
 		if (pmap_bootstrapped &&
 		    (m->flags & (PG_FICTITIOUS | PG_UNMANAGED)) == 0)
 			vm_page_flag_set(m, PG_WRITEABLE);
 	} else
 		pte_lo |= LPTE_BR;
 
 	if ((prot & VM_PROT_EXECUTE) == 0)
 		pte_lo |= LPTE_NOEXEC;
 
 	if (wired)
 		pvo_flags |= PVO_WIRED;
 
 	if ((m->flags & PG_FICTITIOUS) != 0)
 		pvo_flags |= PVO_FAKE;
 
 	error = moea64_pvo_enter(mmu, pmap, zone, pvo_head, va,
 	    VM_PAGE_TO_PHYS(m), pte_lo, pvo_flags);
 
 	/*
 	 * Flush the page from the instruction cache if this page is
 	 * mapped executable and cacheable.
 	 */
 	if ((pte_lo & (LPTE_I | LPTE_G | LPTE_NOEXEC)) == 0)
 		moea64_syncicache(mmu, pmap, va, VM_PAGE_TO_PHYS(m), PAGE_SIZE);
 }
 
 static void
 moea64_syncicache(mmu_t mmu, pmap_t pmap, vm_offset_t va, vm_offset_t pa,
     vm_size_t sz)
 {
 
 	/*
 	 * This is much trickier than on older systems because
 	 * we can't sync the icache on physical addresses directly
 	 * without a direct map. Instead we check a couple of cases
 	 * where the memory is already mapped in and, failing that,
 	 * use the same trick we use for page zeroing to create
 	 * a temporary mapping for this physical address.
 	 */
 
 	if (!pmap_bootstrapped) {
 		/*
 		 * If PMAP is not bootstrapped, we are likely to be
 		 * in real mode.
 		 */
 		__syncicache((void *)pa, sz);
 	} else if (pmap == kernel_pmap) {
 		__syncicache((void *)va, sz);
 	} else if (hw_direct_map) {
 		__syncicache((void *)pa, sz);
 	} else {
 		/* Use the scratch page to set up a temp mapping */
 
 		mtx_lock(&moea64_scratchpage_mtx);
 
 		moea64_set_scratchpage_pa(mmu, 1, pa & ~ADDR_POFF);
 		__syncicache((void *)(moea64_scratchpage_va[1] + 
 		    (va & ADDR_POFF)), sz);
 
 		mtx_unlock(&moea64_scratchpage_mtx);
 	}
 }
 
 /*
  * Maps a sequence of resident pages belonging to the same object.
  * The sequence begins with the given page m_start.  This page is
  * mapped at the given virtual address start.  Each subsequent page is
  * mapped at a virtual address that is offset from start by the same
  * amount as the page is offset from m_start within the object.  The
  * last page in the sequence is the page with the largest offset from
  * m_start that can be mapped at a virtual address less than the given
  * virtual address end.  Not every virtual page between start and end
  * is mapped; only those for which a resident page exists with the
  * corresponding offset from m_start are mapped.
  */
 void
 moea64_enter_object(mmu_t mmu, pmap_t pm, vm_offset_t start, vm_offset_t end,
     vm_page_t m_start, vm_prot_t prot)
 {
 	vm_page_t m;
 	vm_pindex_t diff, psize;
 
 	psize = atop(end - start);
 	m = m_start;
 	vm_page_lock_queues();
 	PMAP_LOCK(pm);
 	while (m != NULL && (diff = m->pindex - m_start->pindex) < psize) {
 		moea64_enter_locked(mmu, pm, start + ptoa(diff), m, prot &
 		    (VM_PROT_READ | VM_PROT_EXECUTE), FALSE);
 		m = TAILQ_NEXT(m, listq);
 	}
 	vm_page_unlock_queues();
 	PMAP_UNLOCK(pm);
 }
 
 void
 moea64_enter_quick(mmu_t mmu, pmap_t pm, vm_offset_t va, vm_page_t m,
     vm_prot_t prot)
 {
 
 	vm_page_lock_queues();
 	PMAP_LOCK(pm);
 	moea64_enter_locked(mmu, pm, va, m,
 	    prot & (VM_PROT_READ | VM_PROT_EXECUTE), FALSE);
 	vm_page_unlock_queues();
 	PMAP_UNLOCK(pm);
 }
 
 vm_paddr_t
 moea64_extract(mmu_t mmu, pmap_t pm, vm_offset_t va)
 {
 	struct	pvo_entry *pvo;
 	vm_paddr_t pa;
 
 	PMAP_LOCK(pm);
 	pvo = moea64_pvo_find_va(pm, va);
 	if (pvo == NULL)
 		pa = 0;
 	else
 		pa = (pvo->pvo_pte.lpte.pte_lo & LPTE_RPGN) |
 		    (va - PVO_VADDR(pvo));
 	PMAP_UNLOCK(pm);
 	return (pa);
 }
 
 /*
  * Atomically extract and hold the physical page with the given
  * pmap and virtual address pair if that mapping permits the given
  * protection.
  */
 vm_page_t
 moea64_extract_and_hold(mmu_t mmu, pmap_t pmap, vm_offset_t va, vm_prot_t prot)
 {
 	struct	pvo_entry *pvo;
 	vm_page_t m;
         vm_paddr_t pa;
         
 	m = NULL;
 	pa = 0;
 	PMAP_LOCK(pmap);
 retry:
 	pvo = moea64_pvo_find_va(pmap, va & ~ADDR_POFF);
 	if (pvo != NULL && (pvo->pvo_pte.lpte.pte_hi & LPTE_VALID) &&
 	    ((pvo->pvo_pte.lpte.pte_lo & LPTE_PP) == LPTE_RW ||
 	     (prot & VM_PROT_WRITE) == 0)) {
 		if (vm_page_pa_tryrelock(pmap,
 			pvo->pvo_pte.lpte.pte_lo & LPTE_RPGN, &pa))
 			goto retry;
 		m = PHYS_TO_VM_PAGE(pvo->pvo_pte.lpte.pte_lo & LPTE_RPGN);
 		vm_page_hold(m);
 	}
 	PA_UNLOCK_COND(pa);
 	PMAP_UNLOCK(pmap);
 	return (m);
 }
 
 static mmu_t installed_mmu;
 
 static void *
 moea64_uma_page_alloc(uma_zone_t zone, int bytes, u_int8_t *flags, int wait) 
 {
 	/*
 	 * This entire routine is a horrible hack to avoid bothering kmem
 	 * for new KVA addresses. Because this can get called from inside
 	 * kmem allocation routines, calling kmem for a new address here
 	 * can lead to multiply locking non-recursive mutexes.
 	 */
 	static vm_pindex_t color;
         vm_offset_t va;
 
         vm_page_t m;
         int pflags, needed_lock;
 
 	*flags = UMA_SLAB_PRIV;
 	needed_lock = !PMAP_LOCKED(kernel_pmap);
 
 	if (needed_lock)
 		PMAP_LOCK(kernel_pmap);
 
         if ((wait & (M_NOWAIT|M_USE_RESERVE)) == M_NOWAIT)
                 pflags = VM_ALLOC_INTERRUPT | VM_ALLOC_WIRED;
         else
                 pflags = VM_ALLOC_SYSTEM | VM_ALLOC_WIRED;
         if (wait & M_ZERO)
                 pflags |= VM_ALLOC_ZERO;
 
         for (;;) {
                 m = vm_page_alloc(NULL, color++, pflags | VM_ALLOC_NOOBJ);
                 if (m == NULL) {
                         if (wait & M_NOWAIT)
                                 return (NULL);
                         VM_WAIT;
                 } else
                         break;
         }
 
 	va = VM_PAGE_TO_PHYS(m);
 
 	moea64_pvo_enter(installed_mmu, kernel_pmap, moea64_upvo_zone,
 	    &moea64_pvo_kunmanaged, va, VM_PAGE_TO_PHYS(m), LPTE_M,
 	    PVO_WIRED | PVO_BOOTSTRAP);
 
 	if (needed_lock)
 		PMAP_UNLOCK(kernel_pmap);
 	
 	if ((wait & M_ZERO) && (m->flags & PG_ZERO) == 0)
                 bzero((void *)va, PAGE_SIZE);
 
 	return (void *)va;
 }
 
 void
 moea64_init(mmu_t mmu)
 {
 
 	CTR0(KTR_PMAP, "moea64_init");
 
 	moea64_upvo_zone = uma_zcreate("UPVO entry", sizeof (struct pvo_entry),
 	    NULL, NULL, NULL, NULL, UMA_ALIGN_PTR,
 	    UMA_ZONE_VM | UMA_ZONE_NOFREE);
 	moea64_mpvo_zone = uma_zcreate("MPVO entry", sizeof(struct pvo_entry),
 	    NULL, NULL, NULL, NULL, UMA_ALIGN_PTR,
 	    UMA_ZONE_VM | UMA_ZONE_NOFREE);
 
 	if (!hw_direct_map) {
 		installed_mmu = mmu;
 		uma_zone_set_allocf(moea64_upvo_zone,moea64_uma_page_alloc);
 		uma_zone_set_allocf(moea64_mpvo_zone,moea64_uma_page_alloc);
 	}
 
 	moea64_initialized = TRUE;
 }
 
 boolean_t
 moea64_is_referenced(mmu_t mmu, vm_page_t m)
 {
 
 	KASSERT((m->flags & (PG_FICTITIOUS | PG_UNMANAGED)) == 0,
 	    ("moea64_is_referenced: page %p is not managed", m));
 	return (moea64_query_bit(mmu, m, PTE_REF));
 }
 
 boolean_t
 moea64_is_modified(mmu_t mmu, vm_page_t m)
 {
 
 	KASSERT((m->flags & (PG_FICTITIOUS | PG_UNMANAGED)) == 0,
 	    ("moea64_is_modified: page %p is not managed", m));
 
 	/*
 	 * If the page is not VPO_BUSY, then PG_WRITEABLE cannot be
 	 * concurrently set while the object is locked.  Thus, if PG_WRITEABLE
 	 * is clear, no PTEs can have LPTE_CHG set.
 	 */
 	VM_OBJECT_LOCK_ASSERT(m->object, MA_OWNED);
 	if ((m->oflags & VPO_BUSY) == 0 &&
 	    (m->flags & PG_WRITEABLE) == 0)
 		return (FALSE);
 	return (moea64_query_bit(mmu, m, LPTE_CHG));
 }
 
 boolean_t
 moea64_is_prefaultable(mmu_t mmu, pmap_t pmap, vm_offset_t va)
 {
 	struct pvo_entry *pvo;
 	boolean_t rv;
 
 	PMAP_LOCK(pmap);
 	pvo = moea64_pvo_find_va(pmap, va & ~ADDR_POFF);
 	rv = pvo == NULL || (pvo->pvo_pte.lpte.pte_hi & LPTE_VALID) == 0;
 	PMAP_UNLOCK(pmap);
 	return (rv);
 }
 
 void
 moea64_clear_reference(mmu_t mmu, vm_page_t m)
 {
 
 	KASSERT((m->flags & (PG_FICTITIOUS | PG_UNMANAGED)) == 0,
 	    ("moea64_clear_reference: page %p is not managed", m));
 	moea64_clear_bit(mmu, m, LPTE_REF);
 }
 
 void
 moea64_clear_modify(mmu_t mmu, vm_page_t m)
 {
 
 	KASSERT((m->flags & (PG_FICTITIOUS | PG_UNMANAGED)) == 0,
 	    ("moea64_clear_modify: page %p is not managed", m));
 	VM_OBJECT_LOCK_ASSERT(m->object, MA_OWNED);
 	KASSERT((m->oflags & VPO_BUSY) == 0,
 	    ("moea64_clear_modify: page %p is busy", m));
 
 	/*
 	 * If the page is not PG_WRITEABLE, then no PTEs can have LPTE_CHG
 	 * set.  If the object containing the page is locked and the page is
 	 * not VPO_BUSY, then PG_WRITEABLE cannot be concurrently set.
 	 */
 	if ((m->flags & PG_WRITEABLE) == 0)
 		return;
 	moea64_clear_bit(mmu, m, LPTE_CHG);
 }
 
 /*
  * Clear the write and modified bits in each of the given page's mappings.
  */
 void
 moea64_remove_write(mmu_t mmu, vm_page_t m)
 {
 	struct	pvo_entry *pvo;
 	uintptr_t pt;
 	pmap_t	pmap;
 	uint64_t lo;
 
 	KASSERT((m->flags & (PG_FICTITIOUS | PG_UNMANAGED)) == 0,
 	    ("moea64_remove_write: page %p is not managed", m));
 
 	/*
 	 * If the page is not VPO_BUSY, then PG_WRITEABLE cannot be set by
 	 * another thread while the object is locked.  Thus, if PG_WRITEABLE
 	 * is clear, no page table entries need updating.
 	 */
 	VM_OBJECT_LOCK_ASSERT(m->object, MA_OWNED);
 	if ((m->oflags & VPO_BUSY) == 0 &&
 	    (m->flags & PG_WRITEABLE) == 0)
 		return;
 	vm_page_lock_queues();
 	lo = moea64_attr_fetch(m);
 	powerpc_sync();
 	LIST_FOREACH(pvo, vm_page_to_pvoh(m), pvo_vlink) {
 		pmap = pvo->pvo_pmap;
 		PMAP_LOCK(pmap);
 		LOCK_TABLE();
 		if ((pvo->pvo_pte.lpte.pte_lo & LPTE_PP) != LPTE_BR) {
 			pt = MOEA64_PVO_TO_PTE(mmu, pvo);
 			pvo->pvo_pte.lpte.pte_lo &= ~LPTE_PP;
 			pvo->pvo_pte.lpte.pte_lo |= LPTE_BR;
 			if (pt != -1) {
 				MOEA64_PTE_SYNCH(mmu, pt, &pvo->pvo_pte.lpte);
 				lo |= pvo->pvo_pte.lpte.pte_lo;
 				pvo->pvo_pte.lpte.pte_lo &= ~LPTE_CHG;
 				MOEA64_PTE_CHANGE(mmu, pt,
 				    &pvo->pvo_pte.lpte, pvo->pvo_vpn);
 				if (pvo->pvo_pmap == kernel_pmap)
 					isync();
 			}
 		}
 		UNLOCK_TABLE();
 		PMAP_UNLOCK(pmap);
 	}
 	if ((lo & LPTE_CHG) != 0) {
 		moea64_attr_clear(m, LPTE_CHG);
 		vm_page_dirty(m);
 	}
 	vm_page_flag_clear(m, PG_WRITEABLE);
 	vm_page_unlock_queues();
 }
 
 /*
  *	moea64_ts_referenced:
  *
  *	Return a count of reference bits for a page, clearing those bits.
  *	It is not necessary for every reference bit to be cleared, but it
  *	is necessary that 0 only be returned when there are truly no
  *	reference bits set.
  *
  *	XXX: The exact number of bits to check and clear is a matter that
  *	should be tested and standardized at some point in the future for
  *	optimal aging of shared pages.
  */
 boolean_t
 moea64_ts_referenced(mmu_t mmu, vm_page_t m)
 {
 
 	KASSERT((m->flags & (PG_FICTITIOUS | PG_UNMANAGED)) == 0,
 	    ("moea64_ts_referenced: page %p is not managed", m));
 	return (moea64_clear_bit(mmu, m, LPTE_REF));
 }
 
 /*
  * Modify the WIMG settings of all mappings for a page.
  */
 void
 moea64_page_set_memattr(mmu_t mmu, vm_page_t m, vm_memattr_t ma)
 {
 	struct	pvo_entry *pvo;
 	struct  pvo_head *pvo_head;
 	uintptr_t pt;
 	pmap_t	pmap;
 	uint64_t lo;
 
 	if (m->flags & PG_FICTITIOUS) {
 		m->md.mdpg_cache_attrs = ma;
 		return;
 	}
 
 	vm_page_lock_queues();
 	pvo_head = vm_page_to_pvoh(m);
 	lo = moea64_calc_wimg(VM_PAGE_TO_PHYS(m), ma);
 	LIST_FOREACH(pvo, pvo_head, pvo_vlink) {
 		pmap = pvo->pvo_pmap;
 		PMAP_LOCK(pmap);
 		LOCK_TABLE();
 		pt = MOEA64_PVO_TO_PTE(mmu, pvo);
 		pvo->pvo_pte.lpte.pte_lo &= ~LPTE_WIMG;
 		pvo->pvo_pte.lpte.pte_lo |= lo;
 		if (pt != -1) {
 			MOEA64_PTE_CHANGE(mmu, pt, &pvo->pvo_pte.lpte,
 			    pvo->pvo_vpn);
 			if (pvo->pvo_pmap == kernel_pmap)
 				isync();
 		}
 		UNLOCK_TABLE();
 		PMAP_UNLOCK(pmap);
 	}
 	m->md.mdpg_cache_attrs = ma;
 	vm_page_unlock_queues();
 }
 
 /*
  * Map a wired page into kernel virtual address space.
  */
 void
 moea64_kenter_attr(mmu_t mmu, vm_offset_t va, vm_offset_t pa, vm_memattr_t ma)
 {
 	uint64_t	pte_lo;
 	int		error;	
 
 	pte_lo = moea64_calc_wimg(pa, ma);
 
 	PMAP_LOCK(kernel_pmap);
 	error = moea64_pvo_enter(mmu, kernel_pmap, moea64_upvo_zone,
 	    &moea64_pvo_kunmanaged, va, pa, pte_lo, PVO_WIRED);
 
 	if (error != 0 && error != ENOENT)
 		panic("moea64_kenter: failed to enter va %#zx pa %#zx: %d", va,
 		    pa, error);
 
 	/*
 	 * Flush the memory from the instruction cache.
 	 */
 	if ((pte_lo & (LPTE_I | LPTE_G)) == 0)
 		__syncicache((void *)va, PAGE_SIZE);
 	PMAP_UNLOCK(kernel_pmap);
 }
 
 void
 moea64_kenter(mmu_t mmu, vm_offset_t va, vm_offset_t pa)
 {
 
 	moea64_kenter_attr(mmu, va, pa, VM_MEMATTR_DEFAULT);
 }
 
 /*
  * Extract the physical page address associated with the given kernel virtual
  * address.
  */
 vm_offset_t
 moea64_kextract(mmu_t mmu, vm_offset_t va)
 {
 	struct		pvo_entry *pvo;
 	vm_paddr_t pa;
 
 	/*
 	 * Shortcut the direct-mapped case when applicable.  We never put
 	 * anything but 1:1 mappings below VM_MIN_KERNEL_ADDRESS.
 	 */
 	if (va < VM_MIN_KERNEL_ADDRESS)
 		return (va);
 
 	PMAP_LOCK(kernel_pmap);
 	pvo = moea64_pvo_find_va(kernel_pmap, va);
 	KASSERT(pvo != NULL, ("moea64_kextract: no addr found for %#" PRIxPTR,
 	    va));
 	pa = (pvo->pvo_pte.lpte.pte_lo & LPTE_RPGN) + (va - PVO_VADDR(pvo));
 	PMAP_UNLOCK(kernel_pmap);
 	return (pa);
 }
 
 /*
  * Remove a wired page from kernel virtual address space.
  */
 void
 moea64_kremove(mmu_t mmu, vm_offset_t va)
 {
 	moea64_remove(mmu, kernel_pmap, va, va + PAGE_SIZE);
 }
 
 /*
  * Map a range of physical addresses into kernel virtual address space.
  *
  * The value passed in *virt is a suggested virtual address for the mapping.
  * Architectures which can support a direct-mapped physical to virtual region
  * can return the appropriate address within that region, leaving '*virt'
  * unchanged.  We cannot and therefore do not; *virt is updated with the
  * first usable address after the mapped region.
  */
 vm_offset_t
 moea64_map(mmu_t mmu, vm_offset_t *virt, vm_offset_t pa_start,
     vm_offset_t pa_end, int prot)
 {
 	vm_offset_t	sva, va;
 
 	sva = *virt;
 	va = sva;
 	for (; pa_start < pa_end; pa_start += PAGE_SIZE, va += PAGE_SIZE)
 		moea64_kenter(mmu, va, pa_start);
 	*virt = va;
 
 	return (sva);
 }
 
 /*
  * Returns true if the pmap's pv is one of the first
  * 16 pvs linked to from this page.  This count may
  * be changed upwards or downwards in the future; it
  * is only necessary that true be returned for a small
  * subset of pmaps for proper page aging.
  */
 boolean_t
 moea64_page_exists_quick(mmu_t mmu, pmap_t pmap, vm_page_t m)
 {
         int loops;
 	struct pvo_entry *pvo;
 	boolean_t rv;
 
 	KASSERT((m->flags & (PG_FICTITIOUS | PG_UNMANAGED)) == 0,
 	    ("moea64_page_exists_quick: page %p is not managed", m));
 	loops = 0;
 	rv = FALSE;
 	vm_page_lock_queues();
 	LIST_FOREACH(pvo, vm_page_to_pvoh(m), pvo_vlink) {
 		if (pvo->pvo_pmap == pmap) {
 			rv = TRUE;
 			break;
 		}
 		if (++loops >= 16)
 			break;
 	}
 	vm_page_unlock_queues();
 	return (rv);
 }
 
 /*
  * Return the number of managed mappings to the given physical page
  * that are wired.
  */
 int
 moea64_page_wired_mappings(mmu_t mmu, vm_page_t m)
 {
 	struct pvo_entry *pvo;
 	int count;
 
 	count = 0;
 	if ((m->flags & PG_FICTITIOUS) != 0)
 		return (count);
 	vm_page_lock_queues();
 	LIST_FOREACH(pvo, vm_page_to_pvoh(m), pvo_vlink)
 		if ((pvo->pvo_vaddr & PVO_WIRED) != 0)
 			count++;
 	vm_page_unlock_queues();
 	return (count);
 }
 
 static uintptr_t	moea64_vsidcontext;
 
 uintptr_t
 moea64_get_unique_vsid(void) {
 	u_int entropy;
 	register_t hash;
 	uint32_t mask;
 	int i;
 
 	entropy = 0;
 	__asm __volatile("mftb %0" : "=r"(entropy));
 
 	mtx_lock(&moea64_slb_mutex);
 	for (i = 0; i < NVSIDS; i += VSID_NBPW) {
 		u_int	n;
 
 		/*
 		 * Create a new value by multiplying by a prime and adding in
 		 * entropy from the timebase register.  This is to make the
 		 * VSID more random so that the PT hash function collides
 		 * less often.  (Note that the prime causes gcc to do shifts
 		 * instead of a multiply.)
 		 */
 		moea64_vsidcontext = (moea64_vsidcontext * 0x1105) + entropy;
 		hash = moea64_vsidcontext & (NVSIDS - 1);
 		if (hash == 0)		/* 0 is special, avoid it */
 			continue;
 		n = hash >> 5;
 		mask = 1 << (hash & (VSID_NBPW - 1));
 		hash = (moea64_vsidcontext & VSID_HASHMASK);
 		if (moea64_vsid_bitmap[n] & mask) {	/* collision? */
 			/* anything free in this bucket? */
 			if (moea64_vsid_bitmap[n] == 0xffffffff) {
 				entropy = (moea64_vsidcontext >> 20);
 				continue;
 			}
 			i = ffs(~moea64_vsid_bitmap[n]) - 1;
 			mask = 1 << i;
 			hash &= VSID_HASHMASK & ~(VSID_NBPW - 1);
 			hash |= i;
 		}
 		KASSERT(!(moea64_vsid_bitmap[n] & mask),
 		    ("Allocating in-use VSID %#zx\n", hash));
 		moea64_vsid_bitmap[n] |= mask;
 		mtx_unlock(&moea64_slb_mutex);
 		return (hash);
 	}
 
 	mtx_unlock(&moea64_slb_mutex);
 	panic("%s: out of segments",__func__);
 }
 
 #ifdef __powerpc64__
 void
 moea64_pinit(mmu_t mmu, pmap_t pmap)
 {
 	PMAP_LOCK_INIT(pmap);
 
 	pmap->pm_slb_tree_root = slb_alloc_tree();
 	pmap->pm_slb = slb_alloc_user_cache();
 	pmap->pm_slb_len = 0;
 }
 #else
 void
 moea64_pinit(mmu_t mmu, pmap_t pmap)
 {
 	int	i;
 	uint32_t hash;
 
 	PMAP_LOCK_INIT(pmap);
 
 	if (pmap_bootstrapped)
 		pmap->pmap_phys = (pmap_t)moea64_kextract(mmu,
 		    (vm_offset_t)pmap);
 	else
 		pmap->pmap_phys = pmap;
 
 	/*
 	 * Allocate some segment registers for this pmap.
 	 */
 	hash = moea64_get_unique_vsid();
 
 	for (i = 0; i < 16; i++) 
 		pmap->pm_sr[i] = VSID_MAKE(i, hash);
 
 	KASSERT(pmap->pm_sr[0] != 0, ("moea64_pinit: pm_sr[0] = 0"));
 }
 #endif
 
 /*
  * Initialize the pmap associated with process 0.
  */
 void
 moea64_pinit0(mmu_t mmu, pmap_t pm)
 {
 	moea64_pinit(mmu, pm);
 	bzero(&pm->pm_stats, sizeof(pm->pm_stats));
 }
 
 /*
  * Set the physical protection on the specified range of this map as requested.
  */
 void
 moea64_protect(mmu_t mmu, pmap_t pm, vm_offset_t sva, vm_offset_t eva,
     vm_prot_t prot)
 {
 	struct	pvo_entry *pvo;
 	uintptr_t pt;
 
 	CTR4(KTR_PMAP, "moea64_protect: pm=%p sva=%#x eva=%#x prot=%#x", pm, sva,
 	    eva, prot);
 
 
 	KASSERT(pm == &curproc->p_vmspace->vm_pmap || pm == kernel_pmap,
 	    ("moea64_protect: non current pmap"));
 
 	if ((prot & VM_PROT_READ) == VM_PROT_NONE) {
 		moea64_remove(mmu, pm, sva, eva);
 		return;
 	}
 
 	vm_page_lock_queues();
 	PMAP_LOCK(pm);
 	for (; sva < eva; sva += PAGE_SIZE) {
 		pvo = moea64_pvo_find_va(pm, sva);
 		if (pvo == NULL)
 			continue;
 
 		/*
 		 * Grab the PTE pointer before we diddle with the cached PTE
 		 * copy.
 		 */
 		LOCK_TABLE();
 		pt = MOEA64_PVO_TO_PTE(mmu, pvo);
 
 		/*
 		 * Change the protection of the page.
 		 */
 		pvo->pvo_pte.lpte.pte_lo &= ~LPTE_PP;
 		pvo->pvo_pte.lpte.pte_lo |= LPTE_BR;
 		pvo->pvo_pte.lpte.pte_lo &= ~LPTE_NOEXEC;
 		if ((prot & VM_PROT_EXECUTE) == 0) 
 			pvo->pvo_pte.lpte.pte_lo |= LPTE_NOEXEC;
 
 		/*
 		 * If the PVO is in the page table, update that pte as well.
 		 */
 		if (pt != -1) {
 			MOEA64_PTE_CHANGE(mmu, pt, &pvo->pvo_pte.lpte,
 			    pvo->pvo_vpn);
 			if ((pvo->pvo_pte.lpte.pte_lo & 
 			    (LPTE_I | LPTE_G | LPTE_NOEXEC)) == 0) {
 				moea64_syncicache(mmu, pm, sva,
 				    pvo->pvo_pte.lpte.pte_lo & LPTE_RPGN,
 				    PAGE_SIZE);
 			}
 		}
 		UNLOCK_TABLE();
 	}
 	vm_page_unlock_queues();
 	PMAP_UNLOCK(pm);
 }
 
 /*
  * Map a list of wired pages into kernel virtual address space.  This is
  * intended for temporary mappings which do not need page modification or
  * references recorded.  Existing mappings in the region are overwritten.
  */
 void
 moea64_qenter(mmu_t mmu, vm_offset_t va, vm_page_t *m, int count)
 {
 	while (count-- > 0) {
 		moea64_kenter(mmu, va, VM_PAGE_TO_PHYS(*m));
 		va += PAGE_SIZE;
 		m++;
 	}
 }
 
 /*
  * Remove page mappings from kernel virtual address space.  Intended for
  * temporary mappings entered by moea64_qenter.
  */
 void
 moea64_qremove(mmu_t mmu, vm_offset_t va, int count)
 {
 	while (count-- > 0) {
 		moea64_kremove(mmu, va);
 		va += PAGE_SIZE;
 	}
 }
 
 void
 moea64_release_vsid(uint64_t vsid)
 {
 	int idx, mask;
 
 	mtx_lock(&moea64_slb_mutex);
 	idx = vsid & (NVSIDS-1);
 	mask = 1 << (idx % VSID_NBPW);
 	idx /= VSID_NBPW;
 	KASSERT(moea64_vsid_bitmap[idx] & mask,
 	    ("Freeing unallocated VSID %#jx", vsid));
 	moea64_vsid_bitmap[idx] &= ~mask;
 	mtx_unlock(&moea64_slb_mutex);
 }
 	
 
 void
 moea64_release(mmu_t mmu, pmap_t pmap)
 {
         
 	/*
 	 * Free segment registers' VSIDs
 	 */
     #ifdef __powerpc64__
 	slb_free_tree(pmap);
 	slb_free_user_cache(pmap->pm_slb);
     #else
 	KASSERT(pmap->pm_sr[0] != 0, ("moea64_release: pm_sr[0] = 0"));
 
 	moea64_release_vsid(VSID_TO_HASH(pmap->pm_sr[0]));
     #endif
 
 	PMAP_LOCK_DESTROY(pmap);
 }
 
 /*
  * Remove the given range of addresses from the specified map.
  */
 void
 moea64_remove(mmu_t mmu, pmap_t pm, vm_offset_t sva, vm_offset_t eva)
 {
 	struct	pvo_entry *pvo;
 
 	vm_page_lock_queues();
 	PMAP_LOCK(pm);
 	for (; sva < eva; sva += PAGE_SIZE) {
 		pvo = moea64_pvo_find_va(pm, sva);
 		if (pvo != NULL)
 			moea64_pvo_remove(mmu, pvo);
 	}
 	vm_page_unlock_queues();
 	PMAP_UNLOCK(pm);
 }
 
 /*
  * Remove physical page from all pmaps in which it resides. moea64_pvo_remove()
  * will reflect changes in pte's back to the vm_page.
  */
 void
 moea64_remove_all(mmu_t mmu, vm_page_t m)
 {
 	struct  pvo_head *pvo_head;
 	struct	pvo_entry *pvo, *next_pvo;
 	pmap_t	pmap;
 
 	vm_page_lock_queues();
 	pvo_head = vm_page_to_pvoh(m);
 	for (pvo = LIST_FIRST(pvo_head); pvo != NULL; pvo = next_pvo) {
 		next_pvo = LIST_NEXT(pvo, pvo_vlink);
 
 		pmap = pvo->pvo_pmap;
 		PMAP_LOCK(pmap);
 		moea64_pvo_remove(mmu, pvo);
 		PMAP_UNLOCK(pmap);
 	}
 	if ((m->flags & PG_WRITEABLE) && moea64_is_modified(mmu, m)) {
 		moea64_attr_clear(m, LPTE_CHG);
 		vm_page_dirty(m);
 	}
 	vm_page_flag_clear(m, PG_WRITEABLE);
 	vm_page_unlock_queues();
 }
 
 /*
  * Allocate a physical page of memory directly from the phys_avail map.
  * Can only be called from moea64_bootstrap before avail start and end are
  * calculated.
  */
 vm_offset_t
 moea64_bootstrap_alloc(vm_size_t size, u_int align)
 {
 	vm_offset_t	s, e;
 	int		i, j;
 
 	size = round_page(size);
 	for (i = 0; phys_avail[i + 1] != 0; i += 2) {
 		if (align != 0)
 			s = (phys_avail[i] + align - 1) & ~(align - 1);
 		else
 			s = phys_avail[i];
 		e = s + size;
 
 		if (s < phys_avail[i] || e > phys_avail[i + 1])
 			continue;
 
 		if (s + size > platform_real_maxaddr())
 			continue;
 
 		if (s == phys_avail[i]) {
 			phys_avail[i] += size;
 		} else if (e == phys_avail[i + 1]) {
 			phys_avail[i + 1] -= size;
 		} else {
 			for (j = phys_avail_count * 2; j > i; j -= 2) {
 				phys_avail[j] = phys_avail[j - 2];
 				phys_avail[j + 1] = phys_avail[j - 1];
 			}
 
 			phys_avail[i + 3] = phys_avail[i + 1];
 			phys_avail[i + 1] = s;
 			phys_avail[i + 2] = e;
 			phys_avail_count++;
 		}
 
 		return (s);
 	}
 	panic("moea64_bootstrap_alloc: could not allocate memory");
 }
 
 static int
 moea64_pvo_enter(mmu_t mmu, pmap_t pm, uma_zone_t zone,
     struct pvo_head *pvo_head, vm_offset_t va, vm_offset_t pa,
     uint64_t pte_lo, int flags)
 {
 	struct	 pvo_entry *pvo;
 	uint64_t vsid;
 	int	 first;
 	u_int	 ptegidx;
 	int	 i;
 	int      bootstrap;
 
 	/*
 	 * One nasty thing that can happen here is that the UMA calls to
 	 * allocate new PVOs need to map more memory, which calls pvo_enter(),
 	 * which calls UMA...
 	 *
 	 * We break the loop by detecting recursion and allocating out of
 	 * the bootstrap pool.
 	 */
 
 	first = 0;
 	bootstrap = (flags & PVO_BOOTSTRAP);
 
 	if (!moea64_initialized)
 		bootstrap = 1;
 
 	/*
 	 * Compute the PTE Group index.
 	 */
 	va &= ~ADDR_POFF;
 	vsid = va_to_vsid(pm, va);
 	ptegidx = va_to_pteg(vsid, va, flags & PVO_LARGE);
 
 	/*
 	 * Remove any existing mapping for this page.  Reuse the pvo entry if
 	 * there is a mapping.
 	 */
 	LOCK_TABLE();
 
 	moea64_pvo_enter_calls++;
 
 	LIST_FOREACH(pvo, &moea64_pvo_table[ptegidx], pvo_olink) {
 		if (pvo->pvo_pmap == pm && PVO_VADDR(pvo) == va) {
 			if ((pvo->pvo_pte.lpte.pte_lo & LPTE_RPGN) == pa &&
 			    (pvo->pvo_pte.lpte.pte_lo & (LPTE_NOEXEC | LPTE_PP))
 			    == (pte_lo & (LPTE_NOEXEC | LPTE_PP))) {
 			    	if (!(pvo->pvo_pte.lpte.pte_hi & LPTE_VALID)) {
 					/* Re-insert if spilled */
 					i = MOEA64_PTE_INSERT(mmu, ptegidx,
 					    &pvo->pvo_pte.lpte);
 					if (i >= 0)
 						PVO_PTEGIDX_SET(pvo, i);
 					moea64_pte_overflow--;
 				}
 				UNLOCK_TABLE();
 				return (0);
 			}
 			moea64_pvo_remove(mmu, pvo);
 			break;
 		}
 	}
 
 	/*
 	 * If we aren't overwriting a mapping, try to allocate.
 	 */
 	if (bootstrap) {
 		if (moea64_bpvo_pool_index >= BPVO_POOL_SIZE) {
 			panic("moea64_enter: bpvo pool exhausted, %d, %d, %zd",
 			      moea64_bpvo_pool_index, BPVO_POOL_SIZE, 
 			      BPVO_POOL_SIZE * sizeof(struct pvo_entry));
 		}
 		pvo = &moea64_bpvo_pool[moea64_bpvo_pool_index];
 		moea64_bpvo_pool_index++;
 		bootstrap = 1;
 	} else {
 		/*
 		 * Note: drop the table lock around the UMA allocation in
 		 * case the UMA allocator needs to manipulate the page
 		 * table. The mapping we are working with is already
 		 * protected by the PMAP lock.
 		 */
 		UNLOCK_TABLE();
 		pvo = uma_zalloc(zone, M_NOWAIT);
 		LOCK_TABLE();
 	}
 
 	if (pvo == NULL) {
 		UNLOCK_TABLE();
 		return (ENOMEM);
 	}
 
 	moea64_pvo_entries++;
 	pvo->pvo_vaddr = va;
 	pvo->pvo_vpn = (uint64_t)((va & ADDR_PIDX) >> ADDR_PIDX_SHFT)
 	    | (vsid << 16);
 	pvo->pvo_pmap = pm;
 	LIST_INSERT_HEAD(&moea64_pvo_table[ptegidx], pvo, pvo_olink);
 	pvo->pvo_vaddr &= ~ADDR_POFF;
 
 	if (flags & PVO_WIRED)
 		pvo->pvo_vaddr |= PVO_WIRED;
 	if (pvo_head != &moea64_pvo_kunmanaged)
 		pvo->pvo_vaddr |= PVO_MANAGED;
 	if (bootstrap)
 		pvo->pvo_vaddr |= PVO_BOOTSTRAP;
 	if (flags & PVO_FAKE)
 		pvo->pvo_vaddr |= PVO_FAKE;
 	if (flags & PVO_LARGE)
 		pvo->pvo_vaddr |= PVO_LARGE;
 
 	moea64_pte_create(&pvo->pvo_pte.lpte, vsid, va, 
 	    (uint64_t)(pa) | pte_lo, flags);
 
 	/*
 	 * Remember if the list was empty and therefore will be the first
 	 * item.
 	 */
 	if (LIST_FIRST(pvo_head) == NULL)
 		first = 1;
 	LIST_INSERT_HEAD(pvo_head, pvo, pvo_vlink);
 
 	if (pvo->pvo_vaddr & PVO_WIRED) {
 		pvo->pvo_pte.lpte.pte_hi |= LPTE_WIRED;
 		pm->pm_stats.wired_count++;
 	}
 	pm->pm_stats.resident_count++;
 
 	/*
 	 * We hope this succeeds but it isn't required.
 	 */
 	i = MOEA64_PTE_INSERT(mmu, ptegidx, &pvo->pvo_pte.lpte);
 	if (i >= 0) {
 		PVO_PTEGIDX_SET(pvo, i);
 	} else {
 		panic("moea64_pvo_enter: overflow");
 		moea64_pte_overflow++;
 	}
 
 	if (pm == kernel_pmap)
 		isync();
 
 	UNLOCK_TABLE();
 
 #ifdef __powerpc64__
 	/*
 	 * Make sure all our bootstrap mappings are in the SLB as soon
 	 * as virtual memory is switched on.
 	 */
 	if (!pmap_bootstrapped)
 		moea64_bootstrap_slb_prefault(va, flags & PVO_LARGE);
 #endif
 
 	return (first ? ENOENT : 0);
 }
 
 static void
 moea64_pvo_remove(mmu_t mmu, struct pvo_entry *pvo)
 {
 	uintptr_t pt;
 
 	/*
 	 * If there is an active pte entry, we need to deactivate it (and
 	 * save the ref & chg bits).
 	 */
 	LOCK_TABLE();
 	pt = MOEA64_PVO_TO_PTE(mmu, pvo);
 	if (pt != -1) {
 		MOEA64_PTE_UNSET(mmu, pt, &pvo->pvo_pte.lpte, pvo->pvo_vpn);
 		PVO_PTEGIDX_CLR(pvo);
 	} else {
 		moea64_pte_overflow--;
 	}
 
 	/*
 	 * Update our statistics.
 	 */
 	pvo->pvo_pmap->pm_stats.resident_count--;
 	if (pvo->pvo_vaddr & PVO_WIRED)
 		pvo->pvo_pmap->pm_stats.wired_count--;
 
 	/*
 	 * Save the REF/CHG bits into their cache if the page is managed.
 	 */
 	if ((pvo->pvo_vaddr & (PVO_MANAGED|PVO_FAKE)) == PVO_MANAGED) {
 		struct	vm_page *pg;
 
 		pg = PHYS_TO_VM_PAGE(pvo->pvo_pte.lpte.pte_lo & LPTE_RPGN);
 		if (pg != NULL) {
 			moea64_attr_save(pg, pvo->pvo_pte.lpte.pte_lo &
 			    (LPTE_REF | LPTE_CHG));
 		}
 	}
 
 	/*
 	 * Remove this PVO from the PV list.
 	 */
 	LIST_REMOVE(pvo, pvo_vlink);
 
 	/*
 	 * Remove this from the overflow list and return it to the pool
 	 * if we aren't going to reuse it.
 	 */
 	LIST_REMOVE(pvo, pvo_olink);
 
 	moea64_pvo_entries--;
 	moea64_pvo_remove_calls++;
 
 	UNLOCK_TABLE();
 
 	if (!(pvo->pvo_vaddr & PVO_BOOTSTRAP))
 		uma_zfree((pvo->pvo_vaddr & PVO_MANAGED) ? moea64_mpvo_zone :
 		    moea64_upvo_zone, pvo);
 }
 
 static struct pvo_entry *
 moea64_pvo_find_va(pmap_t pm, vm_offset_t va)
 {
 	struct		pvo_entry *pvo;
 	int		ptegidx;
 	uint64_t	vsid;
 	#ifdef __powerpc64__
 	uint64_t	slbv;
 
 	if (pm == kernel_pmap) {
 		slbv = kernel_va_to_slbv(va);
 	} else {
 		struct slb *slb;
 		slb = user_va_to_slb_entry(pm, va);
 		/* The page is not mapped if the segment isn't */
 		if (slb == NULL)
 			return NULL;
 		slbv = slb->slbv;
 	}
 
 	vsid = (slbv & SLBV_VSID_MASK) >> SLBV_VSID_SHIFT;
 	if (slbv & SLBV_L)
 		va &= ~moea64_large_page_mask;
 	else
 		va &= ~ADDR_POFF;
 	ptegidx = va_to_pteg(vsid, va, slbv & SLBV_L);
 	#else
 	va &= ~ADDR_POFF;
 	vsid = va_to_vsid(pm, va);
 	ptegidx = va_to_pteg(vsid, va, 0);
 	#endif
 
 	LOCK_TABLE();
 	LIST_FOREACH(pvo, &moea64_pvo_table[ptegidx], pvo_olink) {
 		if (pvo->pvo_pmap == pm && PVO_VADDR(pvo) == va)
 			break;
 	}
 	UNLOCK_TABLE();
 
 	return (pvo);
 }
 
 static boolean_t
 moea64_query_bit(mmu_t mmu, vm_page_t m, u_int64_t ptebit)
 {
 	struct	pvo_entry *pvo;
 	uintptr_t pt;
 
 	if (moea64_attr_fetch(m) & ptebit)
 		return (TRUE);
 
 	vm_page_lock_queues();
 
 	LIST_FOREACH(pvo, vm_page_to_pvoh(m), pvo_vlink) {
 
 		/*
 		 * See if we saved the bit off.  If so, cache it and return
 		 * success.
 		 */
 		if (pvo->pvo_pte.lpte.pte_lo & ptebit) {
 			moea64_attr_save(m, ptebit);
 			vm_page_unlock_queues();
 			return (TRUE);
 		}
 	}
 
 	/*
 	 * No luck, now go through the hard part of looking at the PTEs
 	 * themselves.  Sync so that any pending REF/CHG bits are flushed to
 	 * the PTEs.
 	 */
 	powerpc_sync();
 	LIST_FOREACH(pvo, vm_page_to_pvoh(m), pvo_vlink) {
 
 		/*
 		 * See if this pvo has a valid PTE.  If so, fetch the
 		 * REF/CHG bits from the valid PTE.  If the appropriate
 		 * ptebit is set, cache it and return success.
 		 */
 		LOCK_TABLE();
 		pt = MOEA64_PVO_TO_PTE(mmu, pvo);
 		if (pt != -1) {
 			MOEA64_PTE_SYNCH(mmu, pt, &pvo->pvo_pte.lpte);
 			if (pvo->pvo_pte.lpte.pte_lo & ptebit) {
 				UNLOCK_TABLE();
 
 				moea64_attr_save(m, ptebit);
 				vm_page_unlock_queues();
 				return (TRUE);
 			}
 		}
 		UNLOCK_TABLE();
 	}
 
 	vm_page_unlock_queues();
 	return (FALSE);
 }
 
 static u_int
 moea64_clear_bit(mmu_t mmu, vm_page_t m, u_int64_t ptebit)
 {
 	u_int	count;
 	struct	pvo_entry *pvo;
 	uintptr_t pt;
 
 	vm_page_lock_queues();
 
 	/*
 	 * Clear the cached value.
 	 */
 	moea64_attr_clear(m, ptebit);
 
 	/*
 	 * Sync so that any pending REF/CHG bits are flushed to the PTEs (so
 	 * we can reset the right ones).  Note that since the pvo entries and
 	 * list heads are accessed via BAT0 and are never placed in the page
 	 * table, we don't have to worry about further accesses setting the
 	 * REF/CHG bits.
 	 */
 	powerpc_sync();
 
 	/*
 	 * For each pvo entry, clear the pvo's ptebit.  If this pvo has a
 	 * valid pte clear the ptebit from the valid pte.
 	 */
 	count = 0;
 	LIST_FOREACH(pvo, vm_page_to_pvoh(m), pvo_vlink) {
 
 		LOCK_TABLE();
 		pt = MOEA64_PVO_TO_PTE(mmu, pvo);
 		if (pt != -1) {
 			MOEA64_PTE_SYNCH(mmu, pt, &pvo->pvo_pte.lpte);
 			if (pvo->pvo_pte.lpte.pte_lo & ptebit) {
 				count++;
 				MOEA64_PTE_CLEAR(mmu, pt, &pvo->pvo_pte.lpte,
 				    pvo->pvo_vpn, ptebit);
 			}
 		}
 		pvo->pvo_pte.lpte.pte_lo &= ~ptebit;
 		UNLOCK_TABLE();
 	}
 
 	vm_page_unlock_queues();
 	return (count);
 }
 
 boolean_t
 moea64_dev_direct_mapped(mmu_t mmu, vm_offset_t pa, vm_size_t size)
 {
 	struct pvo_entry *pvo;
 	vm_offset_t ppa;
 	int error = 0;
 
 	PMAP_LOCK(kernel_pmap);
 	for (ppa = pa & ~ADDR_POFF; ppa < pa + size; ppa += PAGE_SIZE) {
 		pvo = moea64_pvo_find_va(kernel_pmap, ppa);
 		if (pvo == NULL ||
 		    (pvo->pvo_pte.lpte.pte_lo & LPTE_RPGN) != ppa) {
 			error = EFAULT;
 			break;
 		}
 	}
 	PMAP_UNLOCK(kernel_pmap);
 
 	return (error);
 }
 
 /*
  * Map a set of physical memory pages into the kernel virtual
  * address space. Return a pointer to where it is mapped. This
  * routine is intended to be used for mapping device memory,
  * NOT real memory.
  */
 void *
 moea64_mapdev_attr(mmu_t mmu, vm_offset_t pa, vm_size_t size, vm_memattr_t ma)
 {
 	vm_offset_t va, tmpva, ppa, offset;
 
 	ppa = trunc_page(pa);
 	offset = pa & PAGE_MASK;
 	size = roundup(offset + size, PAGE_SIZE);
 
 	va = kmem_alloc_nofault(kernel_map, size);
 
 	if (!va)
 		panic("moea64_mapdev: Couldn't alloc kernel virtual memory");
 
 	for (tmpva = va; size > 0;) {
 		moea64_kenter_attr(mmu, tmpva, ppa, ma);
 		size -= PAGE_SIZE;
 		tmpva += PAGE_SIZE;
 		ppa += PAGE_SIZE;
 	}
 
 	return ((void *)(va + offset));
 }
 
 void *
 moea64_mapdev(mmu_t mmu, vm_offset_t pa, vm_size_t size)
 {
 
 	return moea64_mapdev_attr(mmu, pa, size, VM_MEMATTR_DEFAULT);
 }
 
 void
 moea64_unmapdev(mmu_t mmu, vm_offset_t va, vm_size_t size)
 {
 	vm_offset_t base, offset;
 
 	base = trunc_page(va);
 	offset = va & PAGE_MASK;
 	size = roundup(offset + size, PAGE_SIZE);
 
 	kmem_free(kernel_map, base, size);
 }
 
 void
 moea64_sync_icache(mmu_t mmu, pmap_t pm, vm_offset_t va, vm_size_t sz)
 {
 	struct pvo_entry *pvo;
 	vm_offset_t lim;
 	vm_paddr_t pa;
 	vm_size_t len;
 
 	PMAP_LOCK(pm);
 	while (sz > 0) {
 		lim = round_page(va);
 		len = MIN(lim - va, sz);
 		pvo = moea64_pvo_find_va(pm, va & ~ADDR_POFF);
 		if (pvo != NULL && !(pvo->pvo_pte.lpte.pte_lo & LPTE_I)) {
 			pa = (pvo->pvo_pte.lpte.pte_lo & LPTE_RPGN) |
 			    (va & ADDR_POFF);
 			moea64_syncicache(mmu, pm, va, pa, len);
 		}
 		va += len;
 		sz -= len;
 	}
 	PMAP_UNLOCK(pm);
 }
Index: head/sys/powerpc/booke/platform_bare.c
===================================================================
--- head/sys/powerpc/booke/platform_bare.c	(revision 222812)
+++ head/sys/powerpc/booke/platform_bare.c	(revision 222813)
@@ -1,316 +1,316 @@
 /*-
  * Copyright (c) 2008-2009 Semihalf, Rafal Jaworowski
  * All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
  *
  * 1. Redistributions of source code must retain the above copyright
  *    notice, this list of conditions and the following disclaimer.
  * 2. Redistributions in binary form must reproduce the above copyright
  *    notice, this list of conditions and the following disclaimer in the
  *    documentation and/or other materials provided with the distribution.
  *
  * THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR
  * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
  * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
  * IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT,
  * INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT
  * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
  * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
  * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
  * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF
  * THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
  */
 
 #include <sys/cdefs.h>
 __FBSDID("$FreeBSD$");
 
 #include <sys/param.h>
 #include <sys/systm.h>
 #include <sys/kernel.h>
 #include <sys/bus.h>
 #include <sys/pcpu.h>
 #include <sys/proc.h>
 #include <sys/smp.h>
 
 #include <machine/bus.h>
 #include <machine/cpu.h>
 #include <machine/hid.h>
 #include <machine/platform.h>
 #include <machine/platformvar.h>
 #include <machine/smp.h>
 #include <machine/spr.h>
 #include <machine/vmparam.h>
 
 #include <dev/fdt/fdt_common.h>
 #include <dev/ofw/ofw_bus.h>
 #include <dev/ofw/ofw_bus_subr.h>
 #include <dev/ofw/openfirm.h>
 
 #include <powerpc/mpc85xx/mpc85xx.h>
 
 #include "platform_if.h"
 
 #ifdef SMP
 extern void *ap_pcpu;
 extern uint8_t __boot_page[];		/* Boot page body */
 extern uint32_t kernload;		/* Kernel physical load address */
 #endif
 
 extern uint32_t *bootinfo;
 
 static int cpu, maxcpu;
 
 static int bare_probe(platform_t);
 static void bare_mem_regions(platform_t, struct mem_region **phys, int *physsz,
     struct mem_region **avail, int *availsz);
 static u_long bare_timebase_freq(platform_t, struct cpuref *cpuref);
 static int bare_smp_first_cpu(platform_t, struct cpuref *cpuref);
 static int bare_smp_next_cpu(platform_t, struct cpuref *cpuref);
 static int bare_smp_get_bsp(platform_t, struct cpuref *cpuref);
 static int bare_smp_start_cpu(platform_t, struct pcpu *cpu);
 
 static void e500_reset(platform_t);
 
 static platform_method_t bare_methods[] = {
 	PLATFORMMETHOD(platform_probe, 		bare_probe),
 	PLATFORMMETHOD(platform_mem_regions,	bare_mem_regions),
 	PLATFORMMETHOD(platform_timebase_freq,	bare_timebase_freq),
 
 	PLATFORMMETHOD(platform_smp_first_cpu,	bare_smp_first_cpu),
 	PLATFORMMETHOD(platform_smp_next_cpu,	bare_smp_next_cpu),
 	PLATFORMMETHOD(platform_smp_get_bsp,	bare_smp_get_bsp),
 	PLATFORMMETHOD(platform_smp_start_cpu,	bare_smp_start_cpu),
 
 	PLATFORMMETHOD(platform_reset,		e500_reset),
 
 	{ 0, 0 }
 };
 
 static platform_def_t bare_platform = {
 	"bare metal",
 	bare_methods,
 	0
 };
 
 PLATFORM_DEF(bare_platform);
 
 static int
 bare_probe(platform_t plat)
 {
 	uint32_t ver, sr;
 	int i, law_max, tgt;
 
 	ver = SVR_VER(mfspr(SPR_SVR));
 	switch (ver & ~0x0008) {	/* Mask Security Enabled bit */
 	case SVR_P4080:
 		maxcpu = 8;
 		break;
 	case SVR_P4040:
 		maxcpu = 4;
 		break;
 	case SVR_MPC8572:
 	case SVR_P1020:
 	case SVR_P2020:
 		maxcpu = 2;
 		break;
 	default:
 		maxcpu = 1;
 		break;
 	}
 
 	/*
 	 * Clear local access windows. Skip DRAM entries, so we don't shoot
 	 * ourselves in the foot.
 	 */
 	law_max = law_getmax();
 	for (i = 0; i < law_max; i++) {
 		sr = ccsr_read4(OCP85XX_LAWSR(i));
 		if ((sr & 0x80000000) == 0)
 			continue;
 		tgt = (sr & 0x01f00000) >> 20;
 		if (tgt == OCP85XX_TGTIF_RAM1 || tgt == OCP85XX_TGTIF_RAM2 ||
 		    tgt == OCP85XX_TGTIF_RAM_INTL)
 			continue;
 
 		ccsr_write4(OCP85XX_LAWSR(i), sr & 0x7fffffff);
 	}
 
 	return (BUS_PROBE_GENERIC);
 }
 
 #define MEM_REGIONS	8
 static struct mem_region avail_regions[MEM_REGIONS];
 
 void
 bare_mem_regions(platform_t plat, struct mem_region **phys, int *physsz,
     struct mem_region **avail, int *availsz)
 {
 	uint32_t memsize;
 	int i, rv;
 
 	rv = fdt_get_mem_regions(avail_regions, availsz, &memsize);
 
 	if (rv != 0)
 		return;
 
 	for (i = 0; i < *availsz; i++) {
 		if (avail_regions[i].mr_start < 1048576) {
 			avail_regions[i].mr_size =
 			    avail_regions[i].mr_size -
 			    (1048576 - avail_regions[i].mr_start);
 			avail_regions[i].mr_start = 1048576;
 		}
 	}
 	*avail = avail_regions;
 
 	/* On the bare metal platform phys == avail memory */
 	*physsz = *availsz;
 	*phys = *avail;
 }
 
 static u_long
 bare_timebase_freq(platform_t plat, struct cpuref *cpuref)
 {
 	u_long ticks;
 	phandle_t cpus, child;
 	pcell_t freq;
 
 	if (bootinfo != NULL) {
 		/* Backward compatibility. See 8-STABLE. */
 		ticks = bootinfo[3] >> 3;
 	} else
 		ticks = 0;
 
 	if ((cpus = OF_finddevice("/cpus")) == 0)
 		goto out;
 
 	if ((child = OF_child(cpus)) == 0)
 		goto out;
 
 	freq = 0;
 	if (OF_getprop(child, "bus-frequency", (void *)&freq,
 	    sizeof(freq)) <= 0)
 		goto out;
 
 	/*
 	 * Time Base and Decrementer are updated every 8 CCB bus clocks.
 	 * HID0[SEL_TBCLK] = 0
 	 */
 	if (freq != 0)
 		ticks = freq / 8;
 
 out:
 	if (ticks <= 0)
 		panic("Unable to determine timebase frequency!");
 
 	return (ticks);
 }
 
 static int
 bare_smp_first_cpu(platform_t plat, struct cpuref *cpuref)
 {
 
 	cpu = 0;
 	cpuref->cr_cpuid = cpu;
 	cpuref->cr_hwref = cpuref->cr_cpuid;
 	if (bootverbose)
 		printf("powerpc_smp_first_cpu: cpuid %d\n", cpuref->cr_cpuid);
 	cpu++;
 
 	return (0);
 }
 
 static int
 bare_smp_next_cpu(platform_t plat, struct cpuref *cpuref)
 {
 
 	if (cpu >= maxcpu)
 		return (ENOENT);
 
 	cpuref->cr_cpuid = cpu++;
 	cpuref->cr_hwref = cpuref->cr_cpuid;
 	if (bootverbose)
 		printf("powerpc_smp_next_cpu: cpuid %d\n", cpuref->cr_cpuid);
 
 	return (0);
 }
 
 static int
 bare_smp_get_bsp(platform_t plat, struct cpuref *cpuref)
 {
 
 	cpuref->cr_cpuid = mfspr(SPR_PIR);
 	cpuref->cr_hwref = cpuref->cr_cpuid;
 
 	return (0);
 }
 
 static int
 bare_smp_start_cpu(platform_t plat, struct pcpu *pc)
 {
 #ifdef SMP
 	uint32_t bptr, eebpcr;
 	int timeout;
 
 	eebpcr = ccsr_read4(OCP85XX_EEBPCR);
-	if ((eebpcr & (pc->pc_cpumask << 24)) != 0) {
+	if ((eebpcr & (1 << (pc->pc_cpuid + 24))) != 0) {
 		printf("%s: CPU=%d already out of hold-off state!\n",
 		    __func__, pc->pc_cpuid);
 		return (ENXIO);
 	}
 
 	ap_pcpu = pc;
 	__asm __volatile("msync; isync");
 
 	/*
 	 * Set BPTR to the physical address of the boot page
 	 */
 	bptr = ((uint32_t)__boot_page - KERNBASE) + kernload;
 	ccsr_write4(OCP85XX_BPTR, (bptr >> 12) | 0x80000000);
 
 	/*
 	 * Release AP from hold-off state
 	 */
-	eebpcr |= (pc->pc_cpumask << 24);
+	eebpcr |= (1 << (pc->pc_cpuid + 24));
 	ccsr_write4(OCP85XX_EEBPCR, eebpcr);
 	__asm __volatile("isync; msync");
 
 	timeout = 500;
 	while (!pc->pc_awake && timeout--)
 		DELAY(1000);	/* wait 1ms */
 
 	return ((pc->pc_awake) ? 0 : EBUSY);
 #else
 	/* No SMP support */
 	return (ENXIO);
 #endif
 }
 
 static void
 e500_reset(platform_t plat)
 {
 
 	/*
 	 * Try the dedicated reset register first.
 	 * If the SoC doesn't have one, we'll fall
 	 * back to using the debug control register.
 	 */
 	ccsr_write4(OCP85XX_RSTCR, 2);
 
 	/* Clear DBCR0, disables debug interrupts and events. */
 	mtspr(SPR_DBCR0, 0);
 	__asm __volatile("isync");
 
 	/* Enable Debug Interrupts in MSR. */
 	mtmsr(mfmsr() | PSL_DE);
 
 	/* Enable debug interrupts and issue reset. */
 	mtspr(SPR_DBCR0, mfspr(SPR_DBCR0) | DBCR0_IDM | DBCR0_RST_SYSTEM);
 
 	printf("Reset failed...\n");
 	while (1);
 }
 
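Note on the two changed lines in bare_smp_start_cpu() above: the EEBPCR bit for a core is now derived from pc_cpuid rather than by shifting the retired pc_cpumask field. The sketch below is only an illustration of that arithmetic; the helper names are hypothetical and not part of the tree, and it assumes the old pc_cpumask held the single bit (1 << pc_cpuid).

```c
#include <stdint.h>

/*
 * Hypothetical helper mirroring the "+" lines: the EEBPCR enable bit
 * for a given CPU id lives at bit position (cpuid + 24).
 */
static uint32_t
eebpcr_cpu_bit(unsigned int cpuid)
{

	return ((uint32_t)1 << (cpuid + 24));
}

/*
 * Hypothetical helper mirroring the "-" lines, under the assumption that
 * the old per-CPU mask was a single bit, (1 << cpuid).
 */
static uint32_t
eebpcr_cpu_bit_from_mask(uint32_t cpumask)
{

	return (cpumask << 24);
}
```

Under that assumption both forms yield the same register bit for CPU ids 0 through 7, which is the range bare_probe() can report.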
Index: head/sys/powerpc/booke/pmap.c
===================================================================
--- head/sys/powerpc/booke/pmap.c	(revision 222812)
+++ head/sys/powerpc/booke/pmap.c	(revision 222813)
@@ -1,3139 +1,3142 @@
 /*-
  * Copyright (C) 2007-2009 Semihalf, Rafal Jaworowski <raj@semihalf.com>
  * Copyright (C) 2006 Semihalf, Marian Balakowicz <m8@semihalf.com>
  * All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
  * 1. Redistributions of source code must retain the above copyright
  *    notice, this list of conditions and the following disclaimer.
  * 2. Redistributions in binary form must reproduce the above copyright
  *    notice, this list of conditions and the following disclaimer in the
  *    documentation and/or other materials provided with the distribution.
  *
  * THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR
  * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
  * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.  IN
  * NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
  * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED
  * TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
  * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
  * LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
  * NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
  * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
  *
  * Some hw specific parts of this pmap were derived or influenced
  * by NetBSD's ibm4xx pmap module. More generic code is shared with
  * a few other pmap modules from the FreeBSD tree.
  */
 
  /*
   * VM layout notes:
   *
   * Kernel and user threads run within one common virtual address space
   * defined by AS=0.
   *
   * Virtual address space layout:
   * -----------------------------
   * 0x0000_0000 - 0xafff_ffff	: user process
   * 0xb000_0000 - 0xbfff_ffff	: pmap_mapdev()-ed area (PCI/PCIE etc.)
   * 0xc000_0000 - 0xc0ff_ffff	: kernel reserved
   *   0xc000_0000 - data_end	: kernel code+data, env, metadata etc.
   * 0xc100_0000 - 0xfeef_ffff	: KVA
   *   0xc100_0000 - 0xc100_3fff : reserved for page zero/copy
   *   0xc100_4000 - 0xc200_3fff : reserved for ptbl bufs
   *   0xc200_4000 - 0xc200_8fff : guard page + kstack0
   *   0xc200_9000 - 0xfeef_ffff	: actual free KVA space
   * 0xfef0_0000 - 0xffff_ffff	: I/O devices region
   */
 
 #include <sys/cdefs.h>
 __FBSDID("$FreeBSD$");
 
 #include <sys/types.h>
 #include <sys/param.h>
 #include <sys/malloc.h>
 #include <sys/ktr.h>
 #include <sys/proc.h>
 #include <sys/user.h>
 #include <sys/queue.h>
 #include <sys/systm.h>
 #include <sys/kernel.h>
 #include <sys/msgbuf.h>
 #include <sys/lock.h>
 #include <sys/mutex.h>
+#include <sys/sched.h>
 #include <sys/smp.h>
 #include <sys/vmmeter.h>
 
 #include <vm/vm.h>
 #include <vm/vm_page.h>
 #include <vm/vm_kern.h>
 #include <vm/vm_pageout.h>
 #include <vm/vm_extern.h>
 #include <vm/vm_object.h>
 #include <vm/vm_param.h>
 #include <vm/vm_map.h>
 #include <vm/vm_pager.h>
 #include <vm/uma.h>
 
 #include <machine/cpu.h>
 #include <machine/pcb.h>
 #include <machine/platform.h>
 
 #include <machine/tlb.h>
 #include <machine/spr.h>
 #include <machine/vmparam.h>
 #include <machine/md_var.h>
 #include <machine/mmuvar.h>
 #include <machine/pmap.h>
 #include <machine/pte.h>
 
 #include "mmu_if.h"
 
 #ifdef  DEBUG
 #define debugf(fmt, args...) printf(fmt, ##args)
 #else
 #define debugf(fmt, args...)
 #endif
 
 #define TODO			panic("%s: not implemented", __func__);
 
 #include "opt_sched.h"
 #ifndef SCHED_4BSD
 #error "e500 only works with SCHED_4BSD which uses a global scheduler lock."
 #endif
 extern struct mtx sched_lock;
 
 extern int dumpsys_minidump;
 
 extern unsigned char _etext[];
 extern unsigned char _end[];
 
 /* Kernel physical load address. */
 extern uint32_t kernload;
 vm_offset_t kernstart;
 vm_size_t kernsize;
 
 /* Message buffer and tables. */
 static vm_offset_t data_start;
 static vm_size_t data_end;
 
 /* Phys/avail memory regions. */
 static struct mem_region *availmem_regions;
 static int availmem_regions_sz;
 static struct mem_region *physmem_regions;
 static int physmem_regions_sz;
 
 /* Reserved KVA space and mutex for mmu_booke_zero_page. */
 static vm_offset_t zero_page_va;
 static struct mtx zero_page_mutex;
 
 static struct mtx tlbivax_mutex;
 
 /*
  * Reserved KVA space for mmu_booke_zero_page_idle. This is used
  * by the idle thread only, so no lock is required.
  */
 static vm_offset_t zero_page_idle_va;
 
 /* Reserved KVA space and mutex for mmu_booke_copy_page. */
 static vm_offset_t copy_page_src_va;
 static vm_offset_t copy_page_dst_va;
 static struct mtx copy_page_mutex;
 
 /**************************************************************************/
 /* PMAP */
 /**************************************************************************/
 
 static void mmu_booke_enter_locked(mmu_t, pmap_t, vm_offset_t, vm_page_t,
     vm_prot_t, boolean_t);
 
 unsigned int kptbl_min;		/* Index of the first kernel ptbl. */
 unsigned int kernel_ptbls;	/* Number of KVA ptbls. */
 
 /*
  * If user pmap is processed with mmu_booke_remove and the resident count
  * drops to 0, there are no more pages to remove, so we need not continue.
  */
 #define PMAP_REMOVE_DONE(pmap) \
 	((pmap) != kernel_pmap && (pmap)->pm_stats.resident_count == 0)
 
 extern void tid_flush(tlbtid_t);
 
 /**************************************************************************/
 /* TLB and TID handling */
 /**************************************************************************/
 
 /* Translation ID busy table */
 static volatile pmap_t tidbusy[MAXCPU][TID_MAX + 1];
 
 /*
  * TLB0 capabilities (entry, way numbers etc.). These can vary between e500
  * core revisions and should be read from h/w registers during early config.
  */
 uint32_t tlb0_entries;
 uint32_t tlb0_ways;
 uint32_t tlb0_entries_per_way;
 
 #define TLB0_ENTRIES		(tlb0_entries)
 #define TLB0_WAYS		(tlb0_ways)
 #define TLB0_ENTRIES_PER_WAY	(tlb0_entries_per_way)
 
 #define TLB1_ENTRIES 16
 
 /* In-ram copy of the TLB1 */
 static tlb_entry_t tlb1[TLB1_ENTRIES];
 
 /* Next free entry in the TLB1 */
 static unsigned int tlb1_idx;
 
 static tlbtid_t tid_alloc(struct pmap *);
 
 static void tlb_print_entry(int, uint32_t, uint32_t, uint32_t, uint32_t);
 
 static int tlb1_set_entry(vm_offset_t, vm_offset_t, vm_size_t, uint32_t);
 static void tlb1_write_entry(unsigned int);
 static int tlb1_iomapped(int, vm_paddr_t, vm_size_t, vm_offset_t *);
 static vm_size_t tlb1_mapin_region(vm_offset_t, vm_offset_t, vm_size_t);
 
 static vm_size_t tsize2size(unsigned int);
 static unsigned int size2tsize(vm_size_t);
 static unsigned int ilog2(unsigned int);
 
 static void set_mas4_defaults(void);
 
 static inline void tlb0_flush_entry(vm_offset_t);
 static inline unsigned int tlb0_tableidx(vm_offset_t, unsigned int);
 
 /**************************************************************************/
 /* Page table management */
 /**************************************************************************/
 
 /* Data for the pv entry allocation mechanism */
 static uma_zone_t pvzone;
 static struct vm_object pvzone_obj;
 static int pv_entry_count = 0, pv_entry_max = 0, pv_entry_high_water = 0;
 
 #define PV_ENTRY_ZONE_MIN	2048	/* min pv entries in uma zone */
 
 #ifndef PMAP_SHPGPERPROC
 #define PMAP_SHPGPERPROC	200
 #endif
 
 static void ptbl_init(void);
 static struct ptbl_buf *ptbl_buf_alloc(void);
 static void ptbl_buf_free(struct ptbl_buf *);
 static void ptbl_free_pmap_ptbl(pmap_t, pte_t *);
 
 static pte_t *ptbl_alloc(mmu_t, pmap_t, unsigned int);
 static void ptbl_free(mmu_t, pmap_t, unsigned int);
 static void ptbl_hold(mmu_t, pmap_t, unsigned int);
 static int ptbl_unhold(mmu_t, pmap_t, unsigned int);
 
 static vm_paddr_t pte_vatopa(mmu_t, pmap_t, vm_offset_t);
 static pte_t *pte_find(mmu_t, pmap_t, vm_offset_t);
 static void pte_enter(mmu_t, pmap_t, vm_page_t, vm_offset_t, uint32_t);
 static int pte_remove(mmu_t, pmap_t, vm_offset_t, uint8_t);
 
 static pv_entry_t pv_alloc(void);
 static void pv_free(pv_entry_t);
 static void pv_insert(pmap_t, vm_offset_t, vm_page_t);
 static void pv_remove(pmap_t, vm_offset_t, vm_page_t);
 
 /* Number of kva ptbl buffers, each covering one ptbl (PTBL_PAGES). */
 #define PTBL_BUFS		(128 * 16)
 
 struct ptbl_buf {
 	TAILQ_ENTRY(ptbl_buf) link;	/* list link */
 	vm_offset_t kva;		/* va of mapping */
 };
 
 /* ptbl free list and a lock used for access synchronization. */
 static TAILQ_HEAD(, ptbl_buf) ptbl_buf_freelist;
 static struct mtx ptbl_buf_freelist_lock;
 
 /* Base address of kva space allocated for ptbl bufs. */
 static vm_offset_t ptbl_buf_pool_vabase;
 
 /* Pointer to ptbl_buf structures. */
 static struct ptbl_buf *ptbl_bufs;
 
 void pmap_bootstrap_ap(volatile uint32_t *);
 
 /*
  * Kernel MMU interface
  */
 static void		mmu_booke_change_wiring(mmu_t, pmap_t, vm_offset_t, boolean_t);
 static void		mmu_booke_clear_modify(mmu_t, vm_page_t);
 static void		mmu_booke_clear_reference(mmu_t, vm_page_t);
 static void		mmu_booke_copy(mmu_t, pmap_t, pmap_t, vm_offset_t,
     vm_size_t, vm_offset_t);
 static void		mmu_booke_copy_page(mmu_t, vm_page_t, vm_page_t);
 static void		mmu_booke_enter(mmu_t, pmap_t, vm_offset_t, vm_page_t,
     vm_prot_t, boolean_t);
 static void		mmu_booke_enter_object(mmu_t, pmap_t, vm_offset_t, vm_offset_t,
     vm_page_t, vm_prot_t);
 static void		mmu_booke_enter_quick(mmu_t, pmap_t, vm_offset_t, vm_page_t,
     vm_prot_t);
 static vm_paddr_t	mmu_booke_extract(mmu_t, pmap_t, vm_offset_t);
 static vm_page_t	mmu_booke_extract_and_hold(mmu_t, pmap_t, vm_offset_t,
     vm_prot_t);
 static void		mmu_booke_init(mmu_t);
 static boolean_t	mmu_booke_is_modified(mmu_t, vm_page_t);
 static boolean_t	mmu_booke_is_prefaultable(mmu_t, pmap_t, vm_offset_t);
 static boolean_t	mmu_booke_is_referenced(mmu_t, vm_page_t);
 static boolean_t	mmu_booke_ts_referenced(mmu_t, vm_page_t);
 static vm_offset_t	mmu_booke_map(mmu_t, vm_offset_t *, vm_offset_t, vm_offset_t,
     int);
 static int		mmu_booke_mincore(mmu_t, pmap_t, vm_offset_t,
     vm_paddr_t *);
 static void		mmu_booke_object_init_pt(mmu_t, pmap_t, vm_offset_t,
     vm_object_t, vm_pindex_t, vm_size_t);
 static boolean_t	mmu_booke_page_exists_quick(mmu_t, pmap_t, vm_page_t);
 static void		mmu_booke_page_init(mmu_t, vm_page_t);
 static int		mmu_booke_page_wired_mappings(mmu_t, vm_page_t);
 static void		mmu_booke_pinit(mmu_t, pmap_t);
 static void		mmu_booke_pinit0(mmu_t, pmap_t);
 static void		mmu_booke_protect(mmu_t, pmap_t, vm_offset_t, vm_offset_t,
     vm_prot_t);
 static void		mmu_booke_qenter(mmu_t, vm_offset_t, vm_page_t *, int);
 static void		mmu_booke_qremove(mmu_t, vm_offset_t, int);
 static void		mmu_booke_release(mmu_t, pmap_t);
 static void		mmu_booke_remove(mmu_t, pmap_t, vm_offset_t, vm_offset_t);
 static void		mmu_booke_remove_all(mmu_t, vm_page_t);
 static void		mmu_booke_remove_write(mmu_t, vm_page_t);
 static void		mmu_booke_zero_page(mmu_t, vm_page_t);
 static void		mmu_booke_zero_page_area(mmu_t, vm_page_t, int, int);
 static void		mmu_booke_zero_page_idle(mmu_t, vm_page_t);
 static void		mmu_booke_activate(mmu_t, struct thread *);
 static void		mmu_booke_deactivate(mmu_t, struct thread *);
 static void		mmu_booke_bootstrap(mmu_t, vm_offset_t, vm_offset_t);
 static void		*mmu_booke_mapdev(mmu_t, vm_offset_t, vm_size_t);
 static void		mmu_booke_unmapdev(mmu_t, vm_offset_t, vm_size_t);
 static vm_offset_t	mmu_booke_kextract(mmu_t, vm_offset_t);
 static void		mmu_booke_kenter(mmu_t, vm_offset_t, vm_offset_t);
 static void		mmu_booke_kremove(mmu_t, vm_offset_t);
 static boolean_t	mmu_booke_dev_direct_mapped(mmu_t, vm_offset_t, vm_size_t);
 static void		mmu_booke_sync_icache(mmu_t, pmap_t, vm_offset_t,
     vm_size_t);
 static vm_offset_t	mmu_booke_dumpsys_map(mmu_t, struct pmap_md *,
     vm_size_t, vm_size_t *);
 static void		mmu_booke_dumpsys_unmap(mmu_t, struct pmap_md *,
     vm_size_t, vm_offset_t);
 static struct pmap_md	*mmu_booke_scan_md(mmu_t, struct pmap_md *);
 
 static mmu_method_t mmu_booke_methods[] = {
 	/* pmap dispatcher interface */
 	MMUMETHOD(mmu_change_wiring,	mmu_booke_change_wiring),
 	MMUMETHOD(mmu_clear_modify,	mmu_booke_clear_modify),
 	MMUMETHOD(mmu_clear_reference,	mmu_booke_clear_reference),
 	MMUMETHOD(mmu_copy,		mmu_booke_copy),
 	MMUMETHOD(mmu_copy_page,	mmu_booke_copy_page),
 	MMUMETHOD(mmu_enter,		mmu_booke_enter),
 	MMUMETHOD(mmu_enter_object,	mmu_booke_enter_object),
 	MMUMETHOD(mmu_enter_quick,	mmu_booke_enter_quick),
 	MMUMETHOD(mmu_extract,		mmu_booke_extract),
 	MMUMETHOD(mmu_extract_and_hold,	mmu_booke_extract_and_hold),
 	MMUMETHOD(mmu_init,		mmu_booke_init),
 	MMUMETHOD(mmu_is_modified,	mmu_booke_is_modified),
 	MMUMETHOD(mmu_is_prefaultable,	mmu_booke_is_prefaultable),
 	MMUMETHOD(mmu_is_referenced,	mmu_booke_is_referenced),
 	MMUMETHOD(mmu_ts_referenced,	mmu_booke_ts_referenced),
 	MMUMETHOD(mmu_map,		mmu_booke_map),
 	MMUMETHOD(mmu_mincore,		mmu_booke_mincore),
 	MMUMETHOD(mmu_object_init_pt,	mmu_booke_object_init_pt),
 	MMUMETHOD(mmu_page_exists_quick,mmu_booke_page_exists_quick),
 	MMUMETHOD(mmu_page_init,	mmu_booke_page_init),
 	MMUMETHOD(mmu_page_wired_mappings, mmu_booke_page_wired_mappings),
 	MMUMETHOD(mmu_pinit,		mmu_booke_pinit),
 	MMUMETHOD(mmu_pinit0,		mmu_booke_pinit0),
 	MMUMETHOD(mmu_protect,		mmu_booke_protect),
 	MMUMETHOD(mmu_qenter,		mmu_booke_qenter),
 	MMUMETHOD(mmu_qremove,		mmu_booke_qremove),
 	MMUMETHOD(mmu_release,		mmu_booke_release),
 	MMUMETHOD(mmu_remove,		mmu_booke_remove),
 	MMUMETHOD(mmu_remove_all,	mmu_booke_remove_all),
 	MMUMETHOD(mmu_remove_write,	mmu_booke_remove_write),
 	MMUMETHOD(mmu_sync_icache,	mmu_booke_sync_icache),
 	MMUMETHOD(mmu_zero_page,	mmu_booke_zero_page),
 	MMUMETHOD(mmu_zero_page_area,	mmu_booke_zero_page_area),
 	MMUMETHOD(mmu_zero_page_idle,	mmu_booke_zero_page_idle),
 	MMUMETHOD(mmu_activate,		mmu_booke_activate),
 	MMUMETHOD(mmu_deactivate,	mmu_booke_deactivate),
 
 	/* Internal interfaces */
 	MMUMETHOD(mmu_bootstrap,	mmu_booke_bootstrap),
 	MMUMETHOD(mmu_dev_direct_mapped,mmu_booke_dev_direct_mapped),
 	MMUMETHOD(mmu_mapdev,		mmu_booke_mapdev),
 	MMUMETHOD(mmu_kenter,		mmu_booke_kenter),
 	MMUMETHOD(mmu_kextract,		mmu_booke_kextract),
 /*	MMUMETHOD(mmu_kremove,		mmu_booke_kremove),	*/
 	MMUMETHOD(mmu_unmapdev,		mmu_booke_unmapdev),
 
 	/* dumpsys() support */
 	MMUMETHOD(mmu_dumpsys_map,	mmu_booke_dumpsys_map),
 	MMUMETHOD(mmu_dumpsys_unmap,	mmu_booke_dumpsys_unmap),
 	MMUMETHOD(mmu_scan_md,		mmu_booke_scan_md),
 
 	{ 0, 0 }
 };
 
 MMU_DEF(booke_mmu, MMU_TYPE_BOOKE, mmu_booke_methods, 0);
 
 static inline void
 tlb_miss_lock(void)
 {
 #ifdef SMP
 	struct pcpu *pc;
 
 	if (!smp_started)
 		return;
 
 	STAILQ_FOREACH(pc, &cpuhead, pc_allcpu) {
 		if (pc != pcpup) {
 
 			CTR3(KTR_PMAP, "%s: tlb miss LOCK of CPU=%d, "
 			    "tlb_lock=%p", __func__, pc->pc_cpuid, pc->pc_booke_tlb_lock);
 
 			KASSERT((pc->pc_cpuid != PCPU_GET(cpuid)),
 			    ("tlb_miss_lock: tried to lock self"));
 
 			tlb_lock(pc->pc_booke_tlb_lock);
 
 			CTR1(KTR_PMAP, "%s: locked", __func__);
 		}
 	}
 #endif
 }
 
 static inline void
 tlb_miss_unlock(void)
 {
 #ifdef SMP
 	struct pcpu *pc;
 
 	if (!smp_started)
 		return;
 
 	STAILQ_FOREACH(pc, &cpuhead, pc_allcpu) {
 		if (pc != pcpup) {
 			CTR2(KTR_PMAP, "%s: tlb miss UNLOCK of CPU=%d",
 			    __func__, pc->pc_cpuid);
 
 			tlb_unlock(pc->pc_booke_tlb_lock);
 
 			CTR1(KTR_PMAP, "%s: unlocked", __func__);
 		}
 	}
 #endif
 }
 
 /* Read TLB0 size and associativity from TLB0CFG. */
 static __inline void
 tlb0_get_tlbconf(void)
 {
 	uint32_t tlb0_cfg;
 
 	tlb0_cfg = mfspr(SPR_TLB0CFG);
 	tlb0_entries = tlb0_cfg & TLBCFG_NENTRY_MASK;
 	tlb0_ways = (tlb0_cfg & TLBCFG_ASSOC_MASK) >> TLBCFG_ASSOC_SHIFT;
 	tlb0_entries_per_way = tlb0_entries / tlb0_ways;
 }
 
 /* Initialize pool of kva ptbl buffers. */
 static void
 ptbl_init(void)
 {
 	int i;
 
 	CTR3(KTR_PMAP, "%s: s (ptbl_bufs = 0x%08x size 0x%08x)", __func__,
 	    (uint32_t)ptbl_bufs, sizeof(struct ptbl_buf) * PTBL_BUFS);
 	CTR3(KTR_PMAP, "%s: s (ptbl_buf_pool_vabase = 0x%08x size = 0x%08x)",
 	    __func__, ptbl_buf_pool_vabase, PTBL_BUFS * PTBL_PAGES * PAGE_SIZE);
 
 	mtx_init(&ptbl_buf_freelist_lock, "ptbl bufs lock", NULL, MTX_DEF);
 	TAILQ_INIT(&ptbl_buf_freelist);
 
 	for (i = 0; i < PTBL_BUFS; i++) {
 		ptbl_bufs[i].kva = ptbl_buf_pool_vabase + i * PTBL_PAGES * PAGE_SIZE;
 		TAILQ_INSERT_TAIL(&ptbl_buf_freelist, &ptbl_bufs[i], link);
 	}
 }
 
 /* Get a ptbl_buf from the freelist. */
 static struct ptbl_buf *
 ptbl_buf_alloc(void)
 {
 	struct ptbl_buf *buf;
 
 	mtx_lock(&ptbl_buf_freelist_lock);
 	buf = TAILQ_FIRST(&ptbl_buf_freelist);
 	if (buf != NULL)
 		TAILQ_REMOVE(&ptbl_buf_freelist, buf, link);
 	mtx_unlock(&ptbl_buf_freelist_lock);
 
 	CTR2(KTR_PMAP, "%s: buf = %p", __func__, buf);
 
 	return (buf);
 }
 
 /* Return ptbl buf to free pool. */
 static void
 ptbl_buf_free(struct ptbl_buf *buf)
 {
 
 	CTR2(KTR_PMAP, "%s: buf = %p", __func__, buf);
 
 	mtx_lock(&ptbl_buf_freelist_lock);
 	TAILQ_INSERT_TAIL(&ptbl_buf_freelist, buf, link);
 	mtx_unlock(&ptbl_buf_freelist_lock);
 }
 
 /*
  * Search the pmap's list of allocated ptbl bufs for the given ptbl and free it.
  */
 static void
 ptbl_free_pmap_ptbl(pmap_t pmap, pte_t *ptbl)
 {
 	struct ptbl_buf *pbuf;
 
 	CTR2(KTR_PMAP, "%s: ptbl = %p", __func__, ptbl);
 
 	PMAP_LOCK_ASSERT(pmap, MA_OWNED);
 
 	TAILQ_FOREACH(pbuf, &pmap->pm_ptbl_list, link)
 		if (pbuf->kva == (vm_offset_t)ptbl) {
 			/* Remove from pmap ptbl buf list. */
 			TAILQ_REMOVE(&pmap->pm_ptbl_list, pbuf, link);
 
 			/* Free corresponding ptbl buf. */
 			ptbl_buf_free(pbuf);
 			break;
 		}
 }
 
 /* Allocate page table. */
 static pte_t *
 ptbl_alloc(mmu_t mmu, pmap_t pmap, unsigned int pdir_idx)
 {
 	vm_page_t mtbl[PTBL_PAGES];
 	vm_page_t m;
 	struct ptbl_buf *pbuf;
 	unsigned int pidx;
 	pte_t *ptbl;
 	int i;
 
 	CTR4(KTR_PMAP, "%s: pmap = %p su = %d pdir_idx = %d", __func__, pmap,
 	    (pmap == kernel_pmap), pdir_idx);
 
 	KASSERT((pdir_idx <= (VM_MAXUSER_ADDRESS / PDIR_SIZE)),
 	    ("ptbl_alloc: invalid pdir_idx"));
 	KASSERT((pmap->pm_pdir[pdir_idx] == NULL),
 	    ("pte_alloc: valid ptbl entry exists!"));
 
 	pbuf = ptbl_buf_alloc();
 	if (pbuf == NULL)
 		panic("pte_alloc: couldn't alloc kernel virtual memory");
 		
 	ptbl = (pte_t *)pbuf->kva;
 
 	CTR2(KTR_PMAP, "%s: ptbl kva = %p", __func__, ptbl);
 
 	/* Allocate ptbl pages, this will sleep! */
 	for (i = 0; i < PTBL_PAGES; i++) {
 		pidx = (PTBL_PAGES * pdir_idx) + i;
 		while ((m = vm_page_alloc(NULL, pidx,
 		    VM_ALLOC_NOOBJ | VM_ALLOC_WIRED)) == NULL) {
 
 			PMAP_UNLOCK(pmap);
 			vm_page_unlock_queues();
 			VM_WAIT;
 			vm_page_lock_queues();
 			PMAP_LOCK(pmap);
 		}
 		mtbl[i] = m;
 	}
 
 	/* Map allocated pages into kernel_pmap. */
 	mmu_booke_qenter(mmu, (vm_offset_t)ptbl, mtbl, PTBL_PAGES);
 
 	/* Zero whole ptbl. */
 	bzero((caddr_t)ptbl, PTBL_PAGES * PAGE_SIZE);
 
 	/* Add pbuf to the pmap ptbl bufs list. */
 	TAILQ_INSERT_TAIL(&pmap->pm_ptbl_list, pbuf, link);
 
 	return (ptbl);
 }
 
 /* Free ptbl pages and invalidate pdir entry. */
 static void
 ptbl_free(mmu_t mmu, pmap_t pmap, unsigned int pdir_idx)
 {
 	pte_t *ptbl;
 	vm_paddr_t pa;
 	vm_offset_t va;
 	vm_page_t m;
 	int i;
 
 	CTR4(KTR_PMAP, "%s: pmap = %p su = %d pdir_idx = %d", __func__, pmap,
 	    (pmap == kernel_pmap), pdir_idx);
 
 	KASSERT((pdir_idx <= (VM_MAXUSER_ADDRESS / PDIR_SIZE)),
 	    ("ptbl_free: invalid pdir_idx"));
 
 	ptbl = pmap->pm_pdir[pdir_idx];
 
 	CTR2(KTR_PMAP, "%s: ptbl = %p", __func__, ptbl);
 
 	KASSERT((ptbl != NULL), ("ptbl_free: null ptbl"));
 
 	/*
 	 * Invalidate the pdir entry as soon as possible, so that other CPUs
 	 * don't attempt to look up the page tables we are releasing.
 	 */
 	mtx_lock_spin(&tlbivax_mutex);
 	tlb_miss_lock();
 	
 	pmap->pm_pdir[pdir_idx] = NULL;
 
 	tlb_miss_unlock();
 	mtx_unlock_spin(&tlbivax_mutex);
 
 	for (i = 0; i < PTBL_PAGES; i++) {
 		va = ((vm_offset_t)ptbl + (i * PAGE_SIZE));
 		pa = pte_vatopa(mmu, kernel_pmap, va);
 		m = PHYS_TO_VM_PAGE(pa);
 		vm_page_free_zero(m);
 		atomic_subtract_int(&cnt.v_wire_count, 1);
 		mmu_booke_kremove(mmu, va);
 	}
 
 	ptbl_free_pmap_ptbl(pmap, ptbl);
 }
 
 /*
  * Decrement ptbl pages hold count and attempt to free ptbl pages.
  * Called when removing pte entry from ptbl.
  *
  * Return 1 if ptbl pages were freed.
  */
 static int
 ptbl_unhold(mmu_t mmu, pmap_t pmap, unsigned int pdir_idx)
 {
 	pte_t *ptbl;
 	vm_paddr_t pa;
 	vm_page_t m;
 	int i;
 
 	CTR4(KTR_PMAP, "%s: pmap = %p su = %d pdir_idx = %d", __func__, pmap,
 	    (pmap == kernel_pmap), pdir_idx);
 
 	KASSERT((pdir_idx <= (VM_MAXUSER_ADDRESS / PDIR_SIZE)),
 	    ("ptbl_unhold: invalid pdir_idx"));
 	KASSERT((pmap != kernel_pmap),
 	    ("ptbl_unhold: unholding kernel ptbl!"));
 
 	ptbl = pmap->pm_pdir[pdir_idx];
 
 	//debugf("ptbl_unhold: ptbl = 0x%08x\n", (u_int32_t)ptbl);
 	KASSERT(((vm_offset_t)ptbl >= VM_MIN_KERNEL_ADDRESS),
 	    ("ptbl_unhold: non kva ptbl"));
 
 	/* decrement hold count */
 	for (i = 0; i < PTBL_PAGES; i++) {
 		pa = pte_vatopa(mmu, kernel_pmap,
 		    (vm_offset_t)ptbl + (i * PAGE_SIZE));
 		m = PHYS_TO_VM_PAGE(pa);
 		m->wire_count--;
 	}
 
 	/*
 	 * Free ptbl pages if there are no pte entries in this ptbl.
 	 * wire_count has the same value for all ptbl pages, so check the last
 	 * page.
 	 */
 	if (m->wire_count == 0) {
 		ptbl_free(mmu, pmap, pdir_idx);
 
 		//debugf("ptbl_unhold: e (freed ptbl)\n");
 		return (1);
 	}
 
 	return (0);
 }
 
 /*
  * Increment hold count for ptbl pages. This routine is used when a new pte
  * entry is being inserted into the ptbl.
  */
 static void
 ptbl_hold(mmu_t mmu, pmap_t pmap, unsigned int pdir_idx)
 {
 	vm_paddr_t pa;
 	pte_t *ptbl;
 	vm_page_t m;
 	int i;
 
 	CTR3(KTR_PMAP, "%s: pmap = %p pdir_idx = %d", __func__, pmap,
 	    pdir_idx);
 
 	KASSERT((pdir_idx <= (VM_MAXUSER_ADDRESS / PDIR_SIZE)),
 	    ("ptbl_hold: invalid pdir_idx"));
 	KASSERT((pmap != kernel_pmap),
 	    ("ptbl_hold: holding kernel ptbl!"));
 
 	ptbl = pmap->pm_pdir[pdir_idx];
 
 	KASSERT((ptbl != NULL), ("ptbl_hold: null ptbl"));
 
 	for (i = 0; i < PTBL_PAGES; i++) {
 		pa = pte_vatopa(mmu, kernel_pmap,
 		    (vm_offset_t)ptbl + (i * PAGE_SIZE));
 		m = PHYS_TO_VM_PAGE(pa);
 		m->wire_count++;
 	}
 }
 
 /* Allocate pv_entry structure. */
 pv_entry_t
 pv_alloc(void)
 {
 	pv_entry_t pv;
 
 	pv_entry_count++;
 	if (pv_entry_count > pv_entry_high_water)
 		pagedaemon_wakeup();
 	pv = uma_zalloc(pvzone, M_NOWAIT);
 
 	return (pv);
 }
 
 /* Free pv_entry structure. */
 static __inline void
 pv_free(pv_entry_t pve)
 {
 
 	pv_entry_count--;
 	uma_zfree(pvzone, pve);
 }
 
 
 /* Allocate and initialize pv_entry structure. */
 static void
 pv_insert(pmap_t pmap, vm_offset_t va, vm_page_t m)
 {
 	pv_entry_t pve;
 
 	//int su = (pmap == kernel_pmap);
 	//debugf("pv_insert: s (su = %d pmap = 0x%08x va = 0x%08x m = 0x%08x)\n", su,
 	//	(u_int32_t)pmap, va, (u_int32_t)m);
 
 	pve = pv_alloc();
 	if (pve == NULL)
 		panic("pv_insert: no pv entries!");
 
 	pve->pv_pmap = pmap;
 	pve->pv_va = va;
 
 	/* add to pv_list */
 	PMAP_LOCK_ASSERT(pmap, MA_OWNED);
 	mtx_assert(&vm_page_queue_mtx, MA_OWNED);
 
 	TAILQ_INSERT_TAIL(&m->md.pv_list, pve, pv_link);
 
 	//debugf("pv_insert: e\n");
 }
 
 /* Destroy pv entry. */
 static void
 pv_remove(pmap_t pmap, vm_offset_t va, vm_page_t m)
 {
 	pv_entry_t pve;
 
 	//int su = (pmap == kernel_pmap);
 	//debugf("pv_remove: s (su = %d pmap = 0x%08x va = 0x%08x)\n", su, (u_int32_t)pmap, va);
 
 	PMAP_LOCK_ASSERT(pmap, MA_OWNED);
 	mtx_assert(&vm_page_queue_mtx, MA_OWNED);
 
 	/* find pv entry */
 	TAILQ_FOREACH(pve, &m->md.pv_list, pv_link) {
 		if ((pmap == pve->pv_pmap) && (va == pve->pv_va)) {
 			/* remove from pv_list */
 			TAILQ_REMOVE(&m->md.pv_list, pve, pv_link);
 			if (TAILQ_EMPTY(&m->md.pv_list))
 				vm_page_flag_clear(m, PG_WRITEABLE);
 
 			/* free pv entry struct */
 			pv_free(pve);
 			break;
 		}
 	}
 
 	//debugf("pv_remove: e\n");
 }
 
 /*
  * Clean pte entry, try to free page table page if requested.
  *
  * Return 1 if ptbl pages were freed, otherwise return 0.
  */
 static int
 pte_remove(mmu_t mmu, pmap_t pmap, vm_offset_t va, uint8_t flags)
 {
 	unsigned int pdir_idx = PDIR_IDX(va);
 	unsigned int ptbl_idx = PTBL_IDX(va);
 	vm_page_t m;
 	pte_t *ptbl;
 	pte_t *pte;
 
 	//int su = (pmap == kernel_pmap);
 	//debugf("pte_remove: s (su = %d pmap = 0x%08x va = 0x%08x flags = %d)\n",
 	//		su, (u_int32_t)pmap, va, flags);
 
 	ptbl = pmap->pm_pdir[pdir_idx];
 	KASSERT(ptbl, ("pte_remove: null ptbl"));
 
 	pte = &ptbl[ptbl_idx];
 
 	if (pte == NULL || !PTE_ISVALID(pte))
 		return (0);
 
 	if (PTE_ISWIRED(pte))
 		pmap->pm_stats.wired_count--;
 
 	/* Handle managed entry. */
 	if (PTE_ISMANAGED(pte)) {
 		/* Get vm_page_t for mapped pte. */
 		m = PHYS_TO_VM_PAGE(PTE_PA(pte));
 
 		if (PTE_ISMODIFIED(pte))
 			vm_page_dirty(m);
 
 		if (PTE_ISREFERENCED(pte))
 			vm_page_flag_set(m, PG_REFERENCED);
 
 		pv_remove(pmap, va, m);
 	}
 
 	mtx_lock_spin(&tlbivax_mutex);
 	tlb_miss_lock();
 
 	tlb0_flush_entry(va);
 	pte->flags = 0;
 	pte->rpn = 0;
 
 	tlb_miss_unlock();
 	mtx_unlock_spin(&tlbivax_mutex);
 
 	pmap->pm_stats.resident_count--;
 
 	if (flags & PTBL_UNHOLD) {
 		//debugf("pte_remove: e (unhold)\n");
 		return (ptbl_unhold(mmu, pmap, pdir_idx));
 	}
 
 	//debugf("pte_remove: e\n");
 	return (0);
 }
 
 /*
  * Insert PTE for a given page and virtual address.
  */
 static void
 pte_enter(mmu_t mmu, pmap_t pmap, vm_page_t m, vm_offset_t va, uint32_t flags)
 {
 	unsigned int pdir_idx = PDIR_IDX(va);
 	unsigned int ptbl_idx = PTBL_IDX(va);
 	pte_t *ptbl, *pte;
 
 	CTR4(KTR_PMAP, "%s: su = %d pmap = %p va = %p", __func__,
 	    pmap == kernel_pmap, pmap, va);
 
 	/* Get the page table pointer. */
 	ptbl = pmap->pm_pdir[pdir_idx];
 
 	if (ptbl == NULL) {
 		/* Allocate page table pages. */
 		ptbl = ptbl_alloc(mmu, pmap, pdir_idx);
 	} else {
 		/*
 		 * Check if there is valid mapping for requested
 		 * va, if there is, remove it.
 		 */
 		pte = &pmap->pm_pdir[pdir_idx][ptbl_idx];
 		if (PTE_ISVALID(pte)) {
 			pte_remove(mmu, pmap, va, PTBL_HOLD);
 		} else {
 			/*
 			 * pte is not used, increment hold count
 			 * for ptbl pages.
 			 */
 			if (pmap != kernel_pmap)
 				ptbl_hold(mmu, pmap, pdir_idx);
 		}
 	}
 
 	/*
 	 * Insert pv_entry into pv_list for mapped page if part of managed
 	 * memory.
 	 */
         if ((m->flags & PG_FICTITIOUS) == 0) {
 		if ((m->flags & PG_UNMANAGED) == 0) {
 			flags |= PTE_MANAGED;
 
 			/* Create and insert pv entry. */
 			pv_insert(pmap, va, m);
 		}
 	}
 
 	pmap->pm_stats.resident_count++;
 	
 	mtx_lock_spin(&tlbivax_mutex);
 	tlb_miss_lock();
 
 	tlb0_flush_entry(va);
 	if (pmap->pm_pdir[pdir_idx] == NULL) {
 		/*
 		 * If we just allocated a new page table, hook it in
 		 * the pdir.
 		 */
 		pmap->pm_pdir[pdir_idx] = ptbl;
 	}
 	pte = &(pmap->pm_pdir[pdir_idx][ptbl_idx]);
 	pte->rpn = VM_PAGE_TO_PHYS(m) & ~PTE_PA_MASK;
 	pte->flags |= (PTE_VALID | flags);
 
 	tlb_miss_unlock();
 	mtx_unlock_spin(&tlbivax_mutex);
 }
 
 /* Return the pa for the given pmap/va. */
 static vm_paddr_t
 pte_vatopa(mmu_t mmu, pmap_t pmap, vm_offset_t va)
 {
 	vm_paddr_t pa = 0;
 	pte_t *pte;
 
 	pte = pte_find(mmu, pmap, va);
 	if ((pte != NULL) && PTE_ISVALID(pte))
 		pa = (PTE_PA(pte) | (va & PTE_PA_MASK));
 	return (pa);
 }
 
 /* Get a pointer to a PTE in a page table. */
 static pte_t *
 pte_find(mmu_t mmu, pmap_t pmap, vm_offset_t va)
 {
 	unsigned int pdir_idx = PDIR_IDX(va);
 	unsigned int ptbl_idx = PTBL_IDX(va);
 
 	KASSERT((pmap != NULL), ("pte_find: invalid pmap"));
 
 	if (pmap->pm_pdir[pdir_idx])
 		return (&(pmap->pm_pdir[pdir_idx][ptbl_idx]));
 
 	return (NULL);
 }
 
 /**************************************************************************/
 /* PMAP related */
 /**************************************************************************/
 
 /*
  * This is called during booke_init, before the system is really initialized.
  */
 static void
 mmu_booke_bootstrap(mmu_t mmu, vm_offset_t start, vm_offset_t kernelend)
 {
 	vm_offset_t phys_kernelend;
 	struct mem_region *mp, *mp1;
 	int cnt, i, j;
 	u_int s, e, sz;
 	u_int phys_avail_count;
 	vm_size_t physsz, hwphyssz, kstack0_sz;
 	vm_offset_t kernel_pdir, kstack0, va;
 	vm_paddr_t kstack0_phys;
 	void *dpcpu;
 	pte_t *pte;
 
 	debugf("mmu_booke_bootstrap: entered\n");
 
 	/* Initialize invalidation mutex */
 	mtx_init(&tlbivax_mutex, "tlbivax", NULL, MTX_SPIN);
 
 	/* Read TLB0 size and associativity. */
 	tlb0_get_tlbconf();
 
 	/* Align kernel start and end address (kernel image). */
 	kernstart = trunc_page(start);
 	data_start = round_page(kernelend);
 	kernsize = data_start - kernstart;
 
 	data_end = data_start;
 
 	/* Allocate space for the message buffer. */
 	msgbufp = (struct msgbuf *)data_end;
 	data_end += msgbufsize;
 	debugf(" msgbufp at 0x%08x end = 0x%08x\n", (uint32_t)msgbufp,
 	    data_end);
 
 	data_end = round_page(data_end);
 
 	/* Allocate the dynamic per-cpu area. */
 	dpcpu = (void *)data_end;
 	data_end += DPCPU_SIZE;
 	dpcpu_init(dpcpu, 0);
 
 	/* Allocate space for ptbl_bufs. */
 	ptbl_bufs = (struct ptbl_buf *)data_end;
 	data_end += sizeof(struct ptbl_buf) * PTBL_BUFS;
 	debugf(" ptbl_bufs at 0x%08x end = 0x%08x\n", (uint32_t)ptbl_bufs,
 	    data_end);
 
 	data_end = round_page(data_end);
 
 	/* Allocate PTE tables for kernel KVA. */
 	kernel_pdir = data_end;
 	kernel_ptbls = (VM_MAX_KERNEL_ADDRESS - VM_MIN_KERNEL_ADDRESS +
 	    PDIR_SIZE - 1) / PDIR_SIZE;
 	data_end += kernel_ptbls * PTBL_PAGES * PAGE_SIZE;
 	debugf(" kernel ptbls: %d\n", kernel_ptbls);
 	debugf(" kernel pdir at 0x%08x end = 0x%08x\n", kernel_pdir, data_end);
 
 	debugf(" data_end: 0x%08x\n", data_end);
 	if (data_end - kernstart > 0x1000000) {
 		data_end = (data_end + 0x3fffff) & ~0x3fffff;
 		tlb1_mapin_region(kernstart + 0x1000000,
 		    kernload + 0x1000000, data_end - kernstart - 0x1000000);
 	} else
 		data_end = (data_end + 0xffffff) & ~0xffffff;
 
 	debugf(" updated data_end: 0x%08x\n", data_end);
 
 	kernsize += data_end - data_start;
 
 	/*
 	 * Clear the structures - note we can only do it safely after the
 	 * possible additional TLB1 translations are in place (above) so that
 	 * all range up to the currently calculated 'data_end' is covered.
 	 */
 	memset((void *)ptbl_bufs, 0, sizeof(struct ptbl_buf) * PTBL_SIZE);
 	memset((void *)kernel_pdir, 0, kernel_ptbls * PTBL_PAGES * PAGE_SIZE);
 
 	/*******************************************************/
 	/* Set the start and end of kva. */
 	/*******************************************************/
 	virtual_avail = round_page(data_end);
 	virtual_end = VM_MAX_KERNEL_ADDRESS;
 
 	/* Allocate KVA space for page zero/copy operations. */
 	zero_page_va = virtual_avail;
 	virtual_avail += PAGE_SIZE;
 	zero_page_idle_va = virtual_avail;
 	virtual_avail += PAGE_SIZE;
 	copy_page_src_va = virtual_avail;
 	virtual_avail += PAGE_SIZE;
 	copy_page_dst_va = virtual_avail;
 	virtual_avail += PAGE_SIZE;
 	debugf("zero_page_va = 0x%08x\n", zero_page_va);
 	debugf("zero_page_idle_va = 0x%08x\n", zero_page_idle_va);
 	debugf("copy_page_src_va = 0x%08x\n", copy_page_src_va);
 	debugf("copy_page_dst_va = 0x%08x\n", copy_page_dst_va);
 
 	/* Initialize page zero/copy mutexes. */
 	mtx_init(&zero_page_mutex, "mmu_booke_zero_page", NULL, MTX_DEF);
 	mtx_init(&copy_page_mutex, "mmu_booke_copy_page", NULL, MTX_DEF);
 
 	/* Allocate KVA space for ptbl bufs. */
 	ptbl_buf_pool_vabase = virtual_avail;
 	virtual_avail += PTBL_BUFS * PTBL_PAGES * PAGE_SIZE;
 	debugf("ptbl_buf_pool_vabase = 0x%08x end = 0x%08x\n",
 	    ptbl_buf_pool_vabase, virtual_avail);
 
 	/* Calculate corresponding physical addresses for the kernel region. */
 	phys_kernelend = kernload + kernsize;
 	debugf("kernel image and allocated data:\n");
 	debugf(" kernload    = 0x%08x\n", kernload);
 	debugf(" kernstart   = 0x%08x\n", kernstart);
 	debugf(" kernsize    = 0x%08x\n", kernsize);
 
 	if (sizeof(phys_avail) / sizeof(phys_avail[0]) < availmem_regions_sz)
 		panic("mmu_booke_bootstrap: phys_avail too small");
 
 	/*
 	 * Remove kernel physical address range from avail regions list. Page
 	 * align all regions.  Non-page aligned memory isn't very interesting
 	 * to us.  Also, sort the entries for ascending addresses.
 	 */
 
 	/* Retrieve phys/avail mem regions */
 	mem_regions(&physmem_regions, &physmem_regions_sz,
 	    &availmem_regions, &availmem_regions_sz);
 	sz = 0;
 	cnt = availmem_regions_sz;
 	debugf("processing avail regions:\n");
 	for (mp = availmem_regions; mp->mr_size; mp++) {
 		s = mp->mr_start;
 		e = mp->mr_start + mp->mr_size;
 		debugf(" %08x-%08x -> ", s, e);
 		/* Check whether this region holds all of the kernel. */
 		if (s < kernload && e > phys_kernelend) {
 			availmem_regions[cnt].mr_start = phys_kernelend;
 			availmem_regions[cnt++].mr_size = e - phys_kernelend;
 			e = kernload;
 		}
 		/* Look whether this regions starts within the kernel. */
 		if (s >= kernload && s < phys_kernelend) {
 			if (e <= phys_kernelend)
 				goto empty;
 			s = phys_kernelend;
 		}
 		/* Now look whether this region ends within the kernel. */
 		if (e > kernload && e <= phys_kernelend) {
 			if (s >= kernload)
 				goto empty;
 			e = kernload;
 		}
 		/* Now page align the start and size of the region. */
 		s = round_page(s);
 		e = trunc_page(e);
 		if (e < s)
 			e = s;
 		sz = e - s;
 		debugf("%08x-%08x = %x\n", s, e, sz);
 
 		/* Check whether some memory is left here. */
 		if (sz == 0) {
 		empty:
 			memmove(mp, mp + 1,
 			    (cnt - (mp - availmem_regions)) * sizeof(*mp));
 			cnt--;
 			mp--;
 			continue;
 		}
 
 		/* Do an insertion sort. */
 		for (mp1 = availmem_regions; mp1 < mp; mp1++)
 			if (s < mp1->mr_start)
 				break;
 		if (mp1 < mp) {
 			memmove(mp1 + 1, mp1, (char *)mp - (char *)mp1);
 			mp1->mr_start = s;
 			mp1->mr_size = sz;
 		} else {
 			mp->mr_start = s;
 			mp->mr_size = sz;
 		}
 	}
 	availmem_regions_sz = cnt;
 
 	/*******************************************************/
 	/* Steal physical memory for kernel stack from the end */
 	/* of the first avail region                           */
 	/*******************************************************/
 	kstack0_sz = KSTACK_PAGES * PAGE_SIZE;
 	kstack0_phys = availmem_regions[0].mr_start +
 	    availmem_regions[0].mr_size;
 	kstack0_phys -= kstack0_sz;
 	availmem_regions[0].mr_size -= kstack0_sz;
 
 	/*******************************************************/
 	/* Fill in phys_avail table, based on availmem_regions */
 	/*******************************************************/
 	phys_avail_count = 0;
 	physsz = 0;
 	hwphyssz = 0;
 	TUNABLE_ULONG_FETCH("hw.physmem", (u_long *) &hwphyssz);
 
 	debugf("fill in phys_avail:\n");
 	for (i = 0, j = 0; i < availmem_regions_sz; i++, j += 2) {
 
 		debugf(" region: 0x%08x - 0x%08x (0x%08x)\n",
 		    availmem_regions[i].mr_start,
 		    availmem_regions[i].mr_start +
 		        availmem_regions[i].mr_size,
 		    availmem_regions[i].mr_size);
 
 		if (hwphyssz != 0 &&
 		    (physsz + availmem_regions[i].mr_size) >= hwphyssz) {
 			debugf(" hw.physmem adjust\n");
 			if (physsz < hwphyssz) {
 				phys_avail[j] = availmem_regions[i].mr_start;
 				phys_avail[j + 1] =
 				    availmem_regions[i].mr_start +
 				    hwphyssz - physsz;
 				physsz = hwphyssz;
 				phys_avail_count++;
 			}
 			break;
 		}
 
 		phys_avail[j] = availmem_regions[i].mr_start;
 		phys_avail[j + 1] = availmem_regions[i].mr_start +
 		    availmem_regions[i].mr_size;
 		phys_avail_count++;
 		physsz += availmem_regions[i].mr_size;
 	}
 	physmem = btoc(physsz);
 
 	/* Calculate the last available physical address. */
 	for (i = 0; phys_avail[i + 2] != 0; i += 2)
 		;
 	Maxmem = powerpc_btop(phys_avail[i + 1]);
 
 	debugf("Maxmem = 0x%08lx\n", Maxmem);
 	debugf("phys_avail_count = %d\n", phys_avail_count);
 	debugf("physsz = 0x%08x physmem = %ld (0x%08lx)\n", physsz, physmem,
 	    physmem);
 
 	/*******************************************************/
 	/* Initialize (statically allocated) kernel pmap. */
 	/*******************************************************/
 	PMAP_LOCK_INIT(kernel_pmap);
 	kptbl_min = VM_MIN_KERNEL_ADDRESS / PDIR_SIZE;
 
 	debugf("kernel_pmap = 0x%08x\n", (uint32_t)kernel_pmap);
 	debugf("kptbl_min = %d, kernel_ptbls = %d\n", kptbl_min, kernel_ptbls);
 	debugf("kernel pdir range: 0x%08x - 0x%08x\n",
 	    kptbl_min * PDIR_SIZE, (kptbl_min + kernel_ptbls) * PDIR_SIZE - 1);
 
 	/* Initialize kernel pdir */
 	for (i = 0; i < kernel_ptbls; i++)
 		kernel_pmap->pm_pdir[kptbl_min + i] =
 		    (pte_t *)(kernel_pdir + (i * PAGE_SIZE * PTBL_PAGES));
 
 	for (i = 0; i < MAXCPU; i++) {
 		kernel_pmap->pm_tid[i] = TID_KERNEL;
 		
 		/* Initialize each CPU's tidbusy entry 0 with kernel_pmap */
 		tidbusy[i][0] = kernel_pmap;
 	}
 
 	/*
 	 * Fill in PTEs covering kernel code and data. They are not required
 	 * for address translation, as this area is covered by static TLB1
 	 * entries, but for pte_vatopa() to work correctly with kernel area
 	 * addresses.
 	 */
 	for (va = KERNBASE; va < data_end; va += PAGE_SIZE) {
 		pte = &(kernel_pmap->pm_pdir[PDIR_IDX(va)][PTBL_IDX(va)]);
 		pte->rpn = kernload + (va - KERNBASE);
 		pte->flags = PTE_M | PTE_SR | PTE_SW | PTE_SX | PTE_WIRED |
 		    PTE_VALID;
 	}
 	/* Mark kernel_pmap active on all CPUs */
-	kernel_pmap->pm_active = ~0;
+	CPU_FILL(&kernel_pmap->pm_active);
 
 	/*******************************************************/
 	/* Final setup */
 	/*******************************************************/
 
 	/* Enter kstack0 into kernel map, provide guard page */
 	kstack0 = virtual_avail + KSTACK_GUARD_PAGES * PAGE_SIZE;
 	thread0.td_kstack = kstack0;
 	thread0.td_kstack_pages = KSTACK_PAGES;
 
 	debugf("kstack_sz = 0x%08x\n", kstack0_sz);
 	debugf("kstack0_phys at 0x%08x - 0x%08x\n",
 	    kstack0_phys, kstack0_phys + kstack0_sz);
 	debugf("kstack0 at 0x%08x - 0x%08x\n", kstack0, kstack0 + kstack0_sz);
 	
 	virtual_avail += KSTACK_GUARD_PAGES * PAGE_SIZE + kstack0_sz;
 	for (i = 0; i < KSTACK_PAGES; i++) {
 		mmu_booke_kenter(mmu, kstack0, kstack0_phys);
 		kstack0 += PAGE_SIZE;
 		kstack0_phys += PAGE_SIZE;
 	}
 	
 	debugf("virtual_avail = %08x\n", virtual_avail);
 	debugf("virtual_end   = %08x\n", virtual_end);
 
 	debugf("mmu_booke_bootstrap: exit\n");
 }
 
 void
 pmap_bootstrap_ap(volatile uint32_t *trcp __unused)
 {
 	int i;
 
 	/*
 	 * Finish TLB1 configuration: the BSP already set up its TLB1 and we
 	 * have the snapshot of its contents in the s/w tlb1[] table, so use
 	 * these values directly to (re)program AP's TLB1 hardware.
 	 */
 	for (i = 0; i < tlb1_idx; i ++) {
 		/* Skip invalid entries */
 		if (!(tlb1[i].mas1 & MAS1_VALID))
 			continue;
 
 		tlb1_write_entry(i);
 	}
 
 	set_mas4_defaults();
 }
 
 /*
  * Get the physical page address for the given pmap/virtual address.
  */
 static vm_paddr_t
 mmu_booke_extract(mmu_t mmu, pmap_t pmap, vm_offset_t va)
 {
 	vm_paddr_t pa;
 
 	PMAP_LOCK(pmap);
 	pa = pte_vatopa(mmu, pmap, va);
 	PMAP_UNLOCK(pmap);
 
 	return (pa);
 }
 
 /*
  * Extract the physical page address associated with the given
  * kernel virtual address.
  */
 static vm_paddr_t
 mmu_booke_kextract(mmu_t mmu, vm_offset_t va)
 {
 
 	return (pte_vatopa(mmu, kernel_pmap, va));
 }
 
 /*
  * Initialize the pmap module.
  * Called by vm_init, to initialize any structures that the pmap
  * system needs to map virtual memory.
  */
 static void
 mmu_booke_init(mmu_t mmu)
 {
 	int shpgperproc = PMAP_SHPGPERPROC;
 
 	/*
 	 * Initialize the address space (zone) for the pv entries.  Set a
 	 * high water mark so that the system can recover from excessive
 	 * numbers of pv entries.
 	 */
 	pvzone = uma_zcreate("PV ENTRY", sizeof(struct pv_entry), NULL, NULL,
 	    NULL, NULL, UMA_ALIGN_PTR, UMA_ZONE_VM | UMA_ZONE_NOFREE);
 
 	TUNABLE_INT_FETCH("vm.pmap.shpgperproc", &shpgperproc);
 	pv_entry_max = shpgperproc * maxproc + cnt.v_page_count;
 
 	TUNABLE_INT_FETCH("vm.pmap.pv_entries", &pv_entry_max);
 	pv_entry_high_water = 9 * (pv_entry_max / 10);
 
 	uma_zone_set_obj(pvzone, &pvzone_obj, pv_entry_max);
 
 	/* Pre-fill pvzone with initial number of pv entries. */
 	uma_prealloc(pvzone, PV_ENTRY_ZONE_MIN);
 
 	/* Initialize ptbl allocation. */
 	ptbl_init();
 }
 
 /*
  * Map a list of wired pages into kernel virtual address space.  This is
  * intended for temporary mappings which do not need page modification or
  * references recorded.  Existing mappings in the region are overwritten.
  */
 static void
 mmu_booke_qenter(mmu_t mmu, vm_offset_t sva, vm_page_t *m, int count)
 {
 	vm_offset_t va;
 
 	va = sva;
 	while (count-- > 0) {
 		mmu_booke_kenter(mmu, va, VM_PAGE_TO_PHYS(*m));
 		va += PAGE_SIZE;
 		m++;
 	}
 }
 
 /*
  * Remove page mappings from kernel virtual address space.  Intended for
  * temporary mappings entered by mmu_booke_qenter.
  */
 static void
 mmu_booke_qremove(mmu_t mmu, vm_offset_t sva, int count)
 {
 	vm_offset_t va;
 
 	va = sva;
 	while (count-- > 0) {
 		mmu_booke_kremove(mmu, va);
 		va += PAGE_SIZE;
 	}
 }
 
 /*
  * Map a wired page into kernel virtual address space.
  */
 static void
 mmu_booke_kenter(mmu_t mmu, vm_offset_t va, vm_offset_t pa)
 {
 	unsigned int pdir_idx = PDIR_IDX(va);
 	unsigned int ptbl_idx = PTBL_IDX(va);
 	uint32_t flags;
 	pte_t *pte;
 
 	KASSERT(((va >= VM_MIN_KERNEL_ADDRESS) &&
 	    (va <= VM_MAX_KERNEL_ADDRESS)), ("mmu_booke_kenter: invalid va"));
 
 	flags = 0;
 	flags |= (PTE_SR | PTE_SW | PTE_SX | PTE_WIRED | PTE_VALID);
 	flags |= PTE_M;
 
 	pte = &(kernel_pmap->pm_pdir[pdir_idx][ptbl_idx]);
 
 	mtx_lock_spin(&tlbivax_mutex);
 	tlb_miss_lock();
 	
 	if (PTE_ISVALID(pte)) {
 	
 		CTR1(KTR_PMAP, "%s: replacing entry!", __func__);
 
 		/* Flush entry from TLB0 */
 		tlb0_flush_entry(va);
 	}
 
 	pte->rpn = pa & ~PTE_PA_MASK;
 	pte->flags = flags;
 
 	//debugf("mmu_booke_kenter: pdir_idx = %d ptbl_idx = %d va=0x%08x "
 	//		"pa=0x%08x rpn=0x%08x flags=0x%08x\n",
 	//		pdir_idx, ptbl_idx, va, pa, pte->rpn, pte->flags);
 
 	/* Flush the real memory from the instruction cache. */
 	if ((flags & (PTE_I | PTE_G)) == 0) {
 		__syncicache((void *)va, PAGE_SIZE);
 	}
 
 	tlb_miss_unlock();
 	mtx_unlock_spin(&tlbivax_mutex);
 }
 
 /*
  * Remove a page from kernel page table.
  */
 static void
 mmu_booke_kremove(mmu_t mmu, vm_offset_t va)
 {
 	unsigned int pdir_idx = PDIR_IDX(va);
 	unsigned int ptbl_idx = PTBL_IDX(va);
 	pte_t *pte;
 
 //	CTR2(KTR_PMAP,("%s: s (va = 0x%08x)\n", __func__, va));
 
 	KASSERT(((va >= VM_MIN_KERNEL_ADDRESS) &&
 	    (va <= VM_MAX_KERNEL_ADDRESS)),
 	    ("mmu_booke_kremove: invalid va"));
 
 	pte = &(kernel_pmap->pm_pdir[pdir_idx][ptbl_idx]);
 
 	if (!PTE_ISVALID(pte)) {
 	
 		CTR1(KTR_PMAP, "%s: invalid pte", __func__);
 
 		return;
 	}
 
 	mtx_lock_spin(&tlbivax_mutex);
 	tlb_miss_lock();
 
 	/* Invalidate entry in TLB0, update PTE. */
 	tlb0_flush_entry(va);
 	pte->flags = 0;
 	pte->rpn = 0;
 
 	tlb_miss_unlock();
 	mtx_unlock_spin(&tlbivax_mutex);
 }
 
 /*
  * Initialize pmap associated with process 0.
  */
 static void
 mmu_booke_pinit0(mmu_t mmu, pmap_t pmap)
 {
 
 	mmu_booke_pinit(mmu, pmap);
 	PCPU_SET(curpmap, pmap);
 }
 
 /*
  * Initialize a preallocated and zeroed pmap structure,
  * such as one in a vmspace structure.
  */
 static void
 mmu_booke_pinit(mmu_t mmu, pmap_t pmap)
 {
 	int i;
 
 	CTR4(KTR_PMAP, "%s: pmap = %p, proc %d '%s'", __func__, pmap,
 	    curthread->td_proc->p_pid, curthread->td_proc->p_comm);
 
 	KASSERT((pmap != kernel_pmap), ("pmap_pinit: initializing kernel_pmap"));
 
 	PMAP_LOCK_INIT(pmap);
 	for (i = 0; i < MAXCPU; i++)
 		pmap->pm_tid[i] = TID_NONE;
-	pmap->pm_active = 0;
+	CPU_ZERO(&pmap->pm_active);
 	bzero(&pmap->pm_stats, sizeof(pmap->pm_stats));
 	bzero(&pmap->pm_pdir, sizeof(pte_t *) * PDIR_NENTRIES);
 	TAILQ_INIT(&pmap->pm_ptbl_list);
 }
 
 /*
  * Release any resources held by the given physical map.
  * Called when a pmap initialized by mmu_booke_pinit is being released.
  * Should only be called if the map contains no valid mappings.
  */
 static void
 mmu_booke_release(mmu_t mmu, pmap_t pmap)
 {
 
 	KASSERT(pmap->pm_stats.resident_count == 0,
 	    ("pmap_release: pmap resident count %ld != 0",
 	    pmap->pm_stats.resident_count));
 
 	PMAP_LOCK_DESTROY(pmap);
 }
 
 /*
  * Insert the given physical page at the specified virtual address in the
  * target physical map with the protection requested. If specified the page
  * will be wired down.
  */
 static void
 mmu_booke_enter(mmu_t mmu, pmap_t pmap, vm_offset_t va, vm_page_t m,
     vm_prot_t prot, boolean_t wired)
 {
 
 	vm_page_lock_queues();
 	PMAP_LOCK(pmap);
 	mmu_booke_enter_locked(mmu, pmap, va, m, prot, wired);
 	vm_page_unlock_queues();
 	PMAP_UNLOCK(pmap);
 }
 
 static void
 mmu_booke_enter_locked(mmu_t mmu, pmap_t pmap, vm_offset_t va, vm_page_t m,
     vm_prot_t prot, boolean_t wired)
 {
 	pte_t *pte;
 	vm_paddr_t pa;
 	uint32_t flags;
 	int su, sync;
 
 	pa = VM_PAGE_TO_PHYS(m);
 	su = (pmap == kernel_pmap);
 	sync = 0;
 
 	//debugf("mmu_booke_enter_locked: s (pmap=0x%08x su=%d tid=%d m=0x%08x va=0x%08x "
 	//		"pa=0x%08x prot=0x%08x wired=%d)\n",
 	//		(u_int32_t)pmap, su, pmap->pm_tid,
 	//		(u_int32_t)m, va, pa, prot, wired);
 
 	if (su) {
 		KASSERT(((va >= virtual_avail) &&
 		    (va <= VM_MAX_KERNEL_ADDRESS)),
 		    ("mmu_booke_enter_locked: kernel pmap, non kernel va"));
 	} else {
 		KASSERT((va <= VM_MAXUSER_ADDRESS),
 		    ("mmu_booke_enter_locked: user pmap, non user va"));
 	}
 	KASSERT((m->flags & (PG_FICTITIOUS | PG_UNMANAGED)) != 0 ||
 	    (m->oflags & VPO_BUSY) != 0 || VM_OBJECT_LOCKED(m->object),
 	    ("mmu_booke_enter_locked: page %p is not busy", m));
 
 	PMAP_LOCK_ASSERT(pmap, MA_OWNED);
 
 	/*
 	 * If there is an existing mapping, and the physical address has not
 	 * changed, must be protection or wiring change.
 	 */
 	if (((pte = pte_find(mmu, pmap, va)) != NULL) &&
 	    (PTE_ISVALID(pte)) && (PTE_PA(pte) == pa)) {
 	    
 		/*
 		 * Before actually updating pte->flags we calculate and
 		 * prepare its new value in a helper var.
 		 */
 		flags = pte->flags;
 		flags &= ~(PTE_UW | PTE_UX | PTE_SW | PTE_SX | PTE_MODIFIED);
 
 		/* Wiring change, just update stats. */
 		if (wired) {
 			if (!PTE_ISWIRED(pte)) {
 				flags |= PTE_WIRED;
 				pmap->pm_stats.wired_count++;
 			}
 		} else {
 			if (PTE_ISWIRED(pte)) {
 				flags &= ~PTE_WIRED;
 				pmap->pm_stats.wired_count--;
 			}
 		}
 
 		if (prot & VM_PROT_WRITE) {
 			/* Add write permissions. */
 			flags |= PTE_SW;
 			if (!su)
 				flags |= PTE_UW;
 
 			if ((flags & PTE_MANAGED) != 0)
 				vm_page_flag_set(m, PG_WRITEABLE);
 		} else {
 			/* Handle modified pages, sense modify status. */
 
 			/*
 			 * The PTE_MODIFIED flag could be set by underlying
 			 * TLB misses since we last read it (above), possibly
 			 * other CPUs could update it so we check in the PTE
 			 * directly rather than rely on that saved local flags
 			 * copy.
 			 */
 			if (PTE_ISMODIFIED(pte))
 				vm_page_dirty(m);
 		}
 
 		if (prot & VM_PROT_EXECUTE) {
 			flags |= PTE_SX;
 			if (!su)
 				flags |= PTE_UX;
 
 			/*
 			 * Check existing flags for execute permissions: if we
 			 * are turning execute permissions on, icache should
 			 * be flushed.
 			 */
 			if ((pte->flags & (PTE_UX | PTE_SX)) == 0)
 				sync++;
 		}
 
 		flags &= ~PTE_REFERENCED;
 
 		/*
 		 * The new flags value is all calculated -- only now actually
 		 * update the PTE.
 		 */
 		mtx_lock_spin(&tlbivax_mutex);
 		tlb_miss_lock();
 
 		tlb0_flush_entry(va);
 		pte->flags = flags;
 
 		tlb_miss_unlock();
 		mtx_unlock_spin(&tlbivax_mutex);
 
 	} else {
 		/*
 		 * If there is an existing mapping, but it's for a different
 		 * physical address, pte_enter() will delete the old mapping.
 		 */
 		//if ((pte != NULL) && PTE_ISVALID(pte))
 		//	debugf("mmu_booke_enter_locked: replace\n");
 		//else
 		//	debugf("mmu_booke_enter_locked: new\n");
 
 		/* Now set up the flags and install the new mapping. */
 		flags = (PTE_SR | PTE_VALID);
 		flags |= PTE_M;
 
 		if (!su)
 			flags |= PTE_UR;
 
 		if (prot & VM_PROT_WRITE) {
 			flags |= PTE_SW;
 			if (!su)
 				flags |= PTE_UW;
 
 			if ((m->flags & (PG_FICTITIOUS | PG_UNMANAGED)) == 0)
 				vm_page_flag_set(m, PG_WRITEABLE);
 		}
 
 		if (prot & VM_PROT_EXECUTE) {
 			flags |= PTE_SX;
 			if (!su)
 				flags |= PTE_UX;
 		}
 
 		/* If its wired update stats. */
 		if (wired) {
 			pmap->pm_stats.wired_count++;
 			flags |= PTE_WIRED;
 		}
 
 		pte_enter(mmu, pmap, m, va, flags);
 
 		/* Flush the real memory from the instruction cache. */
 		if (prot & VM_PROT_EXECUTE)
 			sync++;
 	}
 
 	if (sync && (su || pmap == PCPU_GET(curpmap))) {
 		__syncicache((void *)va, PAGE_SIZE);
 		sync = 0;
 	}
 }
 
 /*
  * Maps a sequence of resident pages belonging to the same object.
  * The sequence begins with the given page m_start.  This page is
  * mapped at the given virtual address start.  Each subsequent page is
  * mapped at a virtual address that is offset from start by the same
  * amount as the page is offset from m_start within the object.  The
  * last page in the sequence is the page with the largest offset from
  * m_start that can be mapped at a virtual address less than the given
  * virtual address end.  Not every virtual page between start and end
  * is mapped; only those for which a resident page exists with the
  * corresponding offset from m_start are mapped.
  */
 static void
 mmu_booke_enter_object(mmu_t mmu, pmap_t pmap, vm_offset_t start,
     vm_offset_t end, vm_page_t m_start, vm_prot_t prot)
 {
 	vm_page_t m;
 	vm_pindex_t diff, psize;
 
 	psize = atop(end - start);
 	m = m_start;
 	vm_page_lock_queues();
 	PMAP_LOCK(pmap);
 	while (m != NULL && (diff = m->pindex - m_start->pindex) < psize) {
 		mmu_booke_enter_locked(mmu, pmap, start + ptoa(diff), m,
 		    prot & (VM_PROT_READ | VM_PROT_EXECUTE), FALSE);
 		m = TAILQ_NEXT(m, listq);
 	}
 	vm_page_unlock_queues();
 	PMAP_UNLOCK(pmap);
 }
 
 static void
 mmu_booke_enter_quick(mmu_t mmu, pmap_t pmap, vm_offset_t va, vm_page_t m,
     vm_prot_t prot)
 {
 
 	vm_page_lock_queues();
 	PMAP_LOCK(pmap);
 	mmu_booke_enter_locked(mmu, pmap, va, m,
 	    prot & (VM_PROT_READ | VM_PROT_EXECUTE), FALSE);
 	vm_page_unlock_queues();
 	PMAP_UNLOCK(pmap);
 }
 
 /*
  * Remove the given range of addresses from the specified map.
  *
  * It is assumed that the start and end are properly rounded to the page size.
  */
 static void
 mmu_booke_remove(mmu_t mmu, pmap_t pmap, vm_offset_t va, vm_offset_t endva)
 {
 	pte_t *pte;
 	uint8_t hold_flag;
 
 	int su = (pmap == kernel_pmap);
 
 	//debugf("mmu_booke_remove: s (su = %d pmap=0x%08x tid=%d va=0x%08x endva=0x%08x)\n",
 	//		su, (u_int32_t)pmap, pmap->pm_tid, va, endva);
 
 	if (su) {
 		KASSERT(((va >= virtual_avail) &&
 		    (va <= VM_MAX_KERNEL_ADDRESS)),
 		    ("mmu_booke_remove: kernel pmap, non kernel va"));
 	} else {
 		KASSERT((va <= VM_MAXUSER_ADDRESS),
 		    ("mmu_booke_remove: user pmap, non user va"));
 	}
 
 	if (PMAP_REMOVE_DONE(pmap)) {
 		//debugf("mmu_booke_remove: e (empty)\n");
 		return;
 	}
 
 	hold_flag = PTBL_HOLD_FLAG(pmap);
 	//debugf("mmu_booke_remove: hold_flag = %d\n", hold_flag);
 
 	vm_page_lock_queues();
 	PMAP_LOCK(pmap);
 	for (; va < endva; va += PAGE_SIZE) {
 		pte = pte_find(mmu, pmap, va);
 		if ((pte != NULL) && PTE_ISVALID(pte))
 			pte_remove(mmu, pmap, va, hold_flag);
 	}
 	PMAP_UNLOCK(pmap);
 	vm_page_unlock_queues();
 
 	//debugf("mmu_booke_remove: e\n");
 }
 
 /*
  * Remove physical page from all pmaps in which it resides.
  */
 static void
 mmu_booke_remove_all(mmu_t mmu, vm_page_t m)
 {
 	pv_entry_t pv, pvn;
 	uint8_t hold_flag;
 
 	vm_page_lock_queues();
 	for (pv = TAILQ_FIRST(&m->md.pv_list); pv != NULL; pv = pvn) {
 		pvn = TAILQ_NEXT(pv, pv_link);
 
 		PMAP_LOCK(pv->pv_pmap);
 		hold_flag = PTBL_HOLD_FLAG(pv->pv_pmap);
 		pte_remove(mmu, pv->pv_pmap, pv->pv_va, hold_flag);
 		PMAP_UNLOCK(pv->pv_pmap);
 	}
 	vm_page_flag_clear(m, PG_WRITEABLE);
 	vm_page_unlock_queues();
 }
 
 /*
  * Map a range of physical addresses into kernel virtual address space.
  */
 static vm_offset_t
 mmu_booke_map(mmu_t mmu, vm_offset_t *virt, vm_offset_t pa_start,
     vm_offset_t pa_end, int prot)
 {
 	vm_offset_t sva = *virt;
 	vm_offset_t va = sva;
 
 	//debugf("mmu_booke_map: s (sva = 0x%08x pa_start = 0x%08x pa_end = 0x%08x)\n",
 	//		sva, pa_start, pa_end);
 
 	while (pa_start < pa_end) {
 		mmu_booke_kenter(mmu, va, pa_start);
 		va += PAGE_SIZE;
 		pa_start += PAGE_SIZE;
 	}
 	*virt = va;
 
 	//debugf("mmu_booke_map: e (va = 0x%08x)\n", va);
 	return (sva);
 }
 
 /*
  * The pmap must be activated before it's address space can be accessed in any
  * way.
  */
 static void
 mmu_booke_activate(mmu_t mmu, struct thread *td)
 {
 	pmap_t pmap;
 
 	pmap = &td->td_proc->p_vmspace->vm_pmap;
 
 	CTR5(KTR_PMAP, "%s: s (td = %p, proc = '%s', id = %d, pmap = 0x%08x)",
 	    __func__, td, td->td_proc->p_comm, td->td_proc->p_pid, pmap);
 
 	KASSERT((pmap != kernel_pmap), ("mmu_booke_activate: kernel_pmap!"));
 
 	mtx_lock_spin(&sched_lock);
 
-	atomic_set_int(&pmap->pm_active, PCPU_GET(cpumask));
+	CPU_OR_ATOMIC(&pmap->pm_active, PCPU_PTR(cpumask));
 	PCPU_SET(curpmap, pmap);
 	
 	if (pmap->pm_tid[PCPU_GET(cpuid)] == TID_NONE)
 		tid_alloc(pmap);
 
 	/* Load PID0 register with pmap tid value. */
 	mtspr(SPR_PID0, pmap->pm_tid[PCPU_GET(cpuid)]);
 	__asm __volatile("isync");
 
 	mtx_unlock_spin(&sched_lock);
 
 	CTR3(KTR_PMAP, "%s: e (tid = %d for '%s')", __func__,
 	    pmap->pm_tid[PCPU_GET(cpuid)], td->td_proc->p_comm);
 }
 
 /*
  * Deactivate the specified process's address space.
  */
 static void
 mmu_booke_deactivate(mmu_t mmu, struct thread *td)
 {
 	pmap_t pmap;
 
 	pmap = &td->td_proc->p_vmspace->vm_pmap;
 	
 	CTR5(KTR_PMAP, "%s: td=%p, proc = '%s', id = %d, pmap = 0x%08x",
 	    __func__, td, td->td_proc->p_comm, td->td_proc->p_pid, pmap);
 
-	atomic_clear_int(&pmap->pm_active, PCPU_GET(cpumask));
+	sched_pin();
+	CPU_NAND_ATOMIC(&pmap->pm_active, PCPU_PTR(cpumask));
+	sched_unpin();
 	PCPU_SET(curpmap, NULL);
 }
 
 /*
  * Copy the range specified by src_addr/len
  * from the source map to the range dst_addr/len
  * in the destination map.
  *
  * This routine is only advisory and need not do anything.
  */
 static void
 mmu_booke_copy(mmu_t mmu, pmap_t dst_pmap, pmap_t src_pmap,
     vm_offset_t dst_addr, vm_size_t len, vm_offset_t src_addr)
 {
 
 }
 
 /*
  * Set the physical protection on the specified range of this map as requested.
  */
 static void
 mmu_booke_protect(mmu_t mmu, pmap_t pmap, vm_offset_t sva, vm_offset_t eva,
     vm_prot_t prot)
 {
 	vm_offset_t va;
 	vm_page_t m;
 	pte_t *pte;
 
 	if ((prot & VM_PROT_READ) == VM_PROT_NONE) {
 		mmu_booke_remove(mmu, pmap, sva, eva);
 		return;
 	}
 
 	if (prot & VM_PROT_WRITE)
 		return;
 
 	vm_page_lock_queues();
 	PMAP_LOCK(pmap);
 	for (va = sva; va < eva; va += PAGE_SIZE) {
 		if ((pte = pte_find(mmu, pmap, va)) != NULL) {
 			if (PTE_ISVALID(pte)) {
 				m = PHYS_TO_VM_PAGE(PTE_PA(pte));
 
 				mtx_lock_spin(&tlbivax_mutex);
 				tlb_miss_lock();
 
 				/* Handle modified pages. */
 				if (PTE_ISMODIFIED(pte) && PTE_ISMANAGED(pte))
 					vm_page_dirty(m);
 
 				tlb0_flush_entry(va);
 				pte->flags &= ~(PTE_UW | PTE_SW | PTE_MODIFIED);
 
 				tlb_miss_unlock();
 				mtx_unlock_spin(&tlbivax_mutex);
 			}
 		}
 	}
 	PMAP_UNLOCK(pmap);
 	vm_page_unlock_queues();
 }
 
 /*
  * Clear the write and modified bits in each of the given page's mappings.
  */
 static void
 mmu_booke_remove_write(mmu_t mmu, vm_page_t m)
 {
 	pv_entry_t pv;
 	pte_t *pte;
 
 	KASSERT((m->flags & (PG_FICTITIOUS | PG_UNMANAGED)) == 0,
 	    ("mmu_booke_remove_write: page %p is not managed", m));
 
 	/*
 	 * If the page is not VPO_BUSY, then PG_WRITEABLE cannot be set by
 	 * another thread while the object is locked.  Thus, if PG_WRITEABLE
 	 * is clear, no page table entries need updating.
 	 */
 	VM_OBJECT_LOCK_ASSERT(m->object, MA_OWNED);
 	if ((m->oflags & VPO_BUSY) == 0 &&
 	    (m->flags & PG_WRITEABLE) == 0)
 		return;
 	vm_page_lock_queues();
 	TAILQ_FOREACH(pv, &m->md.pv_list, pv_link) {
 		PMAP_LOCK(pv->pv_pmap);
 		if ((pte = pte_find(mmu, pv->pv_pmap, pv->pv_va)) != NULL) {
 			if (PTE_ISVALID(pte)) {
 				m = PHYS_TO_VM_PAGE(PTE_PA(pte));
 
 				mtx_lock_spin(&tlbivax_mutex);
 				tlb_miss_lock();
 
 				/* Handle modified pages. */
 				if (PTE_ISMODIFIED(pte))
 					vm_page_dirty(m);
 
 				/* Flush mapping from TLB0. */
 				pte->flags &= ~(PTE_UW | PTE_SW | PTE_MODIFIED);
 
 				tlb_miss_unlock();
 				mtx_unlock_spin(&tlbivax_mutex);
 			}
 		}
 		PMAP_UNLOCK(pv->pv_pmap);
 	}
 	vm_page_flag_clear(m, PG_WRITEABLE);
 	vm_page_unlock_queues();
 }
 
 static void
 mmu_booke_sync_icache(mmu_t mmu, pmap_t pm, vm_offset_t va, vm_size_t sz)
 {
 	pte_t *pte;
 	pmap_t pmap;
 	vm_page_t m;
 	vm_offset_t addr;
 	vm_paddr_t pa;
 	int active, valid;
  
 	va = trunc_page(va);
 	sz = round_page(sz);
 
 	vm_page_lock_queues();
 	pmap = PCPU_GET(curpmap);
 	active = (pm == kernel_pmap || pm == pmap) ? 1 : 0;
 	while (sz > 0) {
 		PMAP_LOCK(pm);
 		pte = pte_find(mmu, pm, va);
 		valid = (pte != NULL && PTE_ISVALID(pte)) ? 1 : 0;
 		if (valid)
 			pa = PTE_PA(pte);
 		PMAP_UNLOCK(pm);
 		if (valid) {
 			if (!active) {
 				/* Create a mapping in the active pmap. */
 				addr = 0;
 				m = PHYS_TO_VM_PAGE(pa);
 				PMAP_LOCK(pmap);
 				pte_enter(mmu, pmap, m, addr,
 				    PTE_SR | PTE_VALID | PTE_UR);
 				__syncicache((void *)addr, PAGE_SIZE);
 				pte_remove(mmu, pmap, addr, PTBL_UNHOLD);
 				PMAP_UNLOCK(pmap);
 			} else
 				__syncicache((void *)va, PAGE_SIZE);
 		}
 		va += PAGE_SIZE;
 		sz -= PAGE_SIZE;
 	}
 	vm_page_unlock_queues();
 }
 
 /*
  * Atomically extract and hold the physical page with the given
  * pmap and virtual address pair if that mapping permits the given
  * protection.
  */
 static vm_page_t
 mmu_booke_extract_and_hold(mmu_t mmu, pmap_t pmap, vm_offset_t va,
     vm_prot_t prot)
 {
 	pte_t *pte;
 	vm_page_t m;
 	uint32_t pte_wbit;
 	vm_paddr_t pa;
 	
 	m = NULL;
 	pa = 0;	
 	PMAP_LOCK(pmap);
 retry:
 	pte = pte_find(mmu, pmap, va);
 	if ((pte != NULL) && PTE_ISVALID(pte)) {
 		if (pmap == kernel_pmap)
 			pte_wbit = PTE_SW;
 		else
 			pte_wbit = PTE_UW;
 
 		if ((pte->flags & pte_wbit) || ((prot & VM_PROT_WRITE) == 0)) {
 			if (vm_page_pa_tryrelock(pmap, PTE_PA(pte), &pa))
 				goto retry;
 			m = PHYS_TO_VM_PAGE(PTE_PA(pte));
 			vm_page_hold(m);
 		}
 	}
 
 	PA_UNLOCK_COND(pa);
 	PMAP_UNLOCK(pmap);
 	return (m);
 }
 
 /*
  * Initialize a vm_page's machine-dependent fields.
  */
 static void
 mmu_booke_page_init(mmu_t mmu, vm_page_t m)
 {
 
 	TAILQ_INIT(&m->md.pv_list);
 }
 
 /*
  * mmu_booke_zero_page_area zeros the specified hardware page by
  * mapping it into virtual memory and using bzero to clear
  * its contents.
  *
  * off and size must reside within a single page.
  */
 static void
 mmu_booke_zero_page_area(mmu_t mmu, vm_page_t m, int off, int size)
 {
 	vm_offset_t va;
 
 	/* XXX KASSERT off and size are within a single page? */
 
 	mtx_lock(&zero_page_mutex);
 	va = zero_page_va;
 
 	mmu_booke_kenter(mmu, va, VM_PAGE_TO_PHYS(m));
 	bzero((caddr_t)va + off, size);
 	mmu_booke_kremove(mmu, va);
 
 	mtx_unlock(&zero_page_mutex);
 }
 
 /*
  * mmu_booke_zero_page zeros the specified hardware page.
  */
 static void
 mmu_booke_zero_page(mmu_t mmu, vm_page_t m)
 {
 
 	mmu_booke_zero_page_area(mmu, m, 0, PAGE_SIZE);
 }
 
 /*
  * mmu_booke_copy_page copies the specified (machine independent) page by
  * mapping the page into virtual memory and using memcopy to copy the page,
  * one machine dependent page at a time.
  */
 static void
 mmu_booke_copy_page(mmu_t mmu, vm_page_t sm, vm_page_t dm)
 {
 	vm_offset_t sva, dva;
 
 	sva = copy_page_src_va;
 	dva = copy_page_dst_va;
 
 	mtx_lock(&copy_page_mutex);
 	mmu_booke_kenter(mmu, sva, VM_PAGE_TO_PHYS(sm));
 	mmu_booke_kenter(mmu, dva, VM_PAGE_TO_PHYS(dm));
 	memcpy((caddr_t)dva, (caddr_t)sva, PAGE_SIZE);
 	mmu_booke_kremove(mmu, dva);
 	mmu_booke_kremove(mmu, sva);
 	mtx_unlock(&copy_page_mutex);
 }
 
 /*
  * mmu_booke_zero_page_idle zeros the specified hardware page by mapping it
  * into virtual memory and using bzero to clear its contents. This is intended
  * to be called from the vm_pagezero process only and outside of Giant. No
  * lock is required.
  */
 static void
 mmu_booke_zero_page_idle(mmu_t mmu, vm_page_t m)
 {
 	vm_offset_t va;
 
 	va = zero_page_idle_va;
 	mmu_booke_kenter(mmu, va, VM_PAGE_TO_PHYS(m));
 	bzero((caddr_t)va, PAGE_SIZE);
 	mmu_booke_kremove(mmu, va);
 }
 
 /*
  * Return whether or not the specified physical page was modified
  * in any of the physical maps.
  */
 static boolean_t
 mmu_booke_is_modified(mmu_t mmu, vm_page_t m)
 {
 	pte_t *pte;
 	pv_entry_t pv;
 	boolean_t rv;
 
 	KASSERT((m->flags & (PG_FICTITIOUS | PG_UNMANAGED)) == 0,
 	    ("mmu_booke_is_modified: page %p is not managed", m));
 	rv = FALSE;
 
 	/*
 	 * If the page is not VPO_BUSY, then PG_WRITEABLE cannot be
 	 * concurrently set while the object is locked.  Thus, if PG_WRITEABLE
 	 * is clear, no PTEs can be modified.
 	 */
 	VM_OBJECT_LOCK_ASSERT(m->object, MA_OWNED);
 	if ((m->oflags & VPO_BUSY) == 0 &&
 	    (m->flags & PG_WRITEABLE) == 0)
 		return (rv);
 	vm_page_lock_queues();
 	TAILQ_FOREACH(pv, &m->md.pv_list, pv_link) {
 		PMAP_LOCK(pv->pv_pmap);
 		if ((pte = pte_find(mmu, pv->pv_pmap, pv->pv_va)) != NULL &&
 		    PTE_ISVALID(pte)) {
 			if (PTE_ISMODIFIED(pte))
 				rv = TRUE;
 		}
 		PMAP_UNLOCK(pv->pv_pmap);
 		if (rv)
 			break;
 	}
 	vm_page_unlock_queues();
 	return (rv);
 }
 
 /*
  * Return whether or not the specified virtual address is eligible
  * for prefault.
  */
 static boolean_t
 mmu_booke_is_prefaultable(mmu_t mmu, pmap_t pmap, vm_offset_t addr)
 {
 
 	return (FALSE);
 }
 
 /*
  * Return whether or not the specified physical page was referenced
  * in any physical maps.
  */
 static boolean_t
 mmu_booke_is_referenced(mmu_t mmu, vm_page_t m)
 {
 	pte_t *pte;
 	pv_entry_t pv;
 	boolean_t rv;
 
 	KASSERT((m->flags & (PG_FICTITIOUS | PG_UNMANAGED)) == 0,
 	    ("mmu_booke_is_referenced: page %p is not managed", m));
 	rv = FALSE;
 	vm_page_lock_queues();
 	TAILQ_FOREACH(pv, &m->md.pv_list, pv_link) {
 		PMAP_LOCK(pv->pv_pmap);
 		if ((pte = pte_find(mmu, pv->pv_pmap, pv->pv_va)) != NULL &&
 		    PTE_ISVALID(pte)) {
 			if (PTE_ISREFERENCED(pte))
 				rv = TRUE;
 		}
 		PMAP_UNLOCK(pv->pv_pmap);
 		if (rv)
 			break;
 	}
 	vm_page_unlock_queues();
 	return (rv);
 }
 
 /*
  * Clear the modify bits on the specified physical page.
  */
 static void
 mmu_booke_clear_modify(mmu_t mmu, vm_page_t m)
 {
 	pte_t *pte;
 	pv_entry_t pv;
 
 	KASSERT((m->flags & (PG_FICTITIOUS | PG_UNMANAGED)) == 0,
 	    ("mmu_booke_clear_modify: page %p is not managed", m));
 	VM_OBJECT_LOCK_ASSERT(m->object, MA_OWNED);
 	KASSERT((m->oflags & VPO_BUSY) == 0,
 	    ("mmu_booke_clear_modify: page %p is busy", m));
 
 	/*
 	 * If the page is not PG_WRITEABLE, then no PTEs can be modified.
 	 * If the object containing the page is locked and the page is not
 	 * VPO_BUSY, then PG_WRITEABLE cannot be concurrently set.
 	 */
 	if ((m->flags & PG_WRITEABLE) == 0)
 		return;
 	vm_page_lock_queues();
 	TAILQ_FOREACH(pv, &m->md.pv_list, pv_link) {
 		PMAP_LOCK(pv->pv_pmap);
 		if ((pte = pte_find(mmu, pv->pv_pmap, pv->pv_va)) != NULL &&
 		    PTE_ISVALID(pte)) {
 			mtx_lock_spin(&tlbivax_mutex);
 			tlb_miss_lock();
 			
 			if (pte->flags & (PTE_SW | PTE_UW | PTE_MODIFIED)) {
 				tlb0_flush_entry(pv->pv_va);
 				pte->flags &= ~(PTE_SW | PTE_UW | PTE_MODIFIED |
 				    PTE_REFERENCED);
 			}
 
 			tlb_miss_unlock();
 			mtx_unlock_spin(&tlbivax_mutex);
 		}
 		PMAP_UNLOCK(pv->pv_pmap);
 	}
 	vm_page_unlock_queues();
 }
 
 /*
  * Return a count of reference bits for a page, clearing those bits.
  * It is not necessary for every reference bit to be cleared, but it
  * is necessary that 0 only be returned when there are truly no
  * reference bits set.
  *
  * XXX: The exact number of bits to check and clear is a matter that
  * should be tested and standardized at some point in the future for
  * optimal aging of shared pages.
  */
 static int
 mmu_booke_ts_referenced(mmu_t mmu, vm_page_t m)
 {
 	pte_t *pte;
 	pv_entry_t pv;
 	int count;
 
 	KASSERT((m->flags & (PG_FICTITIOUS | PG_UNMANAGED)) == 0,
 	    ("mmu_booke_ts_referenced: page %p is not managed", m));
 	count = 0;
 	vm_page_lock_queues();
 	TAILQ_FOREACH(pv, &m->md.pv_list, pv_link) {
 		PMAP_LOCK(pv->pv_pmap);
 		if ((pte = pte_find(mmu, pv->pv_pmap, pv->pv_va)) != NULL &&
 		    PTE_ISVALID(pte)) {
 			if (PTE_ISREFERENCED(pte)) {
 				mtx_lock_spin(&tlbivax_mutex);
 				tlb_miss_lock();
 
 				tlb0_flush_entry(pv->pv_va);
 				pte->flags &= ~PTE_REFERENCED;
 
 				tlb_miss_unlock();
 				mtx_unlock_spin(&tlbivax_mutex);
 
 				if (++count > 4) {
 					PMAP_UNLOCK(pv->pv_pmap);
 					break;
 				}
 			}
 		}
 		PMAP_UNLOCK(pv->pv_pmap);
 	}
 	vm_page_unlock_queues();
 	return (count);
 }
 
 /*
  * Clear the reference bit on the specified physical page.
  */
 static void
 mmu_booke_clear_reference(mmu_t mmu, vm_page_t m)
 {
 	pte_t *pte;
 	pv_entry_t pv;
 
 	KASSERT((m->flags & (PG_FICTITIOUS | PG_UNMANAGED)) == 0,
 	    ("mmu_booke_clear_reference: page %p is not managed", m));
 	vm_page_lock_queues();
 	TAILQ_FOREACH(pv, &m->md.pv_list, pv_link) {
 		PMAP_LOCK(pv->pv_pmap);
 		if ((pte = pte_find(mmu, pv->pv_pmap, pv->pv_va)) != NULL &&
 		    PTE_ISVALID(pte)) {
 			if (PTE_ISREFERENCED(pte)) {
 				mtx_lock_spin(&tlbivax_mutex);
 				tlb_miss_lock();
 				
 				tlb0_flush_entry(pv->pv_va);
 				pte->flags &= ~PTE_REFERENCED;
 
 				tlb_miss_unlock();
 				mtx_unlock_spin(&tlbivax_mutex);
 			}
 		}
 		PMAP_UNLOCK(pv->pv_pmap);
 	}
 	vm_page_unlock_queues();
 }
 
 /*
  * Change wiring attribute for a map/virtual-address pair.
  */
 static void
 mmu_booke_change_wiring(mmu_t mmu, pmap_t pmap, vm_offset_t va, boolean_t wired)
 {
 	pte_t *pte;
 
 	PMAP_LOCK(pmap);
 	if ((pte = pte_find(mmu, pmap, va)) != NULL) {
 		if (wired) {
 			if (!PTE_ISWIRED(pte)) {
 				pte->flags |= PTE_WIRED;
 				pmap->pm_stats.wired_count++;
 			}
 		} else {
 			if (PTE_ISWIRED(pte)) {
 				pte->flags &= ~PTE_WIRED;
 				pmap->pm_stats.wired_count--;
 			}
 		}
 	}
 	PMAP_UNLOCK(pmap);
 }
 
 /*
  * Return true if the pmap's pv is one of the first 16 pvs linked to from this
  * page.  This count may be changed upwards or downwards in the future; it is
  * only necessary that true be returned for a small subset of pmaps for proper
  * page aging.
  */
 static boolean_t
 mmu_booke_page_exists_quick(mmu_t mmu, pmap_t pmap, vm_page_t m)
 {
 	pv_entry_t pv;
 	int loops;
 	boolean_t rv;
 
 	KASSERT((m->flags & (PG_FICTITIOUS | PG_UNMANAGED)) == 0,
 	    ("mmu_booke_page_exists_quick: page %p is not managed", m));
 	loops = 0;
 	rv = FALSE;
 	vm_page_lock_queues();
 	TAILQ_FOREACH(pv, &m->md.pv_list, pv_link) {
 		if (pv->pv_pmap == pmap) {
 			rv = TRUE;
 			break;
 		}
 		if (++loops >= 16)
 			break;
 	}
 	vm_page_unlock_queues();
 	return (rv);
 }
 
 /*
  * Return the number of managed mappings to the given physical page that are
  * wired.
  */
 static int
 mmu_booke_page_wired_mappings(mmu_t mmu, vm_page_t m)
 {
 	pv_entry_t pv;
 	pte_t *pte;
 	int count = 0;
 
 	if ((m->flags & PG_FICTITIOUS) != 0)
 		return (count);
 	vm_page_lock_queues();
 	TAILQ_FOREACH(pv, &m->md.pv_list, pv_link) {
 		PMAP_LOCK(pv->pv_pmap);
 		if ((pte = pte_find(mmu, pv->pv_pmap, pv->pv_va)) != NULL)
 			if (PTE_ISVALID(pte) && PTE_ISWIRED(pte))
 				count++;
 		PMAP_UNLOCK(pv->pv_pmap);
 	}
 	vm_page_unlock_queues();
 	return (count);
 }
 
 static int
 mmu_booke_dev_direct_mapped(mmu_t mmu, vm_offset_t pa, vm_size_t size)
 {
 	int i;
 	vm_offset_t va;
 
 	/*
 	 * This currently does not work for entries that
 	 * overlap TLB1 entries.
 	 */
 	for (i = 0; i < tlb1_idx; i ++) {
 		if (tlb1_iomapped(i, pa, size, &va) == 0)
 			return (0);
 	}
 
 	return (EFAULT);
 }
 
 vm_offset_t
 mmu_booke_dumpsys_map(mmu_t mmu, struct pmap_md *md, vm_size_t ofs,
     vm_size_t *sz)
 {
 	vm_paddr_t pa, ppa;
 	vm_offset_t va;
 	vm_size_t gran;
 
 	/* Raw physical memory dumps don't have a virtual address. */
 	if (md->md_vaddr == ~0UL) {
 		/* We always map a 256MB page at 256M. */
 		gran = 256 * 1024 * 1024;
 		pa = md->md_paddr + ofs;
 		ppa = pa & ~(gran - 1);
 		ofs = pa - ppa;
 		va = gran;
 		tlb1_set_entry(va, ppa, gran, _TLB_ENTRY_IO);
 		if (*sz > (gran - ofs))
 			*sz = gran - ofs;
 		return (va + ofs);
 	}
 
 	/* Minidumps are based on virtual memory addresses. */
 	va = md->md_vaddr + ofs;
 	if (va >= kernstart + kernsize) {
 		gran = PAGE_SIZE - (va & PAGE_MASK);
 		if (*sz > gran)
 			*sz = gran;
 	}
 	return (va);
 }
 
 void
 mmu_booke_dumpsys_unmap(mmu_t mmu, struct pmap_md *md, vm_size_t ofs,
     vm_offset_t va)
 {
 
 	/* Raw physical memory dumps don't have a virtual address. */
 	if (md->md_vaddr == ~0UL) {
 		tlb1_idx--;
 		tlb1[tlb1_idx].mas1 = 0;
 		tlb1[tlb1_idx].mas2 = 0;
 		tlb1[tlb1_idx].mas3 = 0;
 		tlb1_write_entry(tlb1_idx);
 		return;
 	}
  
 	/* Minidumps are based on virtual memory addresses. */
 	/* Nothing to do... */
 }
 
 struct pmap_md *
 mmu_booke_scan_md(mmu_t mmu, struct pmap_md *prev)
 {
 	static struct pmap_md md;
 	pte_t *pte;
 	vm_offset_t va;
  
 	if (dumpsys_minidump) {
 		md.md_paddr = ~0UL;	/* Minidumps use virtual addresses. */
 		if (prev == NULL) {
 			/* 1st: kernel .data and .bss. */
 			md.md_index = 1;
 			md.md_vaddr = trunc_page((uintptr_t)_etext);
 			md.md_size = round_page((uintptr_t)_end) - md.md_vaddr;
 			return (&md);
 		}
 		switch (prev->md_index) {
 		case 1:
 			/* 2nd: msgbuf and tables (see pmap_bootstrap()). */
 			md.md_index = 2;
 			md.md_vaddr = data_start;
 			md.md_size = data_end - data_start;
 			break;
 		case 2:
 			/* 3rd: kernel VM. */
 			va = prev->md_vaddr + prev->md_size;
 			/* Find start of next chunk (from va). */
 			while (va < virtual_end) {
 				/* Don't dump the buffer cache. */
 				if (va >= kmi.buffer_sva &&
 				    va < kmi.buffer_eva) {
 					va = kmi.buffer_eva;
 					continue;
 				}
 				pte = pte_find(mmu, kernel_pmap, va);
 				if (pte != NULL && PTE_ISVALID(pte))
 					break;
 				va += PAGE_SIZE;
 			}
 			if (va < virtual_end) {
 				md.md_vaddr = va;
 				va += PAGE_SIZE;
 				/* Find last page in chunk. */
 				while (va < virtual_end) {
 					/* Don't run into the buffer cache. */
 					if (va == kmi.buffer_sva)
 						break;
 					pte = pte_find(mmu, kernel_pmap, va);
 					if (pte == NULL || !PTE_ISVALID(pte))
 						break;
 					va += PAGE_SIZE;
 				}
 				md.md_size = va - md.md_vaddr;
 				break;
 			}
 			md.md_index = 3;
 			/* FALLTHROUGH */
 		default:
 			return (NULL);
 		}
 	} else { /* full physical memory dumps */
 		mem_regions(&physmem_regions, &physmem_regions_sz,
 		    &availmem_regions, &availmem_regions_sz);
 
 		if (prev == NULL) {
 			/* first physical chunk. */
 			md.md_paddr = physmem_regions[0].mr_start;
 			md.md_size = physmem_regions[0].mr_size;
 			md.md_vaddr = ~0UL;
 			md.md_index = 1;
 		} else if (md.md_index < physmem_regions_sz) {
 			md.md_paddr = physmem_regions[md.md_index].mr_start;
 			md.md_size = physmem_regions[md.md_index].mr_size;
 			md.md_vaddr = ~0UL;
 			md.md_index++;
 		} else {
 			/* There's no next physical chunk. */
 			return (NULL);
 		}
 	}
 
 	return (&md);
 }
 
 /*
  * Map a set of physical memory pages into the kernel virtual address space.
  * Return a pointer to where it is mapped. This routine is intended to be used
  * for mapping device memory, NOT real memory.
  */
 static void *
 mmu_booke_mapdev(mmu_t mmu, vm_offset_t pa, vm_size_t size)
 {
 	void *res;
 	uintptr_t va;
 	vm_size_t sz;
 
 	va = (pa >= 0x80000000) ? pa : (0xe2000000 + pa);
 	res = (void *)va;
 
 	do {
 		sz = 1 << (ilog2(size) & ~1);
 		if (bootverbose)
 			printf("Wiring VA=%x to PA=%x (size=%x), "
 			    "using TLB1[%d]\n", va, pa, sz, tlb1_idx);
 		tlb1_set_entry(va, pa, sz, _TLB_ENTRY_IO);
 		size -= sz;
 		pa += sz;
 		va += sz;
 	} while (size > 0);
 
 	return (res);
 }
 
 /*
  * 'Unmap' a range mapped by mmu_booke_mapdev().
  */
 static void
 mmu_booke_unmapdev(mmu_t mmu, vm_offset_t va, vm_size_t size)
 {
 	vm_offset_t base, offset;
 
 	/*
 	 * Unmap only if this is inside kernel virtual space.
 	 */
 	if ((va >= VM_MIN_KERNEL_ADDRESS) && (va <= VM_MAX_KERNEL_ADDRESS)) {
 		base = trunc_page(va);
 		offset = va & PAGE_MASK;
 		size = roundup(offset + size, PAGE_SIZE);
 		kmem_free(kernel_map, base, size);
 	}
 }
 
 /*
  * mmu_booke_object_init_pt preloads the ptes for a given object into the
  * specified pmap. This eliminates the blast of soft faults on process startup
  * and immediately after an mmap.
  */
 static void
 mmu_booke_object_init_pt(mmu_t mmu, pmap_t pmap, vm_offset_t addr,
     vm_object_t object, vm_pindex_t pindex, vm_size_t size)
 {
 
 	VM_OBJECT_LOCK_ASSERT(object, MA_OWNED);
 	KASSERT(object->type == OBJT_DEVICE || object->type == OBJT_SG,
 	    ("mmu_booke_object_init_pt: non-device object"));
 }
 
 /*
  * Perform the pmap work for mincore.
  */
 static int
 mmu_booke_mincore(mmu_t mmu, pmap_t pmap, vm_offset_t addr,
     vm_paddr_t *locked_pa)
 {
 
 	TODO;
 	return (0);
 }
 
 /**************************************************************************/
 /* TID handling */
 /**************************************************************************/
 
 /*
  * Allocate a TID. If necessary, steal one from someone else.
  * The new TID is flushed from the TLB before returning.
  */
 static tlbtid_t
 tid_alloc(pmap_t pmap)
 {
 	tlbtid_t tid;
 	int thiscpu;
 
 	KASSERT((pmap != kernel_pmap), ("tid_alloc: kernel pmap"));
 
 	CTR2(KTR_PMAP, "%s: s (pmap = %p)", __func__, pmap);
 
 	thiscpu = PCPU_GET(cpuid);
 
 	tid = PCPU_GET(tid_next);
 	if (tid > TID_MAX)
 		tid = TID_MIN;
 	PCPU_SET(tid_next, tid + 1);
 
 	/* If we are stealing TID then clear the relevant pmap's field */
 	if (tidbusy[thiscpu][tid] != NULL) {
 
 		CTR2(KTR_PMAP, "%s: warning: stealing tid %d", __func__, tid);
 		
 		tidbusy[thiscpu][tid]->pm_tid[thiscpu] = TID_NONE;
 
 		/* Flush all entries from TLB0 matching this TID. */
 		tid_flush(tid);
 	}
 
 	tidbusy[thiscpu][tid] = pmap;
 	pmap->pm_tid[thiscpu] = tid;
 	__asm __volatile("msync; isync");
 
 	CTR3(KTR_PMAP, "%s: e (%02d next = %02d)", __func__, tid,
 	    PCPU_GET(tid_next));
 
 	return (tid);
 }
 
 /**************************************************************************/
 /* TLB0 handling */
 /**************************************************************************/
 
 static void
 tlb_print_entry(int i, uint32_t mas1, uint32_t mas2, uint32_t mas3,
     uint32_t mas7)
 {
 	int as;
 	char desc[3];
 	tlbtid_t tid;
 	vm_size_t size;
 	unsigned int tsize;
 
 	desc[2] = '\0';
 	if (mas1 & MAS1_VALID)
 		desc[0] = 'V';
 	else
 		desc[0] = ' ';
 
 	if (mas1 & MAS1_IPROT)
 		desc[1] = 'P';
 	else
 		desc[1] = ' ';
 
 	as = (mas1 & MAS1_TS_MASK) ? 1 : 0;
 	tid = MAS1_GETTID(mas1);
 
 	tsize = (mas1 & MAS1_TSIZE_MASK) >> MAS1_TSIZE_SHIFT;
 	size = 0;
 	if (tsize)
 		size = tsize2size(tsize);
 
 	debugf("%3d: (%s) [AS=%d] "
 	    "sz = 0x%08x tsz = %d tid = %d mas1 = 0x%08x "
 	    "mas2(va) = 0x%08x mas3(pa) = 0x%08x mas7 = 0x%08x\n",
 	    i, desc, as, size, tsize, tid, mas1, mas2, mas3, mas7);
 }
 
 /* Convert TLB0 va and way number to tlb0[] table index. */
 static inline unsigned int
 tlb0_tableidx(vm_offset_t va, unsigned int way)
 {
 	unsigned int idx;
 
 	idx = (way * TLB0_ENTRIES_PER_WAY);
 	idx += (va & MAS2_TLB0_ENTRY_IDX_MASK) >> MAS2_TLB0_ENTRY_IDX_SHIFT;
 	return (idx);
 }
 
 /*
  * Invalidate TLB0 entry.
  */
 static inline void
 tlb0_flush_entry(vm_offset_t va)
 {
 
 	CTR2(KTR_PMAP, "%s: s va=0x%08x", __func__, va);
 
 	mtx_assert(&tlbivax_mutex, MA_OWNED);
 
 	__asm __volatile("tlbivax 0, %0" :: "r"(va & MAS2_EPN_MASK));
 	__asm __volatile("isync; msync");
 	__asm __volatile("tlbsync; msync");
 
 	CTR1(KTR_PMAP, "%s: e", __func__);
 }
 
 /* Print out contents of the MAS registers for each TLB0 entry */
 void
 tlb0_print_tlbentries(void)
 {
 	uint32_t mas0, mas1, mas2, mas3, mas7;
 	int entryidx, way, idx;
 
 	debugf("TLB0 entries:\n");
 	for (way = 0; way < TLB0_WAYS; way ++)
 		for (entryidx = 0; entryidx < TLB0_ENTRIES_PER_WAY; entryidx++) {
 
 			mas0 = MAS0_TLBSEL(0) | MAS0_ESEL(way);
 			mtspr(SPR_MAS0, mas0);
 			__asm __volatile("isync");
 
 			mas2 = entryidx << MAS2_TLB0_ENTRY_IDX_SHIFT;
 			mtspr(SPR_MAS2, mas2);
 
 			__asm __volatile("isync; tlbre");
 
 			mas1 = mfspr(SPR_MAS1);
 			mas2 = mfspr(SPR_MAS2);
 			mas3 = mfspr(SPR_MAS3);
 			mas7 = mfspr(SPR_MAS7);
 
 			idx = tlb0_tableidx(mas2, way);
 			tlb_print_entry(idx, mas1, mas2, mas3, mas7);
 		}
 }
 
 /**************************************************************************/
 /* TLB1 handling */
 /**************************************************************************/
 
 /*
  * TLB1 mapping notes:
  *
  * TLB1[0]	Kernel text and data.
  * TLB1[1]	CCSRBAR
  * TLB1[2-15]	Additional kernel text and data mappings (if required), PCI
  *		windows, other devices mappings.
  */
 
 /*
  * Write given entry to TLB1 hardware.
  * Use 32 bit pa, clear 4 high-order bits of RPN (mas7).
  */
 static void
 tlb1_write_entry(unsigned int idx)
 {
 	uint32_t mas0, mas7;
 
 	//debugf("tlb1_write_entry: s\n");
 
 	/* Clear high order RPN bits */
 	mas7 = 0;
 
 	/* Select entry */
 	mas0 = MAS0_TLBSEL(1) | MAS0_ESEL(idx);
 	//debugf("tlb1_write_entry: mas0 = 0x%08x\n", mas0);
 
 	mtspr(SPR_MAS0, mas0);
 	__asm __volatile("isync");
 	mtspr(SPR_MAS1, tlb1[idx].mas1);
 	__asm __volatile("isync");
 	mtspr(SPR_MAS2, tlb1[idx].mas2);
 	__asm __volatile("isync");
 	mtspr(SPR_MAS3, tlb1[idx].mas3);
 	__asm __volatile("isync");
 	mtspr(SPR_MAS7, mas7);
 	__asm __volatile("isync; tlbwe; isync; msync");
 
 	//debugf("tlb1_write_entry: e\n");
 }
 
 /*
  * Return the largest uint value log such that 2^log <= num.
  */
 static unsigned int
 ilog2(unsigned int num)
 {
 	int lz;
 
 	__asm ("cntlzw %0, %1" : "=r" (lz) : "r" (num));
 	return (31 - lz);
 }
 
 /*
  * Convert TLB TSIZE value to mapped region size.
  */
 static vm_size_t
 tsize2size(unsigned int tsize)
 {
 
 	/*
 	 * size = 4^tsize KB
 	 * size = 4^tsize * 2^10 = 2^(2 * tsize + 10)
 	 */
 
 	return ((1 << (2 * tsize)) * 1024);
 }
 
 /*
  * Convert region size (must be power of 4) to TLB TSIZE value.
  */
 static unsigned int
 size2tsize(vm_size_t size)
 {
 
 	return (ilog2(size) / 2 - 5);
 }
 
 /*
  * Register permanent kernel mapping in TLB1.
  *
  * Entries are created starting from index 0 (current free entry is
  * kept in tlb1_idx) and are not supposed to be invalidated.
  */
 static int
 tlb1_set_entry(vm_offset_t va, vm_offset_t pa, vm_size_t size,
     uint32_t flags)
 {
 	uint32_t ts, tid;
 	int tsize;
 	
 	if (tlb1_idx >= TLB1_ENTRIES) {
 		printf("tlb1_set_entry: TLB1 full!\n");
 		return (-1);
 	}
 
 	/* Convert size to TSIZE */
 	tsize = size2tsize(size);
 
 	tid = (TID_KERNEL << MAS1_TID_SHIFT) & MAS1_TID_MASK;
 	/* XXX TS is hard coded to 0 for now as we only use a single address space */
 	ts = (0 << MAS1_TS_SHIFT) & MAS1_TS_MASK;
 
 	/* XXX LOCK tlb1[] */
 
 	tlb1[tlb1_idx].mas1 = MAS1_VALID | MAS1_IPROT | ts | tid;
 	tlb1[tlb1_idx].mas1 |= ((tsize << MAS1_TSIZE_SHIFT) & MAS1_TSIZE_MASK);
 	tlb1[tlb1_idx].mas2 = (va & MAS2_EPN_MASK) | flags;
 
 	/* Set supervisor RWX permission bits */
 	tlb1[tlb1_idx].mas3 = (pa & MAS3_RPN) | MAS3_SR | MAS3_SW | MAS3_SX;
 
 	tlb1_write_entry(tlb1_idx++);
 
 	/* XXX UNLOCK tlb1[] */
 
 	/*
 	 * XXX in general TLB1 updates should be propagated between CPUs,
 	 * since the current design assumes the same TLB1 set-up on all
 	 * cores.
 	 */
 	return (0);
 }
 
 static int
 tlb1_entry_size_cmp(const void *a, const void *b)
 {
 	const vm_size_t *sza;
 	const vm_size_t *szb;
 
 	sza = a;
 	szb = b;
 	if (*sza > *szb)
 		return (-1);
 	else if (*sza < *szb)
 		return (1);
 	else
 		return (0);
 }
 
 /*
  * Map a contiguous RAM region into TLB1 using at most
  * KERNEL_REGION_MAX_TLB_ENTRIES entries.
  *
  * If necessary, round up the size of the last entry. Return the total
  * size covered by all allocated entries.
  */
 vm_size_t
 tlb1_mapin_region(vm_offset_t va, vm_offset_t pa, vm_size_t size)
 {
 	vm_size_t entry_size[KERNEL_REGION_MAX_TLB_ENTRIES];
 	vm_size_t mapped_size, sz, esz;
 	unsigned int log;
 	int i;
 
 	CTR4(KTR_PMAP, "%s: region size = 0x%08x va = 0x%08x pa = 0x%08x",
 	    __func__, size, va, pa);
 
 	mapped_size = 0;
 	sz = size;
 	memset(entry_size, 0, sizeof(entry_size));
 
 	/* Calculate entry sizes. */
 	for (i = 0; i < KERNEL_REGION_MAX_TLB_ENTRIES && sz > 0; i++) {
 
 		/* Largest region that is power of 4 and fits within size */
 		log = ilog2(sz) / 2;
 		esz = 1 << (2 * log);
 
 		/* If this is last entry cover remaining size. */
 		if (i ==  KERNEL_REGION_MAX_TLB_ENTRIES - 1) {
 			while (esz < sz)
 				esz = esz << 2;
 		}
 
 		entry_size[i] = esz;
 		mapped_size += esz;
 		if (esz < sz)
 			sz -= esz;
 		else
 			sz = 0;
 	}
 
 	/* Sort entry sizes, required to get proper entry address alignment. */
 	qsort(entry_size, KERNEL_REGION_MAX_TLB_ENTRIES,
 	    sizeof(vm_size_t), tlb1_entry_size_cmp);
 
 	/* Load TLB1 entries. */
 	for (i = 0; i < KERNEL_REGION_MAX_TLB_ENTRIES; i++) {
 		esz = entry_size[i];
 		if (!esz)
 			break;
 
 		CTR5(KTR_PMAP, "%s: entry %d: sz  = 0x%08x (va = 0x%08x "
 		    "pa = 0x%08x)", __func__, tlb1_idx, esz, va, pa);
 
 		tlb1_set_entry(va, pa, esz, _TLB_ENTRY_MEM);
 
 		va += esz;
 		pa += esz;
 	}
 
 	CTR3(KTR_PMAP, "%s: mapped size 0x%08x (wasted space 0x%08x)",
 	    __func__, mapped_size, mapped_size - size);
 
 	return (mapped_size);
 }
 
 /*
  * TLB1 initialization routine, to be called after the very first
  * assembler level setup done in locore.S.
  */
 void
 tlb1_init(vm_offset_t ccsrbar)
 {
 	uint32_t mas0;
 
 	/* TLB1[0] is used to map the kernel. Save that entry. */
 	mas0 = MAS0_TLBSEL(1) | MAS0_ESEL(0);
 	mtspr(SPR_MAS0, mas0);
 	__asm __volatile("isync; tlbre");
 
 	tlb1[0].mas1 = mfspr(SPR_MAS1);
 	tlb1[0].mas2 = mfspr(SPR_MAS2);
 	tlb1[0].mas3 = mfspr(SPR_MAS3);
 
 	/* Map in CCSRBAR in TLB1[1] */
 	tlb1_idx = 1;
 	tlb1_set_entry(CCSRBAR_VA, ccsrbar, CCSRBAR_SIZE, _TLB_ENTRY_IO);
 
 	/* Setup TLB miss defaults */
 	set_mas4_defaults();
 }
 
 /*
  * Setup MAS4 defaults.
  * These values are loaded to MAS0-2 on a TLB miss.
  */
 static void
 set_mas4_defaults(void)
 {
 	uint32_t mas4;
 
 	/* Defaults: TLB0, PID0, TSIZED=4K */
 	mas4 = MAS4_TLBSELD0;
 	mas4 |= (TLB_SIZE_4K << MAS4_TSIZED_SHIFT) & MAS4_TSIZED_MASK;
 #ifdef SMP
 	mas4 |= MAS4_MD;
 #endif
 	mtspr(SPR_MAS4, mas4);
 	__asm __volatile("isync");
 }
 
 /*
  * Print out contents of the MAS registers for each TLB1 entry
  */
 void
 tlb1_print_tlbentries(void)
 {
 	uint32_t mas0, mas1, mas2, mas3, mas7;
 	int i;
 
 	debugf("TLB1 entries:\n");
 	for (i = 0; i < TLB1_ENTRIES; i++) {
 
 		mas0 = MAS0_TLBSEL(1) | MAS0_ESEL(i);
 		mtspr(SPR_MAS0, mas0);
 
 		__asm __volatile("isync; tlbre");
 
 		mas1 = mfspr(SPR_MAS1);
 		mas2 = mfspr(SPR_MAS2);
 		mas3 = mfspr(SPR_MAS3);
 		mas7 = mfspr(SPR_MAS7);
 
 		tlb_print_entry(i, mas1, mas2, mas3, mas7);
 	}
 }
 
 /*
  * Print out contents of the in-ram tlb1 table.
  */
 void
 tlb1_print_entries(void)
 {
 	int i;
 
 	debugf("tlb1[] table entries:\n");
 	for (i = 0; i < TLB1_ENTRIES; i++)
 		tlb_print_entry(i, tlb1[i].mas1, tlb1[i].mas2, tlb1[i].mas3, 0);
 }
 
 /*
  * Return 0 if the physical IO range is encompassed by one of the
  * TLB1 entries, otherwise return the related error code.
  */
 static int
 tlb1_iomapped(int i, vm_paddr_t pa, vm_size_t size, vm_offset_t *va)
 {
 	uint32_t prot;
 	vm_paddr_t pa_start;
 	vm_paddr_t pa_end;
 	unsigned int entry_tsize;
 	vm_size_t entry_size;
 
 	*va = (vm_offset_t)NULL;
 
 	/* Skip invalid entries */
 	if (!(tlb1[i].mas1 & MAS1_VALID))
 		return (EINVAL);
 
 	/*
 	 * The entry must be cache-inhibited, guarded, and r/w
 	 * so it can function as an i/o page
 	 */
 	prot = tlb1[i].mas2 & (MAS2_I | MAS2_G);
 	if (prot != (MAS2_I | MAS2_G))
 		return (EPERM);
 
 	prot = tlb1[i].mas3 & (MAS3_SR | MAS3_SW);
 	if (prot != (MAS3_SR | MAS3_SW))
 		return (EPERM);
 
 	/* The address should be within the entry range. */
 	entry_tsize = (tlb1[i].mas1 & MAS1_TSIZE_MASK) >> MAS1_TSIZE_SHIFT;
 	KASSERT((entry_tsize), ("tlb1_iomapped: invalid entry tsize"));
 
 	entry_size = tsize2size(entry_tsize);
 	pa_start = tlb1[i].mas3 & MAS3_RPN;
 	pa_end = pa_start + entry_size - 1;
 
 	if ((pa < pa_start) || ((pa + size) > pa_end))
 		return (ERANGE);
 
 	/* Return virtual address of this mapping. */
 	*va = (tlb1[i].mas2 & MAS2_EPN_MASK) + (pa - pa_start);
 	return (0);
 }
Index: head/sys/powerpc/include/_types.h
===================================================================
--- head/sys/powerpc/include/_types.h	(revision 222812)
+++ head/sys/powerpc/include/_types.h	(revision 222813)
@@ -1,158 +1,157 @@
 /*-
  * Copyright (c) 2002 Mike Barcroft <mike@FreeBSD.org>
  * Copyright (c) 1990, 1993
  *	The Regents of the University of California.  All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
  * 1. Redistributions of source code must retain the above copyright
  *    notice, this list of conditions and the following disclaimer.
  * 2. Redistributions in binary form must reproduce the above copyright
  *    notice, this list of conditions and the following disclaimer in the
  *    documentation and/or other materials provided with the distribution.
  * 3. All advertising materials mentioning features or use of this software
  *    must display the following acknowledgement:
  *	This product includes software developed by the University of
  *	California, Berkeley and its contributors.
  * 4. Neither the name of the University nor the names of its contributors
  *    may be used to endorse or promote products derived from this software
  *    without specific prior written permission.
  *
  * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
  * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
  * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
  * ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
  * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
  * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
  * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
  * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
  * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
  * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
  * SUCH DAMAGE.
  *
  *	From: @(#)ansi.h	8.2 (Berkeley) 1/4/94
  *	From: @(#)types.h	8.3 (Berkeley) 1/5/94
  * $FreeBSD$
  */
 
 #ifndef _MACHINE__TYPES_H_
 #define	_MACHINE__TYPES_H_
 
 #ifndef _SYS_CDEFS_H_
 #error this file needs sys/cdefs.h as a prerequisite
 #endif
 
 /*
  * Basic types upon which most other types are built.
  */
 typedef	__signed char		__int8_t;
 typedef	unsigned char		__uint8_t;
 typedef	short			__int16_t;
 typedef	unsigned short		__uint16_t;
 typedef	int			__int32_t;
 typedef	unsigned int		__uint32_t;
 #ifdef __LP64__
 typedef	long			__int64_t;
 typedef	unsigned long		__uint64_t;
 #else
 #ifndef lint
 __extension__
 #endif
 /* LONGLONG */
 typedef	long long		__int64_t;
 #ifndef lint
 __extension__
 #endif
 /* LONGLONG */
 typedef	unsigned long long	__uint64_t;
 #endif
 
 /*
  * Standard type definitions.
  */
 typedef	__uint32_t	__clock_t;		/* clock()... */
-typedef	unsigned int	__cpumask_t;
 typedef	double		__double_t;
 typedef	double		__float_t;
 #ifdef __LP64__
 typedef	__int64_t	__critical_t;
 typedef	__int64_t	__intfptr_t;
 typedef	__int64_t	__intptr_t;
 #else
 typedef	__int32_t	__critical_t;
 typedef	__int32_t	__intfptr_t;
 typedef	__int32_t	__intptr_t;
 #endif
 typedef	__int64_t	__intmax_t;
 typedef	__int32_t	__int_fast8_t;
 typedef	__int32_t	__int_fast16_t;
 typedef	__int32_t	__int_fast32_t;
 typedef	__int64_t	__int_fast64_t;
 typedef	__int8_t	__int_least8_t;
 typedef	__int16_t	__int_least16_t;
 typedef	__int32_t	__int_least32_t;
 typedef	__int64_t	__int_least64_t;
 #ifdef __LP64__
 typedef	__int64_t	__ptrdiff_t;		/* ptr1 - ptr2 */
 typedef	__int64_t	__register_t;
 typedef	__int64_t	__segsz_t;		/* segment size (in pages) */
 typedef	__uint64_t	__size_t;		/* sizeof() */
 typedef	__int64_t	__ssize_t;		/* byte count or error */
 typedef	__int64_t	__time_t;		/* time()... */
 typedef	__uint64_t	__uintfptr_t;
 typedef	__uint64_t	__uintptr_t;
 #else
 typedef	__int32_t	__ptrdiff_t;		/* ptr1 - ptr2 */
 typedef	__int32_t	__register_t;
 typedef	__int32_t	__segsz_t;		/* segment size (in pages) */
 typedef	__uint32_t	__size_t;		/* sizeof() */
 typedef	__int32_t	__ssize_t;		/* byte count or error */
 typedef	__int32_t	__time_t;		/* time()... */
 typedef	__uint32_t	__uintfptr_t;
 typedef	__uint32_t	__uintptr_t;
 #endif
 typedef	__uint64_t	__uintmax_t;
 typedef	__uint32_t	__uint_fast8_t;
 typedef	__uint32_t	__uint_fast16_t;
 typedef	__uint32_t	__uint_fast32_t;
 typedef	__uint64_t	__uint_fast64_t;
 typedef	__uint8_t	__uint_least8_t;
 typedef	__uint16_t	__uint_least16_t;
 typedef	__uint32_t	__uint_least32_t;
 typedef	__uint64_t	__uint_least64_t;
 #ifdef __LP64__
 typedef	__uint64_t	__u_register_t;
 typedef	__uint64_t	__vm_offset_t;
 typedef	__uint64_t	__vm_paddr_t;
 typedef	__uint64_t	__vm_size_t;
 #else
 typedef	__uint32_t	__u_register_t;
 typedef	__uint32_t	__vm_offset_t;
 typedef	__uint32_t	__vm_paddr_t;
 typedef	__uint32_t	__vm_size_t;
 #endif
 typedef	__int64_t	__vm_ooffset_t;
 typedef	__uint64_t	__vm_pindex_t;
 
 /*
  * Unusual type definitions.
  */
 #if defined(__GNUCLIKE_BUILTIN_VARARGS)
 typedef __builtin_va_list	__va_list;	/* internally known to gcc */
 #else
 typedef	struct {
 	char	__gpr;
 	char	__fpr;
 	char	__pad[2];
 	char	*__stack;
 	char	*__base;
 } __va_list;
 #endif /* post GCC 2.95 */
 #if defined(__GNUC_VA_LIST_COMPATIBILITY) && !defined(__GNUC_VA_LIST) \
     && !defined(__NO_GNUC_VA_LIST)
 #define __GNUC_VA_LIST
 typedef __va_list		__gnuc_va_list;	/* compatibility w/GNU headers*/
 #endif
 
 #endif /* !_MACHINE__TYPES_H_ */
Index: head/sys/powerpc/include/openpicvar.h
===================================================================
--- head/sys/powerpc/include/openpicvar.h	(revision 222812)
+++ head/sys/powerpc/include/openpicvar.h	(revision 222813)
@@ -1,69 +1,69 @@
 /*-
  * Copyright (C) 2002 Benno Rice.
  * All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
  * 1. Redistributions of source code must retain the above copyright
  *    notice, this list of conditions and the following disclaimer.
  * 2. Redistributions in binary form must reproduce the above copyright
  *    notice, this list of conditions and the following disclaimer in the
  *    documentation and/or other materials provided with the distribution.
  *
  * THIS SOFTWARE IS PROVIDED BY Benno Rice ``AS IS'' AND ANY EXPRESS OR
  * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
  * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
  * IN NO EVENT SHALL TOOLS GMBH BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
  * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
  * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
  * OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY,
  * WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR
  * OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF
  * ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
  *
  * $FreeBSD$
  */
 
 #ifndef	_POWERPC_OPENPICVAR_H_
 #define	_POWERPC_OPENPICVAR_H_
 
 #define OPENPIC_DEVSTR	"OpenPIC Interrupt Controller"
 
 #define OPENPIC_IRQMAX	256	/* h/w allows more */
 
 struct openpic_softc {
 	device_t	sc_dev;
 	struct resource	*sc_memr;
 	struct resource	*sc_intr;
 	bus_space_tag_t sc_bt;
 	bus_space_handle_t sc_bh;
 	char		*sc_version;
 	int		sc_rid;
 	int		sc_irq;
 	void		*sc_icookie;
 	u_int		sc_ncpu;
 	u_int		sc_nirq;
 	int		sc_psim;
 };
 
 extern devclass_t openpic_devclass;
 
 /*
  * Bus-independent attach i/f
  */
 int	openpic_common_attach(device_t, uint32_t);
 
 /*
  * PIC interface.
  */
-void	openpic_bind(device_t dev, u_int irq, cpumask_t cpumask);
+void	openpic_bind(device_t dev, u_int irq, cpuset_t cpumask);
 void	openpic_config(device_t, u_int, enum intr_trigger, enum intr_polarity);
 void	openpic_dispatch(device_t, struct trapframe *);
 void	openpic_enable(device_t, u_int, u_int);
 void	openpic_eoi(device_t, u_int);
 void	openpic_ipi(device_t, u_int);
 void	openpic_mask(device_t, u_int);
 void	openpic_unmask(device_t, u_int);
 
 #endif /* _POWERPC_OPENPICVAR_H_ */
Index: head/sys/powerpc/include/pmap.h
===================================================================
--- head/sys/powerpc/include/pmap.h	(revision 222812)
+++ head/sys/powerpc/include/pmap.h	(revision 222813)
@@ -1,252 +1,253 @@
 /*-
  * Copyright (C) 2006 Semihalf, Marian Balakowicz <m8@semihalf.com>
  * All rights reserved.
  *
  * Adapted for Freescale's e500 core CPUs.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
  * 1. Redistributions of source code must retain the above copyright
  *    notice, this list of conditions and the following disclaimer.
  * 2. Redistributions in binary form must reproduce the above copyright
  *    notice, this list of conditions and the following disclaimer in the
  *    documentation and/or other materials provided with the distribution.
  * 3. The name of the author may not be used to endorse or promote products
  *    derived from this software without specific prior written permission.
  *
  * THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR
  * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
  * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.  IN
  * NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
  * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED
  * TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
  * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
  * LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
  * NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
  * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
  *
  * $FreeBSD$
  */
 /*-
  * Copyright (C) 1995, 1996 Wolfgang Solfrank.
  * Copyright (C) 1995, 1996 TooLs GmbH.
  * All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
  * 1. Redistributions of source code must retain the above copyright
  *    notice, this list of conditions and the following disclaimer.
  * 2. Redistributions in binary form must reproduce the above copyright
  *    notice, this list of conditions and the following disclaimer in the
  *    documentation and/or other materials provided with the distribution.
  * 3. All advertising materials mentioning features or use of this software
  *    must display the following acknowledgement:
  *	This product includes software developed by TooLs GmbH.
  * 4. The name of TooLs GmbH may not be used to endorse or promote products
  *    derived from this software without specific prior written permission.
  *
  * THIS SOFTWARE IS PROVIDED BY TOOLS GMBH ``AS IS'' AND ANY EXPRESS OR
  * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
  * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
  * IN NO EVENT SHALL TOOLS GMBH BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
  * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
  * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
  * OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY,
  * WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR
  * OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF
  * ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
  *
  *	from: $NetBSD: pmap.h,v 1.17 2000/03/30 16:18:24 jdolecek Exp $
  */
 
 #ifndef	_MACHINE_PMAP_H_
 #define	_MACHINE_PMAP_H_
 
 #include <sys/queue.h>
 #include <sys/tree.h>
+#include <sys/_cpuset.h>
 #include <sys/_lock.h>
 #include <sys/_mutex.h>
 #include <machine/sr.h>
 #include <machine/pte.h>
 #include <machine/slb.h>
 #include <machine/tlb.h>
 
 struct pmap_md {
 	u_int		md_index;
 	vm_paddr_t      md_paddr;
 	vm_offset_t     md_vaddr;
 	vm_size_t       md_size;
 };
 
 #if defined(AIM)
 
 #if !defined(NPMAPS)
 #define	NPMAPS		32768
 #endif /* !defined(NPMAPS) */
 
 struct	slbtnode;
 
 struct	pmap {
 	struct	mtx	pm_mtx;
 	
     #ifdef __powerpc64__
 	struct slbtnode	*pm_slb_tree_root;
 	struct slb	**pm_slb;
 	int		pm_slb_len;
     #else
 	register_t	pm_sr[16];
     #endif
-	cpumask_t	pm_active;
+	cpuset_t	pm_active;
 
 	struct pmap	*pmap_phys;
 	struct		pmap_statistics	pm_stats;
 };
 
 typedef	struct pmap *pmap_t;
 
 struct pvo_entry {
 	LIST_ENTRY(pvo_entry) pvo_vlink;	/* Link to common virt page */
 	LIST_ENTRY(pvo_entry) pvo_olink;	/* Link to overflow entry */
 	union {
 		struct	pte pte;		/* 32 bit PTE */
 		struct	lpte lpte;		/* 64 bit PTE */
 	} pvo_pte;
 	pmap_t		pvo_pmap;		/* Owning pmap */
 	vm_offset_t	pvo_vaddr;		/* VA of entry */
 	uint64_t	pvo_vpn;		/* Virtual page number */
 };
 LIST_HEAD(pvo_head, pvo_entry);
 
 #define	PVO_PTEGIDX_MASK	0x007UL		/* which PTEG slot */
 #define	PVO_PTEGIDX_VALID	0x008UL		/* slot is valid */
 #define	PVO_WIRED		0x010UL		/* PVO entry is wired */
 #define	PVO_MANAGED		0x020UL		/* PVO entry is managed */
 #define	PVO_EXECUTABLE		0x040UL		/* PVO entry is executable */
 #define	PVO_BOOTSTRAP		0x080UL		/* PVO entry allocated during
 						   bootstrap */
 #define PVO_FAKE		0x100UL		/* fictitious phys page */
 #define PVO_LARGE		0x200UL		/* large page */
 #define	PVO_VADDR(pvo)		((pvo)->pvo_vaddr & ~ADDR_POFF)
 #define PVO_ISFAKE(pvo)		((pvo)->pvo_vaddr & PVO_FAKE)
 #define	PVO_PTEGIDX_GET(pvo)	((pvo)->pvo_vaddr & PVO_PTEGIDX_MASK)
 #define	PVO_PTEGIDX_ISSET(pvo)	((pvo)->pvo_vaddr & PVO_PTEGIDX_VALID)
 #define	PVO_PTEGIDX_CLR(pvo)	\
 	((void)((pvo)->pvo_vaddr &= ~(PVO_PTEGIDX_VALID|PVO_PTEGIDX_MASK)))
 #define	PVO_PTEGIDX_SET(pvo, i)	\
 	((void)((pvo)->pvo_vaddr |= (i)|PVO_PTEGIDX_VALID))
 #define	PVO_VSID(pvo)		((pvo)->pvo_vpn >> 16)
 
 struct	md_page {
 	u_int64_t	 mdpg_attrs;
 	vm_memattr_t	 mdpg_cache_attrs;
 	struct	pvo_head mdpg_pvoh;
 };
 
 #define	pmap_page_get_memattr(m)	((m)->md.mdpg_cache_attrs)
 #define	pmap_page_is_mapped(m)	(!LIST_EMPTY(&(m)->md.mdpg_pvoh))
 
 /*
  * Return the VSID corresponding to a given virtual address.
  * If no VSID is currently defined, it will allocate one, and add
  * it to a free slot if available.
  *
  * NB: The PMAP MUST be locked already.
  */
 uint64_t va_to_vsid(pmap_t pm, vm_offset_t va);
 
 /* Lock-free, non-allocating lookup routines */
 uint64_t kernel_va_to_slbv(vm_offset_t va);
 struct slb *user_va_to_slb_entry(pmap_t pm, vm_offset_t va);
 
 uint64_t allocate_user_vsid(pmap_t pm, uint64_t esid, int large);
 void	free_vsid(pmap_t pm, uint64_t esid, int large);
 void	slb_insert_user(pmap_t pm, struct slb *slb);
 void	slb_insert_kernel(uint64_t slbe, uint64_t slbv);
 
 struct slbtnode *slb_alloc_tree(void);
 void     slb_free_tree(pmap_t pm);
 struct slb **slb_alloc_user_cache(void);
 void	slb_free_user_cache(struct slb **);
 
 #else
 
 struct pmap {
 	struct mtx		pm_mtx;		/* pmap mutex */
 	tlbtid_t		pm_tid[MAXCPU];	/* TID to identify this pmap entries in TLB */
-	cpumask_t		pm_active;	/* active on cpus */
+	cpuset_t		pm_active;	/* active on cpus */
 	struct pmap_statistics	pm_stats;	/* pmap statistics */
 
 	/* Page table directory, array of pointers to page tables. */
 	pte_t			*pm_pdir[PDIR_NENTRIES];
 
 	/* List of allocated ptbl bufs (ptbl kva regions). */
 	TAILQ_HEAD(, ptbl_buf)	pm_ptbl_list;
 };
 typedef	struct pmap *pmap_t;
 
 struct pv_entry {
 	pmap_t pv_pmap;
 	vm_offset_t pv_va;
 	TAILQ_ENTRY(pv_entry) pv_link;
 };
 typedef struct pv_entry *pv_entry_t;
 
 struct md_page {
 	TAILQ_HEAD(, pv_entry) pv_list;
 };
 
 #define	pmap_page_get_memattr(m)	VM_MEMATTR_DEFAULT
 #define	pmap_page_is_mapped(m)	(!TAILQ_EMPTY(&(m)->md.pv_list))
 
 #endif /* AIM */
 
 extern	struct pmap kernel_pmap_store;
 #define	kernel_pmap	(&kernel_pmap_store)
 
 #ifdef _KERNEL
 
 #define	PMAP_LOCK(pmap)		mtx_lock(&(pmap)->pm_mtx)
 #define	PMAP_LOCK_ASSERT(pmap, type) \
 				mtx_assert(&(pmap)->pm_mtx, (type))
 #define	PMAP_LOCK_DESTROY(pmap)	mtx_destroy(&(pmap)->pm_mtx)
 #define	PMAP_LOCK_INIT(pmap)	mtx_init(&(pmap)->pm_mtx, "pmap", \
 				    NULL, MTX_DEF)
 #define	PMAP_LOCKED(pmap)	mtx_owned(&(pmap)->pm_mtx)
 #define	PMAP_MTX(pmap)		(&(pmap)->pm_mtx)
 #define	PMAP_TRYLOCK(pmap)	mtx_trylock(&(pmap)->pm_mtx)
 #define	PMAP_UNLOCK(pmap)	mtx_unlock(&(pmap)->pm_mtx)
 
 void		pmap_bootstrap(vm_offset_t, vm_offset_t);
 void		pmap_kenter(vm_offset_t va, vm_offset_t pa);
 void		pmap_kenter_attr(vm_offset_t va, vm_offset_t pa, vm_memattr_t);
 void		pmap_kremove(vm_offset_t);
 void		*pmap_mapdev(vm_offset_t, vm_size_t);
 void		*pmap_mapdev_attr(vm_offset_t, vm_size_t, vm_memattr_t);
 void		pmap_unmapdev(vm_offset_t, vm_size_t);
 void		pmap_page_set_memattr(vm_page_t, vm_memattr_t);
 void		pmap_deactivate(struct thread *);
 vm_offset_t	pmap_kextract(vm_offset_t);
 int		pmap_dev_direct_mapped(vm_offset_t, vm_size_t);
 boolean_t	pmap_mmu_install(char *name, int prio);
 
 #define	vtophys(va)	pmap_kextract((vm_offset_t)(va))
 
 #define PHYS_AVAIL_SZ	128
 extern	vm_offset_t phys_avail[PHYS_AVAIL_SZ];
 extern	vm_offset_t virtual_avail;
 extern	vm_offset_t virtual_end;
 
 extern	vm_offset_t msgbuf_phys;
 
 extern	int pmap_bootstrapped;
 
 extern vm_offset_t pmap_dumpsys_map(struct pmap_md *, vm_size_t, vm_size_t *);
 extern void pmap_dumpsys_unmap(struct pmap_md *, vm_size_t, vm_offset_t);
 
 extern struct pmap_md *pmap_scan_md(struct pmap_md *);
 
 #endif
 
 #endif /* !_MACHINE_PMAP_H_ */
Index: head/sys/powerpc/include/smp.h
===================================================================
--- head/sys/powerpc/include/smp.h	(revision 222812)
+++ head/sys/powerpc/include/smp.h	(revision 222813)
@@ -1,60 +1,62 @@
 /*-
  * Copyright (c) 2008 Marcel Moolenaar
  * All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
  *
  * 1. Redistributions of source code must retain the above copyright
  *    notice, this list of conditions and the following disclaimer.
  * 2. Redistributions in binary form must reproduce the above copyright
  *    notice, this list of conditions and the following disclaimer in the
  *    documentation and/or other materials provided with the distribution.
  *
  * THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR
  * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
  * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
  * IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT,
  * INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT
  * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
  * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
  * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
  * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF
  * THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
  *
  * $FreeBSD$
  */
 
 #ifndef _MACHINE_SMP_H_
 #define _MACHINE_SMP_H_
 
 #ifdef _KERNEL
 
 #define	IPI_AST			0
 #define	IPI_PREEMPT		1
 #define	IPI_RENDEZVOUS		2
 #define	IPI_STOP		3
 #define	IPI_STOP_HARD		3
 #define	IPI_HARDCLOCK		4
 
 #ifndef LOCORE
 
+#include <sys/_cpuset.h>
+
 void	ipi_all_but_self(int ipi);
 void	ipi_cpu(int cpu, u_int ipi);
-void	ipi_selected(cpumask_t cpus, int ipi);
+void	ipi_selected(cpuset_t cpus, int ipi);
 
 struct cpuref {
 	uintptr_t	cr_hwref;
 	u_int		cr_cpuid;
 };
 
 void	pmap_cpu_bootstrap(int);
 void	cpudep_ap_early_bootstrap(void);
 uintptr_t cpudep_ap_bootstrap(void);
 void	cpudep_ap_setup(void);
 void	machdep_ap_bootstrap(void);
 
 #endif /* !LOCORE */
 #endif /* _KERNEL */
 #endif /* !_MACHINE_SMP_H */
Index: head/sys/powerpc/mpc85xx/openpic_fdt.c
===================================================================
--- head/sys/powerpc/mpc85xx/openpic_fdt.c	(revision 222812)
+++ head/sys/powerpc/mpc85xx/openpic_fdt.c	(revision 222813)
@@ -1,92 +1,93 @@
 /*-
  * Copyright (c) 2009-2010 The FreeBSD Foundation
  * All rights reserved.
  *
  * This software was developed by Semihalf under sponsorship from
  * the FreeBSD Foundation.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
  * 1. Redistributions of source code must retain the above copyright
  *    notice, this list of conditions and the following disclaimer.
  * 2. Redistributions in binary form must reproduce the above copyright
  *    notice, this list of conditions and the following disclaimer in the
  *    documentation and/or other materials provided with the distribution.
  *
  * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
  * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
  * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
  * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
  * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
  * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
  * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
  * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
  * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
  * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
  * SUCH DAMAGE.
  */
 
 #include <sys/cdefs.h>
 __FBSDID("$FreeBSD$");
 
 #include <sys/param.h>
 #include <sys/kernel.h>
 #include <sys/module.h>
 #include <sys/bus.h>
 
 #include <machine/bus.h>
 #include <machine/intr_machdep.h>
-#include <machine/openpicvar.h>
 
 #include <dev/ofw/ofw_bus.h>
 #include <dev/ofw/ofw_bus_subr.h>
+
+#include <machine/openpicvar.h>
 
 #include "pic_if.h"
 
 static int openpic_fdt_probe(device_t);
 static int openpic_fdt_attach(device_t);
 
 static device_method_t openpic_fdt_methods[] = {
 	/* Device interface */
 	DEVMETHOD(device_probe,		openpic_fdt_probe),
 	DEVMETHOD(device_attach,	openpic_fdt_attach),
 
 	/* PIC interface */
 	DEVMETHOD(pic_bind,		openpic_bind),
 	DEVMETHOD(pic_config,		openpic_config),
 	DEVMETHOD(pic_dispatch,		openpic_dispatch),
 	DEVMETHOD(pic_enable,		openpic_enable),
 	DEVMETHOD(pic_eoi,		openpic_eoi),
 	DEVMETHOD(pic_ipi,		openpic_ipi),
 	DEVMETHOD(pic_mask,		openpic_mask),
 	DEVMETHOD(pic_unmask,		openpic_unmask),
 
 	{ 0, 0 },
 };
 
 static driver_t openpic_fdt_driver = {
 	"openpic",
 	openpic_fdt_methods,
 	sizeof(struct openpic_softc)
 };
 
 DRIVER_MODULE(openpic, simplebus, openpic_fdt_driver, openpic_devclass, 0, 0);
 
 static int
 openpic_fdt_probe(device_t dev)
 {
 
 	if (!ofw_bus_is_compatible(dev, "chrp,open-pic"))
 		return (ENXIO);
 		
 	device_set_desc(dev, OPENPIC_DEVSTR);
 	return (BUS_PROBE_DEFAULT);
 }
 
 static int
 openpic_fdt_attach(device_t dev)
 {
 
 	return (openpic_common_attach(dev, ofw_bus_get_node(dev)));
 }
Index: head/sys/powerpc/powerpc/intr_machdep.c
===================================================================
--- head/sys/powerpc/powerpc/intr_machdep.c	(revision 222812)
+++ head/sys/powerpc/powerpc/intr_machdep.c	(revision 222813)
@@ -1,554 +1,555 @@
 /*-
  * Copyright (c) 1991 The Regents of the University of California.
  * All rights reserved.
  *
  * This code is derived from software contributed to Berkeley by
  * William Jolitz.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
  * 1. Redistributions of source code must retain the above copyright
  *    notice, this list of conditions and the following disclaimer.
  * 2. Redistributions in binary form must reproduce the above copyright
  *    notice, this list of conditions and the following disclaimer in the
  *    documentation and/or other materials provided with the distribution.
  * 4. Neither the name of the University nor the names of its contributors
  *    may be used to endorse or promote products derived from this software
  *    without specific prior written permission.
  *
  * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
  * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
  * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
  * ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
  * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
  * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
  * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
  * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
  * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
  * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
  * SUCH DAMAGE.
  */
 /*-
  * Copyright (c) 2002 Benno Rice.
  * All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
  * 1. Redistributions of source code must retain the above copyright
  *    notice, this list of conditions and the following disclaimer.
  * 2. Redistributions in binary form must reproduce the above copyright
  *    notice, this list of conditions and the following disclaimer in the
  *    documentation and/or other materials provided with the distribution.
  *
  * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
  * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
  * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
  * ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
  * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
  * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
  * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
  * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
  * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
  * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
  * SUCH DAMAGE.
  *
  *	from: @(#)isa.c	7.2 (Berkeley) 5/13/91
  *	form: src/sys/i386/isa/intr_machdep.c,v 1.57 2001/07/20
  *
  * $FreeBSD$
  */
 
 #include "opt_isa.h"
 
 #include <sys/param.h>
 #include <sys/systm.h>
 #include <sys/kernel.h>
 #include <sys/queue.h>
 #include <sys/bus.h>
+#include <sys/cpuset.h>
 #include <sys/interrupt.h>
 #include <sys/ktr.h>
 #include <sys/lock.h>
 #include <sys/malloc.h>
 #include <sys/mutex.h>
 #include <sys/pcpu.h>
 #include <sys/smp.h>
 #include <sys/syslog.h>
 #include <sys/vmmeter.h>
 #include <sys/proc.h>
 
 #include <machine/frame.h>
 #include <machine/intr_machdep.h>
 #include <machine/md_var.h>
 #include <machine/smp.h>
 #include <machine/trap.h>
 
 #include "pic_if.h"
 
 #define	MAX_STRAY_LOG	5
 
 MALLOC_DEFINE(M_INTR, "intr", "interrupt handler data");
 
 struct powerpc_intr {
 	struct intr_event *event;
 	long	*cntp;
 	u_int	irq;
 	device_t pic;
 	u_int	intline;
 	u_int	vector;
 	u_int	cntindex;
-	cpumask_t cpu;
+	cpuset_t cpu;
 	enum intr_trigger trig;
 	enum intr_polarity pol;
 };
 
 struct pic {
 	device_t dev;
 	uint32_t node;
 	u_int	irqs;
 	u_int	ipis;
 	int	base;
 };
 
 static u_int intrcnt_index = 0;
 static struct mtx intr_table_lock;
 static struct powerpc_intr *powerpc_intrs[INTR_VECTORS];
 static struct pic piclist[MAX_PICS];
 static u_int nvectors;		/* Allocated vectors */
 static u_int npics;		/* PICs registered */
 #ifdef DEV_ISA
 static u_int nirqs = 16;	/* Allocated IRQS (ISA pre-allocated). */
 #else
 static u_int nirqs = 0;		/* Allocated IRQs. */
 #endif
 static u_int stray_count;
 
 device_t root_pic;
 
 #ifdef SMP
 static void *ipi_cookie;
 #endif
 
 static void
 intr_init(void *dummy __unused)
 {
 
 	mtx_init(&intr_table_lock, "intr sources lock", NULL, MTX_DEF);
 }
 SYSINIT(intr_init, SI_SUB_INTR, SI_ORDER_FIRST, intr_init, NULL);
 
 #ifdef SMP
 static void
 smp_intr_init(void *dummy __unused)
 {
 	struct powerpc_intr *i;
 	int vector;
 
 	for (vector = 0; vector < nvectors; vector++) {
 		i = powerpc_intrs[vector];
 		if (i != NULL && i->pic == root_pic)
 			PIC_BIND(i->pic, i->intline, i->cpu);
 	}
 }
 SYSINIT(smp_intr_init, SI_SUB_SMP, SI_ORDER_ANY, smp_intr_init, NULL);
 #endif
 
 static void
 intrcnt_setname(const char *name, int index)
 {
 
 	snprintf(intrnames + (MAXCOMLEN + 1) * index, MAXCOMLEN + 1, "%-*s",
 	    MAXCOMLEN, name);
 }
 
 void
 intrcnt_add(const char *name, u_long **countp)
 {
 	int idx;
 
 	idx = atomic_fetchadd_int(&intrcnt_index, 1);
 	*countp = &intrcnt[idx];
 	intrcnt_setname(name, idx);
 }
 
 static struct powerpc_intr *
 intr_lookup(u_int irq)
 {
 	char intrname[8];
 	struct powerpc_intr *i, *iscan;
 	int vector;
 
 	mtx_lock(&intr_table_lock);
 	for (vector = 0; vector < nvectors; vector++) {
 		i = powerpc_intrs[vector];
 		if (i != NULL && i->irq == irq) {
 			mtx_unlock(&intr_table_lock);
 			return (i);
 		}
 	}
 
 	i = malloc(sizeof(*i), M_INTR, M_NOWAIT);
 	if (i == NULL) {
 		mtx_unlock(&intr_table_lock);
 		return (NULL);
 	}
 
 	i->event = NULL;
 	i->cntp = NULL;
 	i->trig = INTR_TRIGGER_CONFORM;
 	i->pol = INTR_POLARITY_CONFORM;
 	i->irq = irq;
 	i->pic = NULL;
 	i->vector = -1;
 
 #ifdef SMP
 	i->cpu = all_cpus;
 #else
-	i->cpu = 1;
+	CPU_SETOF(0, &i->cpu);
 #endif
 
 	for (vector = 0; vector < INTR_VECTORS && vector <= nvectors;
 	    vector++) {
 		iscan = powerpc_intrs[vector];
 		if (iscan != NULL && iscan->irq == irq)
 			break;
 		if (iscan == NULL && i->vector == -1)
 			i->vector = vector;
 		iscan = NULL;
 	}
 
 	if (iscan == NULL && i->vector != -1) {
 		powerpc_intrs[i->vector] = i;
 		i->cntindex = atomic_fetchadd_int(&intrcnt_index, 1);
 		i->cntp = &intrcnt[i->cntindex];
 		sprintf(intrname, "irq%u:", i->irq);
 		intrcnt_setname(intrname, i->cntindex);
 		nvectors++;
 	}
 	mtx_unlock(&intr_table_lock);
 
 	if (iscan != NULL || i->vector == -1) {
 		free(i, M_INTR);
 		i = iscan;
 	}
 
 	return (i);
 }
 
 static int
 powerpc_map_irq(struct powerpc_intr *i)
 {
 	struct pic *p;
 	u_int cnt;
 	int idx;
 
 	for (idx = 0; idx < npics; idx++) {
 		p = &piclist[idx];
 		cnt = p->irqs + p->ipis;
 		if (i->irq >= p->base && i->irq < p->base + cnt)
 			break;
 	}
 	if (idx == npics)
 		return (EINVAL);
 
 	i->intline = i->irq - p->base;
 	i->pic = p->dev;
 
 	/* Try a best guess if that failed */
 	if (i->pic == NULL)
 		i->pic = root_pic;
 
 	return (0);
 }
 
 static void
 powerpc_intr_eoi(void *arg)
 {
 	struct powerpc_intr *i = arg;
 
 	PIC_EOI(i->pic, i->intline);
 }
 
 static void
 powerpc_intr_pre_ithread(void *arg)
 {
 	struct powerpc_intr *i = arg;
 
 	PIC_MASK(i->pic, i->intline);
 	PIC_EOI(i->pic, i->intline);
 }
 
 static void
 powerpc_intr_post_ithread(void *arg)
 {
 	struct powerpc_intr *i = arg;
 
 	PIC_UNMASK(i->pic, i->intline);
 }
 
 static int
 powerpc_assign_intr_cpu(void *arg, u_char cpu)
 {
 #ifdef SMP
 	struct powerpc_intr *i = arg;
 
 	if (cpu == NOCPU)
 		i->cpu = all_cpus;
 	else
-		i->cpu = 1 << cpu;
+		CPU_SETOF(cpu, &i->cpu);
 
 	if (!cold && i->pic != NULL && i->pic == root_pic)
 		PIC_BIND(i->pic, i->intline, i->cpu);
 
 	return (0);
 #else
 	return (EOPNOTSUPP);
 #endif
 }
 
 void
 powerpc_register_pic(device_t dev, uint32_t node, u_int irqs, u_int ipis,
     u_int atpic)
 {
 	struct pic *p;
 	u_int irq;
 	int idx;
 
 	mtx_lock(&intr_table_lock);
 
 	/* XXX see powerpc_get_irq(). */
 	for (idx = 0; idx < npics; idx++) {
 		p = &piclist[idx];
 		if (p->node != node)
 			continue;
 		if (node != 0 || p->dev == dev)
 			break;
 	}
 	p = &piclist[idx];
 
 	p->dev = dev;
 	p->node = node;
 	p->irqs = irqs;
 	p->ipis = ipis;
 	if (idx == npics) {
 #ifdef DEV_ISA
 		p->base = (atpic) ? 0 : nirqs;
 #else
 		p->base = nirqs;
 #endif
 		irq = p->base + irqs + ipis;
 		nirqs = MAX(nirqs, irq);
 		npics++;
 	}
 
 	mtx_unlock(&intr_table_lock);
 }
 
 u_int
 powerpc_get_irq(uint32_t node, u_int pin)
 {
 	int idx;
 
 	if (node == 0)
 		return (pin);
 
 	mtx_lock(&intr_table_lock);
 	for (idx = 0; idx < npics; idx++) {
 		if (piclist[idx].node == node) {
 			mtx_unlock(&intr_table_lock);
 			return (piclist[idx].base + pin);
 		}
 	}
 
 	/*
 	 * XXX we should never encounter an unregistered PIC, but that
 	 * can only be done when we properly support bus enumeration
 	 * using multiple passes. Until then, fake an entry and give it
 	 * some adhoc maximum number of IRQs and IPIs.
 	 */
 	piclist[idx].dev = NULL;
 	piclist[idx].node = node;
 	piclist[idx].irqs = 124;
 	piclist[idx].ipis = 4;
 	piclist[idx].base = nirqs;
 	nirqs += 128;
 	npics++;
 
 	mtx_unlock(&intr_table_lock);
 
 	return (piclist[idx].base + pin);
 }
 
 int
 powerpc_enable_intr(void)
 {
 	struct powerpc_intr *i;
 	int error, vector;
 #ifdef SMP
 	int n;
 #endif
 
 	if (npics == 0)
 		panic("no PIC detected\n");
 
 	if (root_pic == NULL)
 		root_pic = piclist[0].dev;
 
 #ifdef SMP
 	/* Install an IPI handler. */
 	if (mp_ncpus > 1) {
 		for (n = 0; n < npics; n++) {
 			if (piclist[n].dev != root_pic)
 				continue;
 
 			KASSERT(piclist[n].ipis != 0,
 			    ("%s: SMP root PIC does not supply any IPIs",
 			    __func__));
 			error = powerpc_setup_intr("IPI",
 			    MAP_IRQ(piclist[n].node, piclist[n].irqs),
 			    powerpc_ipi_handler, NULL, NULL,
 			    INTR_TYPE_MISC | INTR_EXCL, &ipi_cookie);
 			if (error) {
 				printf("unable to setup IPI handler\n");
 				return (error);
 			}
 		}
 	}
 #endif
 
 	for (vector = 0; vector < nvectors; vector++) {
 		i = powerpc_intrs[vector];
 		if (i == NULL)
 			continue;
 
 		error = powerpc_map_irq(i);
 		if (error)
 			continue;
 
 		if (i->trig != INTR_TRIGGER_CONFORM ||
 		    i->pol != INTR_POLARITY_CONFORM)
 			PIC_CONFIG(i->pic, i->intline, i->trig, i->pol);
 
 		if (i->event != NULL)
 			PIC_ENABLE(i->pic, i->intline, vector);
 	}
 
 	return (0);
 }
 
 int
 powerpc_setup_intr(const char *name, u_int irq, driver_filter_t filter,
     driver_intr_t handler, void *arg, enum intr_type flags, void **cookiep)
 {
 	struct powerpc_intr *i;
 	int error, enable = 0;
 
 	i = intr_lookup(irq);
 	if (i == NULL)
 		return (ENOMEM);
 
 	if (i->event == NULL) {
 		error = intr_event_create(&i->event, (void *)i, 0, irq,
 		    powerpc_intr_pre_ithread, powerpc_intr_post_ithread,
 		    powerpc_intr_eoi, powerpc_assign_intr_cpu, "irq%u:", irq);
 		if (error)
 			return (error);
 
 		enable = 1;
 	}
 
 	error = intr_event_add_handler(i->event, name, filter, handler, arg,
 	    intr_priority(flags), flags, cookiep);
 
 	mtx_lock(&intr_table_lock);
 	intrcnt_setname(i->event->ie_fullname, i->cntindex);
 	mtx_unlock(&intr_table_lock);
 
 	if (!cold) {
 		error = powerpc_map_irq(i);
 
 		if (!error && (i->trig != INTR_TRIGGER_CONFORM ||
 		    i->pol != INTR_POLARITY_CONFORM))
 			PIC_CONFIG(i->pic, i->intline, i->trig, i->pol);
 
 		if (!error && i->pic == root_pic)
 			PIC_BIND(i->pic, i->intline, i->cpu);
 
 		if (!error && enable)
 			PIC_ENABLE(i->pic, i->intline, i->vector);
 	}
 	return (error);
 }
 
 int
 powerpc_teardown_intr(void *cookie)
 {
 
 	return (intr_event_remove_handler(cookie));
 }
 
 #ifdef SMP
 int
 powerpc_bind_intr(u_int irq, u_char cpu)
 {
 	struct powerpc_intr *i;
 
 	i = intr_lookup(irq);
 	if (i == NULL)
 		return (ENOMEM);
 
 	return (intr_event_bind(i->event, cpu));
 }
 #endif
 
 int
 powerpc_config_intr(int irq, enum intr_trigger trig, enum intr_polarity pol)
 {
 	struct powerpc_intr *i;
 
 	i = intr_lookup(irq);
 	if (i == NULL)
 		return (ENOMEM);
 
 	i->trig = trig;
 	i->pol = pol;
 
 	if (!cold && i->pic != NULL)
 		PIC_CONFIG(i->pic, i->intline, trig, pol);
 
 	return (0);
 }
 
 void
 powerpc_dispatch_intr(u_int vector, struct trapframe *tf)
 {
 	struct powerpc_intr *i;
 	struct intr_event *ie;
 
 	i = powerpc_intrs[vector];
 	if (i == NULL)
 		goto stray;
 
 	(*i->cntp)++;
 
 	ie = i->event;
 	KASSERT(ie != NULL, ("%s: interrupt without an event", __func__));
 
 	if (intr_event_handle(ie, tf) != 0) {
 		goto stray;
 	}
 	return;
 
 stray:
 	stray_count++;
 	if (stray_count <= MAX_STRAY_LOG) {
 		printf("stray irq %d\n", i ? i->irq : -1);
 		if (stray_count >= MAX_STRAY_LOG) {
 			printf("got %d stray interrupts, not logging anymore\n",
 			    MAX_STRAY_LOG);
 		}
 	}
 	if (i != NULL)
 		PIC_MASK(i->pic, i->intline);
 }
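Note on the intr_machdep.c change above: the two assignments "i->cpu = 1" and
"i->cpu = 1 << cpu" become CPU_SETOF(0, &i->cpu) and CPU_SETOF(cpu, &i->cpu),
because a cpuset_t is a structure of bit words rather than a plain integer.
The sketch below illustrates that idea only. It does not use the kernel's
<sys/cpuset.h>; every demo_* name and DEMO_* constant is a hypothetical
stand-in, and DEMO_MAXCPU is an arbitrary value chosen for the example.

```c
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical stand-ins for illustration only; the real cpuset_t and the
 * CPU_* macros are provided by the kernel's <sys/cpuset.h>.
 */
#define DEMO_MAXCPU	64
#define DEMO_WORDBITS	(sizeof(unsigned long) * CHAR_BIT)
#define DEMO_NWORDS	((DEMO_MAXCPU + DEMO_WORDBITS - 1) / DEMO_WORDBITS)

typedef struct {
	unsigned long bits[DEMO_NWORDS];
} demo_cpuset_t;

/* Clear every CPU in the set. */
static void
demo_cpu_zero(demo_cpuset_t *set)
{

	memset(set, 0, sizeof(*set));
}

/* Add one CPU to the set, leaving the others untouched. */
static void
demo_cpu_set(size_t cpu, demo_cpuset_t *set)
{

	set->bits[cpu / DEMO_WORDBITS] |= 1UL << (cpu % DEMO_WORDBITS);
}

/* Make the set contain exactly one CPU (the role CPU_SETOF plays above). */
static void
demo_cpu_setof(size_t cpu, demo_cpuset_t *set)
{

	demo_cpu_zero(set);
	demo_cpu_set(cpu, set);
}

/* Test whether a CPU is a member of the set. */
static bool
demo_cpu_isset(size_t cpu, const demo_cpuset_t *set)
{

	return ((set->bits[cpu / DEMO_WORDBITS] >> (cpu % DEMO_WORDBITS)) & 1UL);
}

/* Count the CPUs in the set. */
static size_t
demo_cpu_count(const demo_cpuset_t *set)
{
	size_t cpu, n = 0;

	for (cpu = 0; cpu < DEMO_MAXCPU; cpu++)
		if (demo_cpu_isset(cpu, set))
			n++;
	return (n);
}
```

With an integer mask, "1 << cpu" stops working once cpu reaches the width of
the integer type; with a word array, the CPU index selects both a word and a
bit within it, so the same call works for any index below the configured
maximum. For example, demo_cpu_setof(40, &set) leaves exactly CPU 40 in the
set regardless of whether unsigned long is 32 or 64 bits wide.
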
Index: head/sys/powerpc/powerpc/mp_machdep.c
===================================================================
--- head/sys/powerpc/powerpc/mp_machdep.c	(revision 222812)
+++ head/sys/powerpc/powerpc/mp_machdep.c	(revision 222813)
@@ -1,374 +1,376 @@
 /*-
  * Copyright (c) 2008 Marcel Moolenaar
  * All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
  *
  * 1. Redistributions of source code must retain the above copyright
  *    notice, this list of conditions and the following disclaimer.
  * 2. Redistributions in binary form must reproduce the above copyright
  *    notice, this list of conditions and the following disclaimer in the
  *    documentation and/or other materials provided with the distribution.
  *
  * THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR
  * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
  * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
  * IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT,
  * INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT
  * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
  * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
  * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
  * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF
  * THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
  */
 
 #include <sys/cdefs.h>
 __FBSDID("$FreeBSD$");
 
 #include <sys/param.h>
 #include <sys/systm.h>
 #include <sys/kernel.h>
 #include <sys/ktr.h>
 #include <sys/bus.h>
+#include <sys/cpuset.h>
 #include <sys/lock.h>
 #include <sys/mutex.h>
 #include <sys/pcpu.h>
 #include <sys/proc.h>
 #include <sys/sched.h>
 #include <sys/smp.h>
 
 #include <vm/vm.h>
 #include <vm/vm_param.h>
 #include <vm/pmap.h>
 #include <vm/vm_map.h>
 #include <vm/vm_extern.h>
 #include <vm/vm_kern.h>
 
 #include <machine/bus.h>
 #include <machine/cpu.h>
 #include <machine/intr_machdep.h>
 #include <machine/pcb.h>
 #include <machine/platform.h>
 #include <machine/md_var.h>
 #include <machine/smp.h>
 
 #include "pic_if.h"
 
 extern struct pcpu __pcpu[MAXCPU];
 
 volatile static int ap_awake;
 volatile static u_int ap_letgo;
 volatile static u_quad_t ap_timebase;
 static u_int ipi_msg_cnt[32];
 static struct mtx ap_boot_mtx;
 struct pcb stoppcbs[MAXCPU];
 
 void
 machdep_ap_bootstrap(void)
 {
 	/* Set up important bits on the CPU (HID registers, etc.) */
 	cpudep_ap_setup();
 
 	/* Set PIR */
 	PCPU_SET(pir, mfspr(SPR_PIR));
 	PCPU_SET(awake, 1);
 	__asm __volatile("msync; isync");
 
 	while (ap_letgo == 0)
 		;
 
 	/* Initialize DEC and TB, sync with the BSP values */
 #ifdef __powerpc64__
 	/* Writing to the time base register is hypervisor-privileged */
 	if (mfmsr() & PSL_HV)
 		mttb(ap_timebase);
 #else
 	mttb(ap_timebase);
 #endif
 	decr_ap_init();
 
 	/* Serialize console output and AP count increment */
 	mtx_lock_spin(&ap_boot_mtx);
 	ap_awake++;
 	printf("SMP: AP CPU #%d launched\n", PCPU_GET(cpuid));
 	mtx_unlock_spin(&ap_boot_mtx);
 
 	/* Initialize curthread */
 	PCPU_SET(curthread, PCPU_GET(idlethread));
 	PCPU_SET(curpcb, curthread->td_pcb);
 
 	/* Start per-CPU event timers. */
 	cpu_initclocks_ap();
 
 	/* Announce ourselves awake, and enter the scheduler */
 	sched_throw(NULL);
 }
 
 void
 cpu_mp_setmaxid(void)
 {
 	struct cpuref cpuref;
 	int error;
 
 	mp_ncpus = 0;
 	error = platform_smp_first_cpu(&cpuref);
 	while (!error) {
 		mp_ncpus++;
 		error = platform_smp_next_cpu(&cpuref);
 	}
 	/* Sanity. */
 	if (mp_ncpus == 0)
 		mp_ncpus = 1;
 
 	/*
 	 * Set the largest cpuid we're going to use. This is necessary
 	 * for VM initialization.
 	 */
 	mp_maxid = min(mp_ncpus, MAXCPU) - 1;
 }
 
 int
 cpu_mp_probe(void)
 {
 
 	/*
 	 * We're not going to enable SMP if there's only 1 processor.
 	 */
 	return (mp_ncpus > 1);
 }
 
 void
 cpu_mp_start(void)
 {
 	struct cpuref bsp, cpu;
 	struct pcpu *pc;
 	int error;
 
 	error = platform_smp_get_bsp(&bsp);
 	KASSERT(error == 0, ("Don't know BSP"));
 	KASSERT(bsp.cr_cpuid == 0, ("%s: cpuid != 0", __func__));
 
 	error = platform_smp_first_cpu(&cpu);
 	while (!error) {
 		if (cpu.cr_cpuid >= MAXCPU) {
 			printf("SMP: cpu%d: skipped -- ID out of range\n",
 			    cpu.cr_cpuid);
 			goto next;
 		}
-		if (all_cpus & (1 << cpu.cr_cpuid)) {
+		if (CPU_ISSET(cpu.cr_cpuid, &all_cpus)) {
 			printf("SMP: cpu%d: skipped - duplicate ID\n",
 			    cpu.cr_cpuid);
 			goto next;
 		}
 		if (cpu.cr_cpuid != bsp.cr_cpuid) {
 			void *dpcpu;
 
 			pc = &__pcpu[cpu.cr_cpuid];
 			dpcpu = (void *)kmem_alloc(kernel_map, DPCPU_SIZE);
 			pcpu_init(pc, cpu.cr_cpuid, sizeof(*pc));
 			dpcpu_init(dpcpu, cpu.cr_cpuid);
 		} else {
 			pc = pcpup;
 			pc->pc_cpuid = bsp.cr_cpuid;
 			pc->pc_bsp = 1;
 		}
-		pc->pc_cpumask = 1 << pc->pc_cpuid;
+		CPU_SETOF(pc->pc_cpuid, &pc->pc_cpumask);
 		pc->pc_hwref = cpu.cr_hwref;
-		all_cpus |= pc->pc_cpumask;
+		CPU_OR(&all_cpus, &pc->pc_cpumask);
 next:
 		error = platform_smp_next_cpu(&cpu);
 	}
 }
 
 void
 cpu_mp_announce(void)
 {
 	struct pcpu *pc;
 	int i;
 
 	for (i = 0; i <= mp_maxid; i++) {
 		pc = pcpu_find(i);
 		if (pc == NULL)
 			continue;
 		printf("cpu%d: dev=%x", i, (int)pc->pc_hwref);
 		if (pc->pc_bsp)
 			printf(" (BSP)");
 		printf("\n");
 	}
 }
 
 static void
 cpu_mp_unleash(void *dummy)
 {
 	struct pcpu *pc;
 	int cpus, timeout;
 
 	if (mp_ncpus <= 1)
 		return;
 
 	mtx_init(&ap_boot_mtx, "ap boot", NULL, MTX_SPIN);
 
 	cpus = 0;
 	smp_cpus = 0;
 	STAILQ_FOREACH(pc, &cpuhead, pc_allcpu) {
 		cpus++;
-		pc->pc_other_cpus = all_cpus & ~pc->pc_cpumask;
+		pc->pc_other_cpus = all_cpus;
+		CPU_NAND(&pc->pc_other_cpus, &pc->pc_cpumask);
 		if (!pc->pc_bsp) {
 			if (bootverbose)
 				printf("Waking up CPU %d (dev=%x)\n",
 				    pc->pc_cpuid, (int)pc->pc_hwref);
 
 			platform_smp_start_cpu(pc);
 			
 			timeout = 2000;	/* wait 2sec for the AP */
 			while (!pc->pc_awake && --timeout > 0)
 				DELAY(1000);
 
 		} else {
 			PCPU_SET(pir, mfspr(SPR_PIR));
 			pc->pc_awake = 1;
 		}
 		if (pc->pc_awake) {
 			if (bootverbose)
 				printf("Adding CPU %d, pir=%x, awake=%x\n",
 				    pc->pc_cpuid, pc->pc_pir, pc->pc_awake);
 			smp_cpus++;
 		} else
-			stopped_cpus |= (1 << pc->pc_cpuid);
+			CPU_SET(pc->pc_cpuid, &stopped_cpus);
 	}
 
 	ap_awake = 1;
 
 	/* Provide our current DEC and TB values for APs */
 	ap_timebase = mftb() + 10;
 	__asm __volatile("msync; isync");
 	
 	/* Let APs continue */
 	atomic_store_rel_int(&ap_letgo, 1);
 
 #ifdef __powerpc64__
 	/* Writing to the time base register is hypervisor-privileged */
 	if (mfmsr() & PSL_HV)
 		mttb(ap_timebase);
 #else
 	mttb(ap_timebase);
 #endif
 
 	while (ap_awake < smp_cpus)
 		;
 
 	if (smp_cpus != cpus || cpus != mp_ncpus) {
 		printf("SMP: %d CPUs found; %d CPUs usable; %d CPUs woken\n",
 		    mp_ncpus, cpus, smp_cpus);
 	}
 
 	/* Let the APs get into the scheduler */
 	DELAY(10000);
 
 	smp_active = 1;
 	smp_started = 1;
 }
 
 SYSINIT(start_aps, SI_SUB_SMP, SI_ORDER_FIRST, cpu_mp_unleash, NULL);
 
 int
 powerpc_ipi_handler(void *arg)
 {
-	cpumask_t self;
+	cpuset_t self;
 	uint32_t ipimask;
 	int msg;
 
 	CTR2(KTR_SMP, "%s: MSR 0x%08x", __func__, mfmsr());
 
 	ipimask = atomic_readandclear_32(&(pcpup->pc_ipimask));
 	if (ipimask == 0)
 		return (FILTER_STRAY);
 	while ((msg = ffs(ipimask) - 1) != -1) {
 		ipimask &= ~(1u << msg);
 		ipi_msg_cnt[msg]++;
 		switch (msg) {
 		case IPI_AST:
 			CTR1(KTR_SMP, "%s: IPI_AST", __func__);
 			break;
 		case IPI_PREEMPT:
 			CTR1(KTR_SMP, "%s: IPI_PREEMPT", __func__);
 			sched_preempt(curthread);
 			break;
 		case IPI_RENDEZVOUS:
 			CTR1(KTR_SMP, "%s: IPI_RENDEZVOUS", __func__);
 			smp_rendezvous_action();
 			break;
 		case IPI_STOP:
 
 			/*
 			 * IPI_STOP_HARD is mapped to IPI_STOP so it is not
 			 * necessary to add such case in the switch.
 			 */
 			CTR1(KTR_SMP, "%s: IPI_STOP or IPI_STOP_HARD (stop)",
 			    __func__);
 			savectx(&stoppcbs[PCPU_GET(cpuid)]);
 			self = PCPU_GET(cpumask);
 			savectx(PCPU_GET(curpcb));
-			atomic_set_int(&stopped_cpus, self);
-			while ((started_cpus & self) == 0)
+			CPU_OR_ATOMIC(&stopped_cpus, &self);
+			while (!CPU_OVERLAP(&started_cpus, &self))
 				cpu_spinwait();
-			atomic_clear_int(&started_cpus, self);
-			atomic_clear_int(&stopped_cpus, self);
+			CPU_NAND_ATOMIC(&started_cpus, &self);
+			CPU_NAND_ATOMIC(&stopped_cpus, &self);
 			CTR1(KTR_SMP, "%s: IPI_STOP (restart)", __func__);
 			break;
 		case IPI_HARDCLOCK:
 			CTR1(KTR_SMP, "%s: IPI_HARDCLOCK", __func__);
 			hardclockintr();
 			break;
 		}
 	}
 
 	return (FILTER_HANDLED);
 }
 
 static void
 ipi_send(struct pcpu *pc, int ipi)
 {
 
 	CTR4(KTR_SMP, "%s: pc=%p, targetcpu=%d, IPI=%d", __func__,
 	    pc, pc->pc_cpuid, ipi);
 
 	atomic_set_32(&pc->pc_ipimask, (1 << ipi));
 	PIC_IPI(root_pic, pc->pc_cpuid);
 
 	CTR1(KTR_SMP, "%s: sent", __func__);
 }
 
 /* Send an IPI to a set of cpus. */
 void
-ipi_selected(cpumask_t cpus, int ipi)
+ipi_selected(cpuset_t cpus, int ipi)
 {
 	struct pcpu *pc;
 
 	STAILQ_FOREACH(pc, &cpuhead, pc_allcpu) {
-		if (cpus & pc->pc_cpumask)
+		if (CPU_OVERLAP(&cpus, &pc->pc_cpumask))
 			ipi_send(pc, ipi);
 	}
 }
 
 /* Send an IPI to a specific CPU. */
 void
 ipi_cpu(int cpu, u_int ipi)
 {
 
 	ipi_send(cpuid_to_pcpu[cpu], ipi);
 }
 
 /* Send an IPI to all CPUs EXCEPT myself. */
 void
 ipi_all_but_self(int ipi)
 {
 	struct pcpu *pc;
 
 	STAILQ_FOREACH(pc, &cpuhead, pc_allcpu) {
 		if (pc != pcpup)
 			ipi_send(pc, ipi);
 	}
 }
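Note on the mp_machdep.c conversion above: each integer mask expression is replaced by a set operation with the same effect. The block below is a minimal userland sketch of that mapping, not the kernel's <sys/cpuset.h>. The demo_* names, DEMO_MAXCPU, and the structure layout are assumptions made for illustration. demo_nand() follows the behaviour this diff implies for CPU_NAND (destination &= ~source), since it replaces "all_cpus & ~pc->pc_cpumask".

```c
#include <limits.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical userland stand-in for cpuset_t; not the kernel type. */
#define	DEMO_MAXCPU	64
#define	DEMO_NBITS	(sizeof(unsigned long) * CHAR_BIT)
#define	DEMO_NWORDS	((DEMO_MAXCPU + DEMO_NBITS - 1) / DEMO_NBITS)

typedef struct {
	unsigned long bits[DEMO_NWORDS];
} demo_cpuset_t;

/* Models CPU_SETOF: the set contains exactly CPU n afterwards. */
static void
demo_setof(unsigned n, demo_cpuset_t *p)
{
	memset(p, 0, sizeof(*p));
	p->bits[n / DEMO_NBITS] = 1UL << (n % DEMO_NBITS);
}

/* Models CPU_ISSET: nonzero if CPU n is a member. */
static int
demo_isset(unsigned n, const demo_cpuset_t *p)
{
	return (((p->bits[n / DEMO_NBITS] >> (n % DEMO_NBITS)) & 1UL) != 0);
}

/* Models CPU_OR: d |= s, word by word. */
static void
demo_or(demo_cpuset_t *d, const demo_cpuset_t *s)
{
	size_t i;

	for (i = 0; i < DEMO_NWORDS; i++)
		d->bits[i] |= s->bits[i];
}

/* Models CPU_NAND as used in this change: d &= ~s. */
static void
demo_nand(demo_cpuset_t *d, const demo_cpuset_t *s)
{
	size_t i;

	for (i = 0; i < DEMO_NWORDS; i++)
		d->bits[i] &= ~s->bits[i];
}

/* Models CPU_OVERLAP: nonzero if the two sets share a member. */
static int
demo_overlap(const demo_cpuset_t *a, const demo_cpuset_t *b)
{
	size_t i;

	for (i = 0; i < DEMO_NWORDS; i++)
		if ((a->bits[i] & b->bits[i]) != 0)
			return (1);
	return (0);
}

/*
 * Build {0 .. ncpus-1} minus self, the way cpu_mp_unleash() derives
 * pc_other_cpus: copy the full set, then remove the CPU's own mask.
 */
static demo_cpuset_t
demo_other_cpus(unsigned ncpus, unsigned self)
{
	demo_cpuset_t all, me, other;
	unsigned i;

	memset(&all, 0, sizeof(all));
	for (i = 0; i < ncpus; i++) {
		demo_setof(i, &me);
		demo_or(&all, &me);
	}
	demo_setof(self, &me);
	other = all;
	demo_nand(&other, &me);
	return (other);
}
```

With four CPUs and self == 2, demo_other_cpus() yields the members 0, 1 and 3, which matches what the old expression 0xf & ~(1 << 2) would have produced.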
Index: head/sys/powerpc/powerpc/openpic.c
===================================================================
--- head/sys/powerpc/powerpc/openpic.c	(revision 222812)
+++ head/sys/powerpc/powerpc/openpic.c	(revision 222813)
@@ -1,377 +1,382 @@
 /*-
  * Copyright (C) 2002 Benno Rice.
  * All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
  * 1. Redistributions of source code must retain the above copyright
  *    notice, this list of conditions and the following disclaimer.
  * 2. Redistributions in binary form must reproduce the above copyright
  *    notice, this list of conditions and the following disclaimer in the
  *    documentation and/or other materials provided with the distribution.
  *
  * THIS SOFTWARE IS PROVIDED BY Benno Rice ``AS IS'' AND ANY EXPRESS OR
  * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
  * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
  * IN NO EVENT SHALL TOOLS GMBH BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
  * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
  * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
  * OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY,
  * WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR
  * OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF
  * ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
  *
  * $FreeBSD$
  */
 
 #include <sys/param.h>
 #include <sys/systm.h>
 #include <sys/bus.h>
 #include <sys/conf.h>
 #include <sys/kernel.h>
 #include <sys/proc.h>
 #include <sys/rman.h>
 #include <sys/sched.h>
 
 #include <machine/bus.h>
 #include <machine/intr_machdep.h>
 #include <machine/md_var.h>
 #include <machine/pio.h>
 #include <machine/resource.h>
 
 #include <vm/vm.h>
 #include <vm/pmap.h>
 
 #include <machine/openpicreg.h>
 #include <machine/openpicvar.h>
 
 #include "pic_if.h"
 
 devclass_t openpic_devclass;
 
 /*
  * Local routines
  */
 static int openpic_intr(void *arg);
 
 static __inline uint32_t
 openpic_read(struct openpic_softc *sc, u_int reg)
 {
 	return (bus_space_read_4(sc->sc_bt, sc->sc_bh, reg));
 }
 
 static __inline void
 openpic_write(struct openpic_softc *sc, u_int reg, uint32_t val)
 {
 	bus_space_write_4(sc->sc_bt, sc->sc_bh, reg, val);
 }
 
 static __inline void
 openpic_set_priority(struct openpic_softc *sc, int pri)
 {
 	u_int tpr;
 	uint32_t x;
 
 	sched_pin();
 	tpr = OPENPIC_PCPU_TPR((sc->sc_dev == root_pic) ? PCPU_GET(cpuid) : 0);
 	x = openpic_read(sc, tpr);
 	x &= ~OPENPIC_TPR_MASK;
 	x |= pri;
 	openpic_write(sc, tpr, x);
 	sched_unpin();
 }
 
 int
 openpic_common_attach(device_t dev, uint32_t node)
 {
 	struct openpic_softc *sc;
 	u_int     cpu, ipi, irq;
 	u_int32_t x;
 
 	sc = device_get_softc(dev);
 	sc->sc_dev = dev;
 
 	sc->sc_rid = 0;
 	sc->sc_memr = bus_alloc_resource_any(dev, SYS_RES_MEMORY, &sc->sc_rid,
 	    RF_ACTIVE);
 
 	if (sc->sc_memr == NULL) {
 		device_printf(dev, "Could not alloc mem resource!\n");
 		return (ENXIO);
 	}
 
 	sc->sc_bt = rman_get_bustag(sc->sc_memr);
 	sc->sc_bh = rman_get_bushandle(sc->sc_memr);
 
 	/* Reset the PIC */
 	x = openpic_read(sc, OPENPIC_CONFIG);
 	x |= OPENPIC_CONFIG_RESET;
 	openpic_write(sc, OPENPIC_CONFIG, x);
 
 	while (openpic_read(sc, OPENPIC_CONFIG) & OPENPIC_CONFIG_RESET) {
 		powerpc_sync();
 		DELAY(100);
 	}
 
 	/* Check if this is a cascaded PIC */
 	sc->sc_irq = 0;
 	sc->sc_intr = NULL;
 	do {
 		struct resource_list *rl;
 
 		rl = BUS_GET_RESOURCE_LIST(device_get_parent(dev), dev);
 		if (rl == NULL)
 			break;
 		if (resource_list_find(rl, SYS_RES_IRQ, 0) == NULL)
 			break;
 
 		sc->sc_intr = bus_alloc_resource_any(dev, SYS_RES_IRQ,
 		    &sc->sc_irq, RF_ACTIVE);
 
 		/* XXX Cascaded PICs pass NULL trapframes! */
 		bus_setup_intr(dev, sc->sc_intr, INTR_TYPE_MISC | INTR_MPSAFE,
 		    openpic_intr, NULL, dev, &sc->sc_icookie);
 	} while (0);
 
 	/* Reset the PIC */
 	x = openpic_read(sc, OPENPIC_CONFIG);
 	x |= OPENPIC_CONFIG_RESET;
 	openpic_write(sc, OPENPIC_CONFIG, x);
 
 	while (openpic_read(sc, OPENPIC_CONFIG) & OPENPIC_CONFIG_RESET) {
 		powerpc_sync();
 		DELAY(100);
 	}
 
 	x = openpic_read(sc, OPENPIC_FEATURE);
 	switch (x & OPENPIC_FEATURE_VERSION_MASK) {
 	case 1:
 		sc->sc_version = "1.0";
 		break;
 	case 2:
 		sc->sc_version = "1.2";
 		break;
 	case 3:
 		sc->sc_version = "1.3";
 		break;
 	default:
 		sc->sc_version = "unknown";
 		break;
 	}
 
 	sc->sc_ncpu = ((x & OPENPIC_FEATURE_LAST_CPU_MASK) >>
 	    OPENPIC_FEATURE_LAST_CPU_SHIFT) + 1;
 	sc->sc_nirq = ((x & OPENPIC_FEATURE_LAST_IRQ_MASK) >>
 	    OPENPIC_FEATURE_LAST_IRQ_SHIFT) + 1;
 
 	/*
 	 * PSIM seems to report 1 too many IRQs and CPUs
 	 */
 	if (sc->sc_psim) {
 		sc->sc_nirq--;
 		sc->sc_ncpu--;
 	}
 
 	if (bootverbose)
 		device_printf(dev,
 		    "Version %s, supports %d CPUs and %d irqs\n",
 		    sc->sc_version, sc->sc_ncpu, sc->sc_nirq);
 
 	for (cpu = 0; cpu < sc->sc_ncpu; cpu++)
 		openpic_write(sc, OPENPIC_PCPU_TPR(cpu), 15);
 
 	/* Reset and disable all interrupts. */
 	for (irq = 0; irq < sc->sc_nirq; irq++) {
 		x = irq;                /* irq == vector. */
 		x |= OPENPIC_IMASK;
 		x |= OPENPIC_POLARITY_NEGATIVE;
 		x |= OPENPIC_SENSE_LEVEL;
 		x |= 8 << OPENPIC_PRIORITY_SHIFT;
 		openpic_write(sc, OPENPIC_SRC_VECTOR(irq), x);
 	}
 
 	/* Reset and disable all IPIs. */
 	for (ipi = 0; ipi < 4; ipi++) {
 		x = sc->sc_nirq + ipi;
 		x |= OPENPIC_IMASK;
 		x |= 15 << OPENPIC_PRIORITY_SHIFT;
 		openpic_write(sc, OPENPIC_IPI_VECTOR(ipi), x);
 	}
 
 	/* we don't need 8259 passthrough mode */
 	x = openpic_read(sc, OPENPIC_CONFIG);
 	x |= OPENPIC_CONFIG_8259_PASSTHRU_DISABLE;
 	openpic_write(sc, OPENPIC_CONFIG, x);
 
 	/* send all interrupts to cpu 0 */
 	for (irq = 0; irq < sc->sc_nirq; irq++)
 		openpic_write(sc, OPENPIC_IDEST(irq), 1 << 0);
 
 	/* clear all pending interrupts from cpu 0 */
 	for (irq = 0; irq < sc->sc_nirq; irq++) {
 		(void)openpic_read(sc, OPENPIC_PCPU_IACK(0));
 		openpic_write(sc, OPENPIC_PCPU_EOI(0), 0);
 	}
 
 	for (cpu = 0; cpu < sc->sc_ncpu; cpu++)
 		openpic_write(sc, OPENPIC_PCPU_TPR(cpu), 0);
 
 	powerpc_register_pic(dev, node, sc->sc_nirq, 4, FALSE);
 
 	/* If this is not a cascaded PIC, it must be the root PIC */
 	if (sc->sc_intr == NULL)
 		root_pic = dev;
 
 	return (0);
 }
 
 /*
  * PIC I/F methods
  */
 
 void
-openpic_bind(device_t dev, u_int irq, cpumask_t cpumask)
+openpic_bind(device_t dev, u_int irq, cpuset_t cpumask)
 {
 	struct openpic_softc *sc;
 
 	/* If we aren't directly connected to the CPU, this won't work */
 	if (dev != root_pic)
 		return;
 
 	sc = device_get_softc(dev);
-	openpic_write(sc, OPENPIC_IDEST(irq), cpumask);
+
+	/*
+	 * XXX: openpic_write() takes only a 32-bit mask, so for the moment
+	 * play dirty and pass just the low 32 bits of the first word.
+	 */
+	openpic_write(sc, OPENPIC_IDEST(irq), cpumask.__bits[0] & 0xffffffff);
 }
 
 void
 openpic_config(device_t dev, u_int irq, enum intr_trigger trig,
     enum intr_polarity pol)
 {
 	struct openpic_softc *sc;
 	uint32_t x;
 
 	sc = device_get_softc(dev);
 	x = openpic_read(sc, OPENPIC_SRC_VECTOR(irq));
 	if (pol == INTR_POLARITY_LOW)
 		x &= ~OPENPIC_POLARITY_POSITIVE;
 	else
 		x |= OPENPIC_POLARITY_POSITIVE;
 	if (trig == INTR_TRIGGER_EDGE)
 		x &= ~OPENPIC_SENSE_LEVEL;
 	else
 		x |= OPENPIC_SENSE_LEVEL;
 	openpic_write(sc, OPENPIC_SRC_VECTOR(irq), x);
 }
 
 static int
 openpic_intr(void *arg)
 {
 	device_t dev = (device_t)(arg);
 
 	/* XXX Cascaded PICs do not pass non-NULL trapframes! */
 	openpic_dispatch(dev, NULL);
 
 	return (FILTER_HANDLED);
 }
 
 void
 openpic_dispatch(device_t dev, struct trapframe *tf)
 {
 	struct openpic_softc *sc;
 	u_int cpuid, vector;
 
 	CTR1(KTR_INTR, "%s: got interrupt", __func__);
 
 	cpuid = (dev == root_pic) ? PCPU_GET(cpuid) : 0;
 
 	sc = device_get_softc(dev);
 	while (1) {
 		vector = openpic_read(sc, OPENPIC_PCPU_IACK(cpuid));
 		vector &= OPENPIC_VECTOR_MASK;
 		if (vector == 255)
 			break;
 		powerpc_dispatch_intr(vector, tf);
 	}
 }
 
 void
 openpic_enable(device_t dev, u_int irq, u_int vector)
 {
 	struct openpic_softc *sc;
 	uint32_t x;
 
 	sc = device_get_softc(dev);
 	if (irq < sc->sc_nirq) {
 		x = openpic_read(sc, OPENPIC_SRC_VECTOR(irq));
 		x &= ~(OPENPIC_IMASK | OPENPIC_VECTOR_MASK);
 		x |= vector;
 		openpic_write(sc, OPENPIC_SRC_VECTOR(irq), x);
 	} else {
 		x = openpic_read(sc, OPENPIC_IPI_VECTOR(0));
 		x &= ~(OPENPIC_IMASK | OPENPIC_VECTOR_MASK);
 		x |= vector;
 		openpic_write(sc, OPENPIC_IPI_VECTOR(0), x);
 	}
 }
 
 void
 openpic_eoi(device_t dev, u_int irq __unused)
 {
 	struct openpic_softc *sc;
 	u_int cpuid;
 
 	cpuid = (dev == root_pic) ? PCPU_GET(cpuid) : 0;
 
 	sc = device_get_softc(dev);
 	openpic_write(sc, OPENPIC_PCPU_EOI(cpuid), 0);
 }
 
 void
 openpic_ipi(device_t dev, u_int cpu)
 {
 	struct openpic_softc *sc;
 
 	KASSERT(dev == root_pic, ("Cannot send IPIs from non-root OpenPIC"));
 
 	sc = device_get_softc(dev);
 	sched_pin();
 	openpic_write(sc, OPENPIC_PCPU_IPI_DISPATCH(PCPU_GET(cpuid), 0),
 	    1u << cpu);
 	sched_unpin();
 }
 
 void
 openpic_mask(device_t dev, u_int irq)
 {
 	struct openpic_softc *sc;
 	uint32_t x;
 
 	sc = device_get_softc(dev);
 	if (irq < sc->sc_nirq) {
 		x = openpic_read(sc, OPENPIC_SRC_VECTOR(irq));
 		x |= OPENPIC_IMASK;
 		openpic_write(sc, OPENPIC_SRC_VECTOR(irq), x);
 	} else {
 		x = openpic_read(sc, OPENPIC_IPI_VECTOR(0));
 		x |= OPENPIC_IMASK;
 		openpic_write(sc, OPENPIC_IPI_VECTOR(0), x);
 	}
 }
 
 void
 openpic_unmask(device_t dev, u_int irq)
 {
 	struct openpic_softc *sc;
 	uint32_t x;
 
 	sc = device_get_softc(dev);
 	if (irq < sc->sc_nirq) {
 		x = openpic_read(sc, OPENPIC_SRC_VECTOR(irq));
 		x &= ~OPENPIC_IMASK;
 		openpic_write(sc, OPENPIC_SRC_VECTOR(irq), x);
 	} else {
 		x = openpic_read(sc, OPENPIC_IPI_VECTOR(0));
 		x &= ~OPENPIC_IMASK;
 		openpic_write(sc, OPENPIC_IPI_VECTOR(0), x);
 	}
 }
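Note on openpic_bind() above: the destination register write receives only the low 32 bits of the first word of the set, so any CPU numbered 32 or higher is silently dropped. The sketch below models that truncation in userland and adds a check a caller could use to detect it. The pic_* names and the structure layout are hypothetical and are not part of the driver or of <sys/cpuset.h>.

```c
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userland stand-in for cpuset_t; not the kernel type. */
#define	PIC_MAXCPU	64
#define	PIC_NBITS	(sizeof(unsigned long) * CHAR_BIT)
#define	PIC_NWORDS	((PIC_MAXCPU + PIC_NBITS - 1) / PIC_NBITS)

typedef struct {
	unsigned long bits[PIC_NWORDS];
} pic_cpuset_t;

/* Add CPU n to the set. */
static void
pic_set(unsigned n, pic_cpuset_t *p)
{
	p->bits[n / PIC_NBITS] |= 1UL << (n % PIC_NBITS);
}

/* The value the diff passes on: low 32 bits of the first word. */
static uint32_t
pic_idest_mask(const pic_cpuset_t *p)
{
	return ((uint32_t)(p->bits[0] & 0xffffffffUL));
}

/*
 * Nonzero if the set names a CPU that the 32-bit value cannot carry.
 * The double 16-bit shift stays well defined when long is 32 bits wide.
 */
static int
pic_idest_truncates(const pic_cpuset_t *p)
{
	size_t i;

	if ((p->bits[0] >> 16 >> 16) != 0)
		return (1);
	for (i = 1; i < PIC_NWORDS; i++)
		if (p->bits[i] != 0)
			return (1);
	return (0);
}
```

A set holding CPUs 0 and 2 gives a mask of 0x5 with no truncation. Adding CPU 40 leaves the mask at 0x5 while pic_idest_truncates() starts returning nonzero.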
Index: head/sys/powerpc/powerpc/pic_if.m
===================================================================
--- head/sys/powerpc/powerpc/pic_if.m	(revision 222812)
+++ head/sys/powerpc/powerpc/pic_if.m	(revision 222813)
@@ -1,78 +1,79 @@
 #-
 # Copyright (c) 1998 Doug Rabson
 # All rights reserved.
 #
 # Redistribution and use in source and binary forms, with or without
 # modification, are permitted provided that the following conditions
 # are met:
 # 1. Redistributions of source code must retain the above copyright
 #    notice, this list of conditions and the following disclaimer.
 # 2. Redistributions in binary form must reproduce the above copyright
 #    notice, this list of conditions and the following disclaimer in the
 #    documentation and/or other materials provided with the distribution.
 #
 # THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
 # ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
 # IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
 # ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
 # FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
 # DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
 # OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
 # HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
 # LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
 # OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
 # SUCH DAMAGE.
 #
 # from: src/sys/kern/bus_if.m,v 1.21 2002/04/21 11:16:10 markm Exp
 # $FreeBSD$
 #
 
 #include <sys/bus.h>
+#include <sys/cpuset.h>
 #include <machine/frame.h>
 
 INTERFACE pic;
 
 METHOD void bind {
 	device_t	dev;
 	u_int		irq;
-	cpumask_t	cpumask;
+	cpuset_t	cpumask;
 };
 
 METHOD void config {
 	device_t	dev;
 	u_int		irq;
 	enum intr_trigger trig;
 	enum intr_polarity pol;
 };
 
 METHOD void dispatch {
 	device_t	dev;
 	struct trapframe *tf;
 };
 
 METHOD void enable {
 	device_t	dev;
 	u_int		irq;
 	u_int		vector;
 };
 
 METHOD void eoi {
 	device_t	dev;
 	u_int		irq;
 };
 
 METHOD void ipi {
 	device_t	dev;
 	u_int		cpu;
 };
 
 METHOD void mask {
 	device_t	dev;
 	u_int		irq;
 };
 
 METHOD void unmask {
 	device_t	dev;
 	u_int		irq;
 };
 
Index: head/sys/sparc64/include/_types.h
===================================================================
--- head/sys/sparc64/include/_types.h	(revision 222812)
+++ head/sys/sparc64/include/_types.h	(revision 222813)
@@ -1,111 +1,110 @@
 /*-
  * Copyright (c) 2002 Mike Barcroft <mike@FreeBSD.org>
  * Copyright (c) 1990, 1993
  *	The Regents of the University of California.  All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
  * 1. Redistributions of source code must retain the above copyright
  *    notice, this list of conditions and the following disclaimer.
  * 2. Redistributions in binary form must reproduce the above copyright
  *    notice, this list of conditions and the following disclaimer in the
  *    documentation and/or other materials provided with the distribution.
  * 4. Neither the name of the University nor the names of its contributors
  *    may be used to endorse or promote products derived from this software
  *    without specific prior written permission.
  *
  * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
  * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
  * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
  * ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
  * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
  * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
  * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
  * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
  * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
  * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
  * SUCH DAMAGE.
  *
  *	From: @(#)ansi.h	8.2 (Berkeley) 1/4/94
  *	From: @(#)types.h	8.3 (Berkeley) 1/5/94
  * $FreeBSD$
  */
 
 #ifndef _MACHINE__TYPES_H_
 #define	_MACHINE__TYPES_H_
 
 #ifndef _SYS_CDEFS_H_
 #error this file needs sys/cdefs.h as a prerequisite
 #endif
 
 /*
  * Basic types upon which most other types are built.
  */
 typedef	__signed char		__int8_t;
 typedef	unsigned char		__uint8_t;
 typedef	short			__int16_t;
 typedef	unsigned short		__uint16_t;
 typedef	int			__int32_t;
 typedef	unsigned int		__uint32_t;
 typedef	long			__int64_t;
 typedef	unsigned long		__uint64_t;
 
 /*
  * Standard type definitions.
  */
 typedef	__int32_t	__clock_t;		/* clock()... */
-typedef	unsigned int	__cpumask_t;
 typedef	__int64_t	__critical_t;
 typedef	double		__double_t;
 typedef	float		__float_t;
 typedef	__int64_t	__intfptr_t;
 typedef	__int64_t	__intmax_t;
 typedef	__int64_t	__intptr_t;
 typedef	__int32_t	__int_fast8_t;
 typedef	__int32_t	__int_fast16_t;
 typedef	__int32_t	__int_fast32_t;
 typedef	__int64_t	__int_fast64_t;
 typedef	__int8_t	__int_least8_t;
 typedef	__int16_t	__int_least16_t;
 typedef	__int32_t	__int_least32_t;
 typedef	__int64_t	__int_least64_t;
 typedef	__int64_t	__ptrdiff_t;		/* ptr1 - ptr2 */
 typedef	__int64_t	__register_t;
 typedef	__int64_t	__segsz_t;		/* segment size (in pages) */
 typedef	__uint64_t	__size_t;		/* sizeof() */
 typedef	__int64_t	__ssize_t;		/* byte count or error */
 typedef	__int64_t	__time_t;		/* time()... */
 typedef	__uint64_t	__uintfptr_t;
 typedef	__uint64_t	__uintmax_t;
 typedef	__uint64_t	__uintptr_t;
 typedef	__uint32_t	__uint_fast8_t;
 typedef	__uint32_t	__uint_fast16_t;
 typedef	__uint32_t	__uint_fast32_t;
 typedef	__uint64_t	__uint_fast64_t;
 typedef	__uint8_t	__uint_least8_t;
 typedef	__uint16_t	__uint_least16_t;
 typedef	__uint32_t	__uint_least32_t;
 typedef	__uint64_t	__uint_least64_t;
 typedef	__uint64_t	__u_register_t;
 typedef	__uint64_t	__vm_offset_t;
 typedef	__int64_t	__vm_ooffset_t;
 typedef	__uint64_t	__vm_paddr_t;
 typedef	__uint64_t	__vm_pindex_t;
 typedef	__uint64_t	__vm_size_t;
 
 /*
  * Unusual type definitions.
  */
 #ifdef __GNUCLIKE_BUILTIN_VARARGS
 typedef __builtin_va_list	__va_list;	/* internally known to gcc */
 #else
 typedef	char *			__va_list;
 #endif /* __GNUCLIKE_BUILTIN_VARARGS */
 #if defined(__GNUCLIKE_BUILTIN_VAALIST) && !defined(__GNUC_VA_LIST) \
     && !defined(__NO_GNUC_VA_LIST)
 #define __GNUC_VA_LIST
 typedef __va_list		__gnuc_va_list;	/* compatibility w/GNU headers*/
 #endif
 
 #endif /* !_MACHINE__TYPES_H_ */
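Note on the removed __cpumask_t typedef above: on sparc64 it was an unsigned int, so it could name at most as many CPUs as that type has bits. The sketch below shows, under assumptions, how a legacy mask maps into a wider word-array set and when the reverse conversion loses members. legacy_mask_t, wide_cpuset_t, and the wide_* helpers are illustrative names only and do not exist in the tree.

```c
#include <limits.h>
#include <stddef.h>
#include <string.h>

#define	WIDE_MAXCPU	64
#define	WIDE_NBITS	(sizeof(unsigned long) * CHAR_BIT)
#define	WIDE_NWORDS	((WIDE_MAXCPU + WIDE_NBITS - 1) / WIDE_NBITS)
#define	LEGACY_NBITS	(sizeof(unsigned int) * CHAR_BIT)

typedef unsigned int legacy_mask_t;	/* models the removed typedef */

typedef struct {
	unsigned long bits[WIDE_NWORDS];
} wide_cpuset_t;			/* hypothetical cpuset_t stand-in */

/* Copy every CPU named in a legacy mask into a wide set. */
static void
wide_from_mask(legacy_mask_t mask, wide_cpuset_t *set)
{
	unsigned n;

	memset(set, 0, sizeof(*set));
	for (n = 0; n < LEGACY_NBITS; n++)
		if ((mask >> n) & 1u)
			set->bits[n / WIDE_NBITS] |= 1UL << (n % WIDE_NBITS);
}

/*
 * Convert back to a legacy mask.  Returns 1 and stores the mask when
 * every member fits, or 0 (leaving *mask untouched) when some CPU id
 * is too large for the narrow type.
 */
static int
wide_to_mask(const wide_cpuset_t *set, legacy_mask_t *mask)
{
	legacy_mask_t m = 0;
	unsigned n;

	for (n = 0; n < WIDE_MAXCPU; n++) {
		if (((set->bits[n / WIDE_NBITS] >> (n % WIDE_NBITS)) & 1UL) == 0)
			continue;
		if (n >= LEGACY_NBITS)
			return (0);
		m |= 1u << n;
	}
	*mask = m;
	return (1);
}
```

Any legacy mask round-trips unchanged through the wide set. Once a CPU id at or above LEGACY_NBITS is added, the reverse conversion reports failure rather than dropping the member.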
Index: head/sys/sparc64/include/ktr.h
===================================================================
--- head/sys/sparc64/include/ktr.h	(revision 222812)
+++ head/sys/sparc64/include/ktr.h	(revision 222813)
@@ -1,93 +1,95 @@
 /*-
  * Copyright (c) 1996 Berkeley Software Design, Inc. All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
  * 1. Redistributions of source code must retain the above copyright
  *    notice, this list of conditions and the following disclaimer.
  * 2. Redistributions in binary form must reproduce the above copyright
  *    notice, this list of conditions and the following disclaimer in the
  *    documentation and/or other materials provided with the distribution.
  * 3. Berkeley Software Design Inc's name may not be used to endorse or
  *    promote products derived from this software without specific prior
  *    written permission.
  *
  * THIS SOFTWARE IS PROVIDED BY BERKELEY SOFTWARE DESIGN INC ``AS IS'' AND
  * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
  * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
  * ARE DISCLAIMED.  IN NO EVENT SHALL BERKELEY SOFTWARE DESIGN INC BE LIABLE
  * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
  * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
  * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
  * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
  * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
  * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
  * SUCH DAMAGE.
  *
  *	from BSDI $Id: ktr.h,v 1.10.2.7 2000/03/16 21:44:42 cp Exp $
  * $FreeBSD$
  */
 
 #ifndef _MACHINE_KTR_H_
 #define _MACHINE_KTR_H_
 
 #include <sys/ktr.h>
 
 #ifndef LOCORE
 
 #define	KTR_CPU	PCPU_GET(mid)
 
 #else
 
 #define	AND(var, mask, r1, r2) \
 	SET(var, r2, r1) ; \
 	lduw	[r1], r2 ; \
 	and	r2, mask, r1
 
 #define	TEST(var, mask, r1, r2, l1) \
 	AND(var, mask, r1, r2) ; \
 	brz	r1, l1 ## f ; \
 	 nop
 
 /*
  * XXX could really use another register...
  */
 #define	ATR(desc, r1, r2, r3, l1, l2) \
 	.sect	.rodata ; \
 l1:	.asciz	desc ; \
 	.previous ; \
 	SET(ktr_idx, r2, r1) ; \
 	lduw	[r1], r2 ; \
 l2:	add	r2, 1, r3 ; \
 	set	KTR_ENTRIES - 1, r1 ; \
 	and	r3, r1, r3 ; \
 	set	ktr_idx, r1 ; \
 	casa	[r1] ASI_N, r2, r3 ; \
 	cmp	r2, r3 ; \
 	bne	%icc, l2 ## b ; \
 	 mov	r3, r2 ; \
 	SET(ktr_buf, r3, r1) ; \
 	mulx	r2, KTR_SIZEOF, r2 ; \
 	add	r1, r2, r1 ; \
 	rd	%tick, r2 ; \
 	stx	r2, [r1 + KTR_TIMESTAMP] ; \
 	lduw	[PCPU(MID)], r2 ; \
 	stw	r2, [r1 + KTR_CPU] ; \
 	stw	%g0, [r1 + KTR_LINE] ; \
 	stx	%g0, [r1 + KTR_FILE] ; \
 	SET(l1 ## b, r3, r2) ; \
 	stx	r2, [r1 + KTR_DESC]
 
 #define CATR(mask, desc, r1, r2, r3, l1, l2, l3) \
 	set	mask, r1 ; \
 	TEST(ktr_mask, r1, r2, r2, l3) ; \
 	lduw	[PCPU(MID)], r1 ; \
 	mov	1, r2 ; \
 	sllx	r2, r1, r1 ; \
+#ifdef notyet \
 	TEST(ktr_cpumask, r1, r2, r3, l3) ; \
+#endif \
 	ATR(desc, r1, r2, r3, l1, l2)
 
 #endif /* LOCORE */
 
 #endif /* !_MACHINE_KTR_H_ */
Index: head/sys/sparc64/include/pmap.h
===================================================================
--- head/sys/sparc64/include/pmap.h	(revision 222812)
+++ head/sys/sparc64/include/pmap.h	(revision 222813)
@@ -1,129 +1,130 @@
 /*-
  * Copyright (c) 1991 Regents of the University of California.
  * All rights reserved.
  *
  * This code is derived from software contributed to Berkeley by
  * the Systems Programming Group of the University of Utah Computer
  * Science Department and William Jolitz of UUNET Technologies Inc.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
  * 1. Redistributions of source code must retain the above copyright
  *    notice, this list of conditions and the following disclaimer.
  * 2. Redistributions in binary form must reproduce the above copyright
  *    notice, this list of conditions and the following disclaimer in the
  *    documentation and/or other materials provided with the distribution.
  * 4. Neither the name of the University nor the names of its contributors
  *    may be used to endorse or promote products derived from this software
  *    without specific prior written permission.
  *
  * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
  * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
  * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
  * ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
  * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
  * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
  * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
  * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
  * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
  * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
  * SUCH DAMAGE.
  *
  *	from: hp300: @(#)pmap.h 7.2 (Berkeley) 12/16/90
  *	from: @(#)pmap.h        7.4 (Berkeley) 5/12/91
  *	from: FreeBSD: src/sys/i386/include/pmap.h,v 1.70 2000/11/30
  * $FreeBSD$
  */
 
 #ifndef	_MACHINE_PMAP_H_
 #define	_MACHINE_PMAP_H_
 
 #include <sys/queue.h>
+#include <sys/_cpuset.h>
 #include <sys/_lock.h>
 #include <sys/_mutex.h>
 #include <machine/cache.h>
 #include <machine/tte.h>
 
 #define	PMAP_CONTEXT_MAX	8192
 
 typedef	struct pmap *pmap_t;
 
 struct md_page {
 	TAILQ_HEAD(, tte) tte_list;
 	struct	pmap *pmap;
 	uint32_t colors[DCACHE_COLORS];
 	int32_t	color;
 	uint32_t flags;
 };
 
 struct pmap {
 	struct	mtx pm_mtx;
 	struct	tte *pm_tsb;
 	vm_object_t pm_tsb_obj;
-	cpumask_t pm_active;
+	cpuset_t pm_active;
 	u_int	pm_context[MAXCPU];
 	struct	pmap_statistics pm_stats;
 };
 
 #define	PMAP_LOCK(pmap)		mtx_lock(&(pmap)->pm_mtx)
 #define	PMAP_LOCK_ASSERT(pmap, type)					\
 				mtx_assert(&(pmap)->pm_mtx, (type))
 #define	PMAP_LOCK_DESTROY(pmap)	mtx_destroy(&(pmap)->pm_mtx)
 #define	PMAP_LOCK_INIT(pmap)	mtx_init(&(pmap)->pm_mtx, "pmap",	\
 				    NULL, MTX_DEF | MTX_DUPOK)
 #define	PMAP_LOCKED(pmap)	mtx_owned(&(pmap)->pm_mtx)
 #define	PMAP_MTX(pmap)		(&(pmap)->pm_mtx)
 #define	PMAP_TRYLOCK(pmap)	mtx_trylock(&(pmap)->pm_mtx)
 #define	PMAP_UNLOCK(pmap)	mtx_unlock(&(pmap)->pm_mtx)
 
 #define	pmap_page_get_memattr(m)	VM_MEMATTR_DEFAULT
 #define	pmap_page_set_memattr(m, ma)	(void)0
 
 void	pmap_bootstrap(u_int cpu_impl);
 vm_paddr_t pmap_kextract(vm_offset_t va);
 void	pmap_kenter(vm_offset_t va, vm_page_t m);
 void	pmap_kremove(vm_offset_t);
 void	pmap_kenter_flags(vm_offset_t va, vm_paddr_t pa, u_long flags);
 void	pmap_kremove_flags(vm_offset_t va);
 boolean_t pmap_page_is_mapped(vm_page_t m);
 
 int	pmap_cache_enter(vm_page_t m, vm_offset_t va);
 void	pmap_cache_remove(vm_page_t m, vm_offset_t va);
 
 int	pmap_remove_tte(struct pmap *pm1, struct pmap *pm2, struct tte *tp,
 			vm_offset_t va);
 int	pmap_protect_tte(struct pmap *pm1, struct pmap *pm2, struct tte *tp,
 			 vm_offset_t va);
 
 void	pmap_map_tsb(void);
 void	pmap_set_kctx(void);
 
 #define	vtophys(va)	pmap_kextract((vm_offset_t)(va))
 
 extern	struct pmap kernel_pmap_store;
 #define	kernel_pmap	(&kernel_pmap_store)
 extern	vm_paddr_t phys_avail[];
 extern	vm_offset_t virtual_avail;
 extern	vm_offset_t virtual_end;
 
 #ifdef PMAP_STATS
 
 SYSCTL_DECL(_debug_pmap_stats);
 
 #define	PMAP_STATS_VAR(name) \
 	static long name; \
 	SYSCTL_LONG(_debug_pmap_stats, OID_AUTO, name, CTLFLAG_RW,	\
 	    &name, 0, "")
 
 #define	PMAP_STATS_INC(var) \
 	atomic_add_long(&var, 1)
 
 #else
 
 #define	PMAP_STATS_VAR(name)
 #define	PMAP_STATS_INC(var)
 
 #endif
 
 #endif /* !_MACHINE_PMAP_H_ */
Index: head/sys/sparc64/include/smp.h
===================================================================
--- head/sys/sparc64/include/smp.h	(revision 222812)
+++ head/sys/sparc64/include/smp.h	(revision 222813)
@@ -1,367 +1,378 @@
 /*-
  * Copyright (c) 2001 Jake Burkholder.
  * All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
  * 1. Redistributions of source code must retain the above copyright
  *    notice, this list of conditions and the following disclaimer.
  * 2. Redistributions in binary form must reproduce the above copyright
  *    notice, this list of conditions and the following disclaimer in the
  *    documentation and/or other materials provided with the distribution.
  *
  * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
  * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
  * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
  * ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
  * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
  * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
  * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
  * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
  * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
  * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
  * SUCH DAMAGE.
  *
  * $FreeBSD$
  */
 
 #ifndef	_MACHINE_SMP_H_
 #define	_MACHINE_SMP_H_
 
 #ifdef SMP
 
 #define	CPU_TICKSYNC		1
 #define	CPU_STICKSYNC		2
 #define	CPU_INIT		3
 #define	CPU_BOOTSTRAP		4
 
 #ifndef	LOCORE
 
+#include <sys/cpuset.h>
 #include <sys/proc.h>
 #include <sys/sched.h>
 
 #include <machine/intr_machdep.h>
 #include <machine/pcb.h>
 #include <machine/tte.h>
 
 #define	IDR_BUSY			0x0000000000000001ULL
 #define	IDR_NACK			0x0000000000000002ULL
 #define	IDR_CHEETAH_ALL_BUSY		0x5555555555555555ULL
 #define	IDR_CHEETAH_ALL_NACK		(~IDR_CHEETAH_ALL_BUSY)
 #define	IDR_CHEETAH_MAX_BN_PAIRS	32
 #define	IDR_JALAPENO_MAX_BN_PAIRS	4
 
 #define	IDC_ITID_SHIFT			14
 #define	IDC_BN_SHIFT			24
 
 #define	IPI_AST		PIL_AST
 #define	IPI_RENDEZVOUS	PIL_RENDEZVOUS
 #define	IPI_PREEMPT	PIL_PREEMPT
 #define	IPI_HARDCLOCK	PIL_HARDCLOCK
 #define	IPI_STOP	PIL_STOP
 #define	IPI_STOP_HARD	PIL_STOP
 
 #define	IPI_RETRIES	5000
 
 struct cpu_start_args {
 	u_int	csa_count;
 	u_int	csa_mid;
 	u_int	csa_state;
 	vm_offset_t csa_pcpu;
 	u_long	csa_tick;
 	u_long	csa_stick;
 	u_long	csa_ver;
 	struct	tte csa_ttes[PCPU_PAGES];
 };
 
 struct ipi_cache_args {
-	cpumask_t ica_mask;
+	cpuset_t ica_mask;
 	vm_paddr_t ica_pa;
 };
 
 struct ipi_rd_args {
-	cpumask_t ira_mask;
+	cpuset_t ira_mask;
 	register_t *ira_val;
 };
 
 struct ipi_tlb_args {
-	cpumask_t ita_mask;
+	cpuset_t ita_mask;
 	struct	pmap *ita_pmap;
 	u_long	ita_start;
 	u_long	ita_end;
 };
 #define	ita_va	ita_start
 
 struct pcpu;
 
 extern struct pcb stoppcbs[];
 
 void	cpu_mp_bootstrap(struct pcpu *pc);
 void	cpu_mp_shutdown(void);
 
-typedef	void cpu_ipi_selected_t(u_int, u_long, u_long, u_long);
+typedef	void cpu_ipi_selected_t(cpuset_t, u_long, u_long, u_long);
 extern	cpu_ipi_selected_t *cpu_ipi_selected;
 typedef	void cpu_ipi_single_t(u_int, u_long, u_long, u_long);
 extern	cpu_ipi_single_t *cpu_ipi_single;
 
 void	mp_init(u_int cpu_impl);
 
 extern	struct mtx ipi_mtx;
 extern	struct ipi_cache_args ipi_cache_args;
 extern	struct ipi_rd_args ipi_rd_args;
 extern	struct ipi_tlb_args ipi_tlb_args;
 
 extern	char *mp_tramp_code;
 extern	u_long mp_tramp_code_len;
 extern	u_long mp_tramp_tlb_slots;
 extern	u_long mp_tramp_func;
 
 extern	void mp_startup(void);
 
 extern	char tl_ipi_cheetah_dcache_page_inval[];
 extern	char tl_ipi_spitfire_dcache_page_inval[];
 extern	char tl_ipi_spitfire_icache_page_inval[];
 
 extern	char tl_ipi_level[];
 
 extern	char tl_ipi_stick_rd[];
 extern	char tl_ipi_tick_rd[];
 
 extern	char tl_ipi_tlb_context_demap[];
 extern	char tl_ipi_tlb_page_demap[];
 extern	char tl_ipi_tlb_range_demap[];
 
 static __inline void
 ipi_all_but_self(u_int ipi)
 {
 
 	cpu_ipi_selected(PCPU_GET(other_cpus), 0, (u_long)tl_ipi_level, ipi);
 }
 
 static __inline void
-ipi_selected(u_int cpus, u_int ipi)
+ipi_selected(cpuset_t cpus, u_int ipi)
 {
 
 	cpu_ipi_selected(cpus, 0, (u_long)tl_ipi_level, ipi);
 }
 
 static __inline void
 ipi_cpu(int cpu, u_int ipi)
 {
 
 	cpu_ipi_single(cpu, 0, (u_long)tl_ipi_level, ipi);
 }
 
 #if defined(_MACHINE_PMAP_H_) && defined(_SYS_MUTEX_H_)
 
 static __inline void *
 ipi_dcache_page_inval(void *func, vm_paddr_t pa)
 {
 	struct ipi_cache_args *ica;
 
 	if (smp_cpus == 1)
 		return (NULL);
 	sched_pin();
 	ica = &ipi_cache_args;
 	mtx_lock_spin(&ipi_mtx);
 	ica->ica_mask = all_cpus;
 	ica->ica_pa = pa;
 	cpu_ipi_selected(PCPU_GET(other_cpus), 0, (u_long)func, (u_long)ica);
 	return (&ica->ica_mask);
 }
 
 static __inline void *
 ipi_icache_page_inval(void *func, vm_paddr_t pa)
 {
 	struct ipi_cache_args *ica;
 
 	if (smp_cpus == 1)
 		return (NULL);
 	sched_pin();
 	ica = &ipi_cache_args;
 	mtx_lock_spin(&ipi_mtx);
 	ica->ica_mask = all_cpus;
 	ica->ica_pa = pa;
 	cpu_ipi_selected(PCPU_GET(other_cpus), 0, (u_long)func, (u_long)ica);
 	return (&ica->ica_mask);
 }
 
 static __inline void *
 ipi_rd(u_int cpu, void *func, u_long *val)
 {
 	struct ipi_rd_args *ira;
 
 	if (smp_cpus == 1)
 		return (NULL);
 	sched_pin();
 	ira = &ipi_rd_args;
 	mtx_lock_spin(&ipi_mtx);
-	ira->ira_mask = 1 << cpu | PCPU_GET(cpumask);
+	ira->ira_mask = PCPU_GET(cpumask);
+	CPU_SET(cpu, &ira->ira_mask);
 	ira->ira_val = val;
 	cpu_ipi_single(cpu, 0, (u_long)func, (u_long)ira);
 	return (&ira->ira_mask);
 }
 
 static __inline void *
 ipi_tlb_context_demap(struct pmap *pm)
 {
 	struct ipi_tlb_args *ita;
-	cpumask_t cpus;
+	cpuset_t cpus;
 
 	if (smp_cpus == 1)
 		return (NULL);
 	sched_pin();
-	if ((cpus = (pm->pm_active & PCPU_GET(other_cpus))) == 0) {
+	cpus = pm->pm_active;
+	CPU_AND(&cpus, PCPU_PTR(other_cpus));
+	if (CPU_EMPTY(&cpus)) {
 		sched_unpin();
 		return (NULL);
 	}
 	ita = &ipi_tlb_args;
 	mtx_lock_spin(&ipi_mtx);
-	ita->ita_mask = cpus | PCPU_GET(cpumask);
+	ita->ita_mask = cpus;
+	CPU_OR(&ita->ita_mask, PCPU_PTR(cpumask));
 	ita->ita_pmap = pm;
 	cpu_ipi_selected(cpus, 0, (u_long)tl_ipi_tlb_context_demap,
 	    (u_long)ita);
 	return (&ita->ita_mask);
 }
 
 static __inline void *
 ipi_tlb_page_demap(struct pmap *pm, vm_offset_t va)
 {
 	struct ipi_tlb_args *ita;
-	cpumask_t cpus;
+	cpuset_t cpus;
 
 	if (smp_cpus == 1)
 		return (NULL);
 	sched_pin();
-	if ((cpus = (pm->pm_active & PCPU_GET(other_cpus))) == 0) {
+	cpus = pm->pm_active;
+	CPU_AND(&cpus, PCPU_PTR(other_cpus));
+	if (CPU_EMPTY(&cpus)) {
 		sched_unpin();
 		return (NULL);
 	}
 	ita = &ipi_tlb_args;
 	mtx_lock_spin(&ipi_mtx);
-	ita->ita_mask = cpus | PCPU_GET(cpumask);
+	ita->ita_mask = cpus;
+	CPU_OR(&ita->ita_mask, PCPU_PTR(cpumask));
 	ita->ita_pmap = pm;
 	ita->ita_va = va;
 	cpu_ipi_selected(cpus, 0, (u_long)tl_ipi_tlb_page_demap, (u_long)ita);
 	return (&ita->ita_mask);
 }
 
 static __inline void *
 ipi_tlb_range_demap(struct pmap *pm, vm_offset_t start, vm_offset_t end)
 {
 	struct ipi_tlb_args *ita;
-	cpumask_t cpus;
+	cpuset_t cpus;
 
 	if (smp_cpus == 1)
 		return (NULL);
 	sched_pin();
-	if ((cpus = (pm->pm_active & PCPU_GET(other_cpus))) == 0) {
+	cpus = pm->pm_active;
+	CPU_AND(&cpus, PCPU_PTR(other_cpus));
+	if (CPU_EMPTY(&cpus)) {
 		sched_unpin();
 		return (NULL);
 	}
 	ita = &ipi_tlb_args;
 	mtx_lock_spin(&ipi_mtx);
-	ita->ita_mask = cpus | PCPU_GET(cpumask);
+	ita->ita_mask = cpus;
+	CPU_OR(&ita->ita_mask, PCPU_PTR(cpumask));
 	ita->ita_pmap = pm;
 	ita->ita_start = start;
 	ita->ita_end = end;
 	cpu_ipi_selected(cpus, 0, (u_long)tl_ipi_tlb_range_demap,
 	    (u_long)ita);
 	return (&ita->ita_mask);
 }
 
 static __inline void
 ipi_wait(void *cookie)
 {
-	volatile cpumask_t *mask;
+	volatile cpuset_t *mask;
 
 	if ((mask = cookie) != NULL) {
-		atomic_clear_int(mask, PCPU_GET(cpumask));
-		while (*mask != 0)
+		CPU_NAND_ATOMIC(mask, PCPU_PTR(cpumask));
+		while (!CPU_EMPTY(mask))
 			;
 		mtx_unlock_spin(&ipi_mtx);
 		sched_unpin();
 	}
 }
 
 #endif /* _MACHINE_PMAP_H_ && _SYS_MUTEX_H_ */
 
 #endif /* !LOCORE */
 
 #else
 
 #ifndef	LOCORE
 
 static __inline void *
 ipi_dcache_page_inval(void *func __unused, vm_paddr_t pa __unused)
 {
 
 	return (NULL);
 }
 
 static __inline void *
 ipi_icache_page_inval(void *func __unused, vm_paddr_t pa __unused)
 {
 
 	return (NULL);
 }
 
 static __inline void *
 ipi_rd(u_int cpu __unused, void *func __unused, u_long *val __unused)
 {
 
 	return (NULL);
 }
 
 static __inline void *
 ipi_tlb_context_demap(struct pmap *pm __unused)
 {
 
 	return (NULL);
 }
 
 static __inline void *
 ipi_tlb_page_demap(struct pmap *pm __unused, vm_offset_t va __unused)
 {
 
 	return (NULL);
 }
 
 static __inline void *
 ipi_tlb_range_demap(struct pmap *pm __unused, vm_offset_t start __unused,
     __unused vm_offset_t end)
 {
 
 	return (NULL);
 }
 
 static __inline void
 ipi_wait(void *cookie)
 {
 
 }
 
 static __inline void
 tl_ipi_cheetah_dcache_page_inval(void)
 {
 
 }
 
 static __inline void
 tl_ipi_spitfire_dcache_page_inval(void)
 {
 
 }
 
 static __inline void
 tl_ipi_spitfire_icache_page_inval(void)
 {
 
 }
 
 #endif /* !LOCORE */
 
 #endif /* SMP */
 
 #endif /* !_MACHINE_SMP_H_ */
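
Note on the smp.h change above: the integer-mask expressions (`a & b`, `a | b`, `1 << cpu`, `mask == 0`) are replaced by the cpuset macros CPU_AND, CPU_OR, CPU_SET and CPU_EMPTY, and ipi_wait() now clears its own bit with CPU_NAND_ATOMIC. The following is only a minimal userland sketch of that pattern, not the kernel code: demo_cpuset_t and every demo_* helper are hypothetical stand-ins for cpuset_t and the CPU_* macros, the 128-CPU size is an arbitrary assumption, and the clearing step is not atomic here.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for cpuset_t: a fixed-size array of words. */
#define	DEMO_MAXCPU	128
#define	DEMO_NCPUBITS	(sizeof(unsigned long) * CHAR_BIT)
#define	DEMO_NWORDS	((DEMO_MAXCPU + DEMO_NCPUBITS - 1) / DEMO_NCPUBITS)

typedef struct {
	unsigned long bits[DEMO_NWORDS];
} demo_cpuset_t;

/* Clear every bit (plays the role of CPU_ZERO). */
static inline void
demo_cpu_zero(demo_cpuset_t *s)
{
	for (size_t i = 0; i < DEMO_NWORDS; i++)
		s->bits[i] = 0;
}

/* Set one CPU's bit; replaces the old "1 << cpu" idiom. */
static inline void
demo_cpu_set(unsigned cpu, demo_cpuset_t *s)
{
	assert(cpu < DEMO_MAXCPU);
	s->bits[cpu / DEMO_NCPUBITS] |= 1UL << (cpu % DEMO_NCPUBITS);
}

static inline bool
demo_cpu_isset(unsigned cpu, const demo_cpuset_t *s)
{
	assert(cpu < DEMO_MAXCPU);
	return ((s->bits[cpu / DEMO_NCPUBITS] &
	    (1UL << (cpu % DEMO_NCPUBITS))) != 0);
}

/* d &= s, word by word. */
static inline void
demo_cpu_and(demo_cpuset_t *d, const demo_cpuset_t *s)
{
	for (size_t i = 0; i < DEMO_NWORDS; i++)
		d->bits[i] &= s->bits[i];
}

/* d |= s, word by word. */
static inline void
demo_cpu_or(demo_cpuset_t *d, const demo_cpuset_t *s)
{
	for (size_t i = 0; i < DEMO_NWORDS; i++)
		d->bits[i] |= s->bits[i];
}

/* d &= ~s; NOT atomic in this sketch, unlike CPU_NAND_ATOMIC. */
static inline void
demo_cpu_nand(demo_cpuset_t *d, const demo_cpuset_t *s)
{
	for (size_t i = 0; i < DEMO_NWORDS; i++)
		d->bits[i] &= ~s->bits[i];
}

/* True when no bit is set; replaces the old "mask == 0" test. */
static inline bool
demo_cpu_empty(const demo_cpuset_t *s)
{
	for (size_t i = 0; i < DEMO_NWORDS; i++)
		if (s->bits[i] != 0)
			return (false);
	return (true);
}

/*
 * Same shape as the converted ipi_tlb_*_demap() functions:
 * targets = active & others; if that is empty there is nothing to do,
 * otherwise the mask to wait on is targets | self.
 */
static inline bool
demo_select_targets(const demo_cpuset_t *active, const demo_cpuset_t *others,
    const demo_cpuset_t *self, demo_cpuset_t *targets, demo_cpuset_t *wmask)
{
	*targets = *active;
	demo_cpu_and(targets, others);
	if (demo_cpu_empty(targets))
		return (false);
	*wmask = *targets;
	demo_cpu_or(wmask, self);
	return (true);
}
```

The set is always passed by pointer and combined word by word, which is why the single-expression tests in the old code become a copy followed by separate CPU_AND and CPU_EMPTY steps in the new code.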
Index: head/sys/sparc64/sparc64/genassym.c
===================================================================
--- head/sys/sparc64/sparc64/genassym.c	(revision 222812)
+++ head/sys/sparc64/sparc64/genassym.c	(revision 222813)
@@ -1,256 +1,258 @@
 /*-
  * Copyright (c) 2001 Jake Burkholder.
  * All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
  * 1. Redistributions of source code must retain the above copyright
  *    notice, this list of conditions and the following disclaimer.
  * 2. Redistributions in binary form must reproduce the above copyright
  *    notice, this list of conditions and the following disclaimer in the
  *    documentation and/or other materials provided with the distribution.
  *
  * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
  * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
  * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
  * ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
  * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
  * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
  * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
  * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
  * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
  * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
  * SUCH DAMAGE.
  *
  *	from: @(#)genassym.c	5.11 (Berkeley) 5/10/91
  */
 
 #include <sys/cdefs.h>
 __FBSDID("$FreeBSD$");
 
 #include "opt_kstack_pages.h"
 
 #include <sys/param.h>
 #include <sys/assym.h>
 #include <sys/ktr.h>
 #include <sys/proc.h>
 #include <sys/smp.h>
 #include <sys/vmmeter.h>
+#include <sys/_cpuset.h>
 
 #include <vm/vm.h>
 #include <vm/vm_page.h>
 #include <vm/vm_map.h>
 
 #ifdef SUN4U
 #include <machine/cache.h>
 #endif
 #include <machine/pcb.h>
 #include <machine/setjmp.h>
 #include <machine/smp.h>
 #include <machine/tlb.h>
 #include <machine/tte.h>
 #include <machine/vmparam.h>
 
 ASSYM(KERNBASE, KERNBASE);
 
 ASSYM(KSTACK_PAGES, KSTACK_PAGES);
 ASSYM(PCPU_PAGES, PCPU_PAGES);
 
 ASSYM(TAR_VPN_SHIFT, TAR_VPN_SHIFT);
 
+ASSYM(_NCPUBITS, _NCPUBITS);
+
 #ifdef SUN4U
 ASSYM(TLB_DEMAP_ALL, TLB_DEMAP_ALL);
 #endif
 ASSYM(TLB_DEMAP_CONTEXT, TLB_DEMAP_CONTEXT);
 ASSYM(TLB_DEMAP_NUCLEUS, TLB_DEMAP_NUCLEUS);
 ASSYM(TLB_DEMAP_PAGE, TLB_DEMAP_PAGE);
 ASSYM(TLB_DEMAP_PRIMARY, TLB_DEMAP_PRIMARY);
 
 ASSYM(INT_SHIFT, INT_SHIFT);
 ASSYM(PTR_SHIFT, PTR_SHIFT);
 
 ASSYM(PAGE_SHIFT, PAGE_SHIFT);
 ASSYM(PAGE_SHIFT_8K, PAGE_SHIFT_8K);
 ASSYM(PAGE_SHIFT_4M, PAGE_SHIFT_4M);
 ASSYM(PAGE_SIZE, PAGE_SIZE);
 ASSYM(PAGE_SIZE_4M, PAGE_SIZE_4M);
 
 #ifdef SMP
 ASSYM(CSA_PCPU, offsetof(struct cpu_start_args, csa_pcpu));
 ASSYM(CSA_STATE, offsetof(struct cpu_start_args, csa_state));
 #ifdef SUN4U
 ASSYM(CSA_MID, offsetof(struct cpu_start_args, csa_mid));
 ASSYM(CSA_STICK, offsetof(struct cpu_start_args, csa_stick));
 ASSYM(CSA_TICK, offsetof(struct cpu_start_args, csa_tick));
 ASSYM(CSA_TTES, offsetof(struct cpu_start_args, csa_ttes));
 ASSYM(CSA_VER, offsetof(struct cpu_start_args, csa_ver));
 #endif
 #endif
 
 #ifdef SUN4U
 ASSYM(DC_SIZE, offsetof(struct cacheinfo, dc_size));
 ASSYM(DC_LINESIZE, offsetof(struct cacheinfo, dc_linesize));
 ASSYM(IC_SIZE, offsetof(struct cacheinfo, ic_size));
 ASSYM(IC_LINESIZE, offsetof(struct cacheinfo, ic_linesize));
 #endif
 
 ASSYM(KTR_SIZEOF, sizeof(struct ktr_entry));
 ASSYM(KTR_LINE, offsetof(struct ktr_entry, ktr_line));
 ASSYM(KTR_FILE, offsetof(struct ktr_entry, ktr_file));
 ASSYM(KTR_DESC, offsetof(struct ktr_entry, ktr_desc));
 ASSYM(KTR_CPU, offsetof(struct ktr_entry, ktr_cpu));
 ASSYM(KTR_TIMESTAMP, offsetof(struct ktr_entry, ktr_timestamp));
 ASSYM(KTR_PARM1, offsetof(struct ktr_entry, ktr_parms[0]));
 ASSYM(KTR_PARM2, offsetof(struct ktr_entry, ktr_parms[1]));
 ASSYM(KTR_PARM3, offsetof(struct ktr_entry, ktr_parms[2]));
 ASSYM(KTR_PARM4, offsetof(struct ktr_entry, ktr_parms[3]));
 ASSYM(KTR_PARM5, offsetof(struct ktr_entry, ktr_parms[4]));
 ASSYM(KTR_PARM6, offsetof(struct ktr_entry, ktr_parms[5]));
 
 ASSYM(TTE_SHIFT, TTE_SHIFT);
 #ifdef SUN4U
 ASSYM(TTE_VPN, offsetof(struct tte, tte_vpn));
 ASSYM(TTE_DATA, offsetof(struct tte, tte_data));
 
 ASSYM(TD_V, TD_V);
 ASSYM(TD_EXEC, TD_EXEC);
 ASSYM(TD_REF, TD_REF);
 ASSYM(TD_SW, TD_SW);
 ASSYM(TD_L, TD_L);
 ASSYM(TD_CP, TD_CP);
 ASSYM(TD_CV, TD_CV);
 ASSYM(TD_W, TD_W);
 
 ASSYM(TS_MIN, TS_MIN);
 ASSYM(TS_MAX, TS_MAX);
 ASSYM(TLB_DAR_SLOT_SHIFT, TLB_DAR_SLOT_SHIFT);
 ASSYM(TLB_CXR_PGSZ_MASK, TLB_CXR_PGSZ_MASK);
 ASSYM(TLB_DIRECT_ADDRESS_MASK, TLB_DIRECT_ADDRESS_MASK);
 ASSYM(TLB_DIRECT_TO_TTE_MASK, TLB_DIRECT_TO_TTE_MASK);
 ASSYM(TV_SIZE_BITS, TV_SIZE_BITS);
 #endif
 
 ASSYM(V_INTR, offsetof(struct vmmeter, v_intr));
 
 ASSYM(MAXCOMLEN, MAXCOMLEN);
 ASSYM(PC_CURTHREAD, offsetof(struct pcpu, pc_curthread));
 ASSYM(PC_CURPCB, offsetof(struct pcpu, pc_curpcb));
 ASSYM(PC_CPUID, offsetof(struct pcpu, pc_cpuid));
-ASSYM(PC_CPUMASK, offsetof(struct pcpu, pc_cpumask));
 ASSYM(PC_IRHEAD, offsetof(struct pcpu, pc_irhead));
 ASSYM(PC_IRTAIL, offsetof(struct pcpu, pc_irtail));
 ASSYM(PC_IRFREE, offsetof(struct pcpu, pc_irfree));
 ASSYM(PC_CNT, offsetof(struct pcpu, pc_cnt));
 ASSYM(PC_SIZEOF, sizeof(struct pcpu));
 
 #ifdef SUN4U
 ASSYM(PC_CACHE, offsetof(struct pcpu, pc_cache));
 ASSYM(PC_MID, offsetof(struct pcpu, pc_mid));
 ASSYM(PC_PMAP, offsetof(struct pcpu, pc_pmap));
 ASSYM(PC_TLB_CTX, offsetof(struct pcpu, pc_tlb_ctx));
 ASSYM(PC_TLB_CTX_MAX, offsetof(struct pcpu, pc_tlb_ctx_max));
 ASSYM(PC_TLB_CTX_MIN, offsetof(struct pcpu, pc_tlb_ctx_min));
 #endif
 
 ASSYM(IR_NEXT, offsetof(struct intr_request, ir_next));
 ASSYM(IR_FUNC, offsetof(struct intr_request, ir_func));
 ASSYM(IR_ARG, offsetof(struct intr_request, ir_arg));
 ASSYM(IR_PRI, offsetof(struct intr_request, ir_pri));
 ASSYM(IR_VEC, offsetof(struct intr_request, ir_vec));
 
 #if defined(SUN4U) && defined(SMP)
 ASSYM(ICA_PA, offsetof(struct ipi_cache_args, ica_pa));
 
 ASSYM(IRA_MASK, offsetof(struct ipi_rd_args, ira_mask));
 ASSYM(IRA_VAL, offsetof(struct ipi_rd_args, ira_val));
 
 ASSYM(ITA_MASK, offsetof(struct ipi_tlb_args, ita_mask));
 ASSYM(ITA_PMAP, offsetof(struct ipi_tlb_args, ita_pmap));
 ASSYM(ITA_START, offsetof(struct ipi_tlb_args, ita_start));
 ASSYM(ITA_END, offsetof(struct ipi_tlb_args, ita_end));
 ASSYM(ITA_VA, offsetof(struct ipi_tlb_args, ita_va));
 #endif
 
 ASSYM(IV_FUNC, offsetof(struct intr_vector, iv_func));
 ASSYM(IV_ARG, offsetof(struct intr_vector, iv_arg));
 ASSYM(IV_PRI, offsetof(struct intr_vector, iv_pri));
 
 ASSYM(TDF_ASTPENDING, TDF_ASTPENDING);
 ASSYM(TDF_NEEDRESCHED, TDF_NEEDRESCHED);
 
 ASSYM(MD_UTRAP, offsetof(struct mdproc, md_utrap));
 
 ASSYM(P_COMM, offsetof(struct proc, p_comm));
 ASSYM(P_MD, offsetof(struct proc, p_md));
 ASSYM(P_PID, offsetof(struct proc, p_pid));
 ASSYM(P_VMSPACE, offsetof(struct proc, p_vmspace));
 
 ASSYM(TD_FLAGS, offsetof(struct thread, td_flags));
 ASSYM(TD_FRAME, offsetof(struct thread, td_frame));
 ASSYM(TD_KSTACK, offsetof(struct thread, td_kstack));
 ASSYM(TD_LOCK, offsetof(struct thread, td_lock));
 ASSYM(TD_PCB, offsetof(struct thread, td_pcb));
 ASSYM(TD_PROC, offsetof(struct thread, td_proc));
 ASSYM(TD_MD, offsetof(struct thread, td_md));
 ASSYM(MD_SAVED_PIL, offsetof(struct mdthread, md_saved_pil));
 
 ASSYM(PCB_SIZEOF, sizeof(struct pcb));
 ASSYM(PCB_RW, offsetof(struct pcb, pcb_rw));
 ASSYM(PCB_KFP, offsetof(struct pcb, pcb_kfp));
 ASSYM(PCB_UFP, offsetof(struct pcb, pcb_ufp));
 ASSYM(PCB_RWSP, offsetof(struct pcb, pcb_rwsp));
 ASSYM(PCB_FLAGS, offsetof(struct pcb, pcb_flags));
 ASSYM(PCB_NSAVED, offsetof(struct pcb, pcb_nsaved));
 ASSYM(PCB_PC, offsetof(struct pcb, pcb_pc));
 ASSYM(PCB_SP, offsetof(struct pcb, pcb_sp));
 ASSYM(PCB_PAD, offsetof(struct pcb, pcb_pad));
 
 ASSYM(VM_PMAP, offsetof(struct vmspace, vm_pmap));
 ASSYM(PM_ACTIVE, offsetof(struct pmap, pm_active));
 ASSYM(PM_CONTEXT, offsetof(struct pmap, pm_context));
 ASSYM(PM_TSB, offsetof(struct pmap, pm_tsb));
 
 ASSYM(_JB_FP, offsetof(struct _jmp_buf, _jb[_JB_FP]));
 ASSYM(_JB_PC, offsetof(struct _jmp_buf, _jb[_JB_PC]));
 ASSYM(_JB_SP, offsetof(struct _jmp_buf, _jb[_JB_SP]));
 ASSYM(_JB_SIGFLAG, offsetof(struct _jmp_buf, _jb[_JB_SIGFLAG]));
 ASSYM(_JB_SIGMASK, offsetof(struct _jmp_buf, _jb[_JB_SIGMASK]));
 
 ASSYM(TF_G0, offsetof(struct trapframe, tf_global[0]));
 ASSYM(TF_G1, offsetof(struct trapframe, tf_global[1]));
 ASSYM(TF_G2, offsetof(struct trapframe, tf_global[2]));
 ASSYM(TF_G3, offsetof(struct trapframe, tf_global[3]));
 ASSYM(TF_G4, offsetof(struct trapframe, tf_global[4]));
 ASSYM(TF_G5, offsetof(struct trapframe, tf_global[5]));
 ASSYM(TF_G6, offsetof(struct trapframe, tf_global[6]));
 ASSYM(TF_G7, offsetof(struct trapframe, tf_global[7]));
 ASSYM(TF_O0, offsetof(struct trapframe, tf_out[0]));
 ASSYM(TF_O1, offsetof(struct trapframe, tf_out[1]));
 ASSYM(TF_O2, offsetof(struct trapframe, tf_out[2]));
 ASSYM(TF_O3, offsetof(struct trapframe, tf_out[3]));
 ASSYM(TF_O4, offsetof(struct trapframe, tf_out[4]));
 ASSYM(TF_O5, offsetof(struct trapframe, tf_out[5]));
 ASSYM(TF_O6, offsetof(struct trapframe, tf_out[6]));
 ASSYM(TF_O7, offsetof(struct trapframe, tf_out[7]));
 ASSYM(TF_FPRS, offsetof(struct trapframe, tf_fprs));
 ASSYM(TF_FSR, offsetof(struct trapframe, tf_fsr));
 ASSYM(TF_GSR, offsetof(struct trapframe, tf_gsr));
 ASSYM(TF_PIL, offsetof(struct trapframe, tf_pil));
 #ifdef SUN4U
 ASSYM(TF_LEVEL, offsetof(struct trapframe, tf_level));
 ASSYM(TF_SFAR, offsetof(struct trapframe, tf_sfar));
 ASSYM(TF_SFSR, offsetof(struct trapframe, tf_sfsr));
 ASSYM(TF_TAR, offsetof(struct trapframe, tf_tar));
 ASSYM(TF_TYPE, offsetof(struct trapframe, tf_type));
 ASSYM(TF_Y, offsetof(struct trapframe, tf_y));
 #endif
 ASSYM(TF_TNPC, offsetof(struct trapframe, tf_tnpc));
 ASSYM(TF_TPC, offsetof(struct trapframe, tf_tpc));
 ASSYM(TF_TSTATE, offsetof(struct trapframe, tf_tstate));
 ASSYM(TF_WSTATE, offsetof(struct trapframe, tf_wstate));
 ASSYM(TF_SIZEOF, sizeof(struct trapframe));
 
 ASSYM(VM_MIN_DIRECT_ADDRESS, VM_MIN_DIRECT_ADDRESS);
 ASSYM(VM_MIN_PROM_ADDRESS, VM_MIN_PROM_ADDRESS);
 ASSYM(VM_MAX_PROM_ADDRESS, VM_MAX_PROM_ADDRESS);
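
Note on the genassym.c change above: PC_CPUMASK is dropped and _NCPUBITS is exported instead. A plausible reading (an inference, not something this diff states) is that assembly code now has to find a CPU's bit inside a multi-word set by splitting the CPU id into a word index and a bit offset. The sketch below shows that arithmetic in C under the assumption of 64-bit words; DEMO_NCPUBITS and the demo_* helpers are hypothetical names, not kernel symbols.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical value standing in for _NCPUBITS (bits per set word). */
#define	DEMO_NCPUBITS	64u

/* Index of the word that holds the bit for this CPU id. */
static inline unsigned
demo_word_index(unsigned cpu)
{
	return (cpu / DEMO_NCPUBITS);
}

/* Mask selecting this CPU's bit within its word. */
static inline uint64_t
demo_bit_mask(unsigned cpu)
{
	return ((uint64_t)1 << (cpu % DEMO_NCPUBITS));
}

/* Test a CPU's bit in a set stored as an array of nwords words. */
static inline int
demo_test_cpu(const uint64_t *words, unsigned nwords, unsigned cpu)
{
	assert(demo_word_index(cpu) < nwords);
	return ((words[demo_word_index(cpu)] & demo_bit_mask(cpu)) != 0);
}
```

With 64-bit words, CPU 70 lands in word 1 at bit 6, so a single-word load and shift (as the old cpumask_t code could use) is no longer enough once more CPUs are supported than fit in one word.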
Index: head/sys/sparc64/sparc64/intr_machdep.c
===================================================================
--- head/sys/sparc64/sparc64/intr_machdep.c	(revision 222812)
+++ head/sys/sparc64/sparc64/intr_machdep.c	(revision 222813)
@@ -1,547 +1,549 @@
 /*-
  * Copyright (c) 1991 The Regents of the University of California.
  * All rights reserved.
  *
  * This code is derived from software contributed to Berkeley by
  * William Jolitz.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
  * 1. Redistributions of source code must retain the above copyright
  *    notice, this list of conditions and the following disclaimer.
  * 2. Redistributions in binary form must reproduce the above copyright
  *    notice, this list of conditions and the following disclaimer in the
  *    documentation and/or other materials provided with the distribution.
  * 4. Neither the name of the University nor the names of its contributors
  *    may be used to endorse or promote products derived from this software
  *    without specific prior written permission.
  *
  * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
  * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
  * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
  * ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
  * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
  * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
  * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
  * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
  * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
  * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
  * SUCH DAMAGE.
  */
 /*-
  * Copyright (c) 2001 Jake Burkholder.
  * All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
  * 1. Redistributions of source code must retain the above copyright
  *    notice, this list of conditions and the following disclaimer.
  * 2. Redistributions in binary form must reproduce the above copyright
  *    notice, this list of conditions and the following disclaimer in the
  *    documentation and/or other materials provided with the distribution.
  *
  * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
  * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
  * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
  * ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
  * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
  * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
  * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
  * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
  * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
  * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
  * SUCH DAMAGE.
  *
  *	from: @(#)isa.c	7.2 (Berkeley) 5/13/91
  *	form: src/sys/i386/isa/intr_machdep.c,v 1.57 2001/07/20
  */
 
 #include <sys/cdefs.h>
 __FBSDID("$FreeBSD$");
 
 #include <sys/param.h>
 #include <sys/systm.h>
 #include <sys/bus.h>
 #include <sys/errno.h>
 #include <sys/interrupt.h>
 #include <sys/kernel.h>
 #include <sys/lock.h>
 #include <sys/mutex.h>
 #include <sys/pcpu.h>
 #include <sys/proc.h>
 #include <sys/smp.h>
 #include <sys/sx.h>
 
 #include <machine/frame.h>
 #include <machine/intr_machdep.h>
 
 #define	MAX_STRAY_LOG	5
 
 CTASSERT((1 << IV_SHIFT) == sizeof(struct intr_vector));
 
 ih_func_t *intr_handlers[PIL_MAX];
 uint16_t pil_countp[PIL_MAX];
 
 struct intr_vector intr_vectors[IV_MAX];
 uint16_t intr_countp[IV_MAX];
 static u_long intr_stray_count[IV_MAX];
 
 static const char *const pil_names[] = {
 	"stray",
 	"low",		/* PIL_LOW */
 	"ithrd",	/* PIL_ITHREAD */
 	"rndzvs",	/* PIL_RENDEZVOUS */
 	"ast",		/* PIL_AST */
 	"stop",		/* PIL_STOP */
 	"preempt",	/* PIL_PREEMPT */
 	"hardclock",	/* PIL_HARDCLOCK */
 	"stray", "stray", "stray", "stray",
 	"filter",	/* PIL_FILTER */
 	"bridge",	/* PIL_BRIDGE */
 	"tick",		/* PIL_TICK */
 };
 
 /* protect the intr_vectors table */
 static struct sx intr_table_lock;
 /* protect intrcnt_index */
 static struct mtx intrcnt_lock;
 
 #ifdef SMP
 static int assign_cpu;
 
 static void intr_assign_next_cpu(struct intr_vector *iv);
 static void intr_shuffle_irqs(void *arg __unused);
 #endif
 
 static int intr_assign_cpu(void *arg, u_char cpu);
 static void intr_execute_handlers(void *);
 static void intr_stray_level(struct trapframe *);
 static void intr_stray_vector(void *);
 static int intrcnt_setname(const char *, int);
 static void intrcnt_updatename(int, const char *, int);
 
 static void
 intrcnt_updatename(int vec, const char *name, int ispil)
 {
 	static int intrcnt_index, stray_pil_index, stray_vec_index;
 	int name_index;
 
 	mtx_lock_spin(&intrcnt_lock);
 	if (intrnames[0] == '\0') {
 		/* for bitbucket */
 		if (bootverbose)
 			printf("initalizing intr_countp\n");
 		intrcnt_setname("???", intrcnt_index++);
 
 		stray_vec_index = intrcnt_index++;
 		intrcnt_setname("stray", stray_vec_index);
 		for (name_index = 0; name_index < IV_MAX; name_index++)
 			intr_countp[name_index] = stray_vec_index;
 
 		stray_pil_index = intrcnt_index++;
 		intrcnt_setname("pil", stray_pil_index);
 		for (name_index = 0; name_index < PIL_MAX; name_index++)
 			pil_countp[name_index] = stray_pil_index;
 	}
 
 	if (name == NULL)
 		name = "???";
 
 	if (!ispil && intr_countp[vec] != stray_vec_index)
 		name_index = intr_countp[vec];
 	else if (ispil && pil_countp[vec] != stray_pil_index)
 		name_index = pil_countp[vec];
 	else
 		name_index = intrcnt_index++;
 
 	if (intrcnt_setname(name, name_index))
 		name_index = 0;
 
 	if (!ispil)
 		intr_countp[vec] = name_index;
 	else
 		pil_countp[vec] = name_index;
 	mtx_unlock_spin(&intrcnt_lock);
 }
 
 static int
 intrcnt_setname(const char *name, int index)
 {
 
 	if (intrnames + (MAXCOMLEN + 1) * index >= eintrnames)
 		return (E2BIG);
 	snprintf(intrnames + (MAXCOMLEN + 1) * index, MAXCOMLEN + 1, "%-*s",
 	    MAXCOMLEN, name);
 	return (0);
 }
 
 void
 intr_setup(int pri, ih_func_t *ihf, int vec, iv_func_t *ivf, void *iva)
 {
 	char pilname[MAXCOMLEN + 1];
 	register_t s;
 
 	s = intr_disable();
 	if (vec != -1) {
 		intr_vectors[vec].iv_func = ivf;
 		intr_vectors[vec].iv_arg = iva;
 		intr_vectors[vec].iv_pri = pri;
 		intr_vectors[vec].iv_vec = vec;
 	}
 	intr_handlers[pri] = ihf;
 	intr_restore(s);
 	snprintf(pilname, MAXCOMLEN + 1, "pil%d: %s", pri, pil_names[pri]);
 	intrcnt_updatename(pri, pilname, 1);
 }
 
 static void
 intr_stray_level(struct trapframe *tf)
 {
 
 	printf("stray level interrupt %ld\n", tf->tf_level);
 }
 
 static void
 intr_stray_vector(void *cookie)
 {
 	struct intr_vector *iv;
 
 	iv = cookie;
 	if (intr_stray_count[iv->iv_vec] < MAX_STRAY_LOG) {
 		printf("stray vector interrupt %d\n", iv->iv_vec);
 		intr_stray_count[iv->iv_vec]++;
 		if (intr_stray_count[iv->iv_vec] >= MAX_STRAY_LOG)
 			printf("got %d stray interrupt %d's: not logging "
 			    "anymore\n", MAX_STRAY_LOG, iv->iv_vec);
 	}
 }
 
 void
 intr_init1()
 {
 	int i;
 
 	/* Mark all interrupts as being stray. */
 	for (i = 0; i < PIL_MAX; i++)
 		intr_handlers[i] = intr_stray_level;
 	for (i = 0; i < IV_MAX; i++) {
 		intr_vectors[i].iv_func = intr_stray_vector;
 		intr_vectors[i].iv_arg = &intr_vectors[i];
 		intr_vectors[i].iv_pri = PIL_LOW;
 		intr_vectors[i].iv_vec = i;
 		intr_vectors[i].iv_refcnt = 0;
 	}
 	intr_handlers[PIL_LOW] = intr_fast;
 }
 
 void
 intr_init2()
 {
 
 	sx_init(&intr_table_lock, "intr sources");
 	mtx_init(&intrcnt_lock, "intrcnt", NULL, MTX_SPIN);
 }
 
 static int
 intr_assign_cpu(void *arg, u_char cpu)
 {
 #ifdef SMP
 	struct pcpu *pc;
 	struct intr_vector *iv;
 
 	/*
 	 * Don't do anything during early boot.  We will pick up the
 	 * assignment once the APs are started.
 	 */
 	if (assign_cpu && cpu != NOCPU) {
 		pc = pcpu_find(cpu);
 		if (pc == NULL)
 			return (EINVAL);
 		iv = arg;
 		sx_xlock(&intr_table_lock);
 		iv->iv_mid = pc->pc_mid;
 		iv->iv_ic->ic_assign(iv);
 		sx_xunlock(&intr_table_lock);
 	}
 	return (0);
 #else
 	return (EOPNOTSUPP);
 #endif
 }
 
 static void
 intr_execute_handlers(void *cookie)
 {
 	struct intr_vector *iv;
 
 	iv = cookie;
 	if (__predict_false(intr_event_handle(iv->iv_event, NULL) != 0))
 		intr_stray_vector(iv);
 }
 
 int
 intr_controller_register(int vec, const struct intr_controller *ic,
     void *icarg)
 {
 	struct intr_event *ie;
 	struct intr_vector *iv;
 	int error;
 
 	if (vec < 0 || vec >= IV_MAX)
 		return (EINVAL);
 	sx_xlock(&intr_table_lock);
 	iv = &intr_vectors[vec];
 	ie = iv->iv_event;
 	sx_xunlock(&intr_table_lock);
 	if (ie != NULL)
 		return (EEXIST);
 	error = intr_event_create(&ie, iv, 0, vec, NULL, ic->ic_clear,
 	    ic->ic_clear, intr_assign_cpu, "vec%d:", vec);
 	if (error != 0)
 		return (error);
 	sx_xlock(&intr_table_lock);
 	if (iv->iv_event != NULL) {
 		sx_xunlock(&intr_table_lock);
 		intr_event_destroy(ie);
 		return (EEXIST);
 	}
 	iv->iv_ic = ic;
 	iv->iv_icarg = icarg;
 	iv->iv_event = ie;
 	iv->iv_mid = PCPU_GET(mid);
 	sx_xunlock(&intr_table_lock);
 	return (0);
 }
 
 int
 inthand_add(const char *name, int vec, driver_filter_t *filt,
     driver_intr_t *handler, void *arg, int flags, void **cookiep)
 {
 	const struct intr_controller *ic;
 	struct intr_event *ie;
 	struct intr_handler *ih;
 	struct intr_vector *iv;
 	int error, filter;
 
 	if (vec < 0 || vec >= IV_MAX)
 		return (EINVAL);
 	/*
 	 * INTR_BRIDGE filters/handlers are special purpose only, allowing
 	 * them to be shared just would complicate things unnecessarily.
 	 */
 	if ((flags & INTR_BRIDGE) != 0 && (flags & INTR_EXCL) == 0)
 		return (EINVAL);
 	sx_xlock(&intr_table_lock);
 	iv = &intr_vectors[vec];
 	ic = iv->iv_ic;
 	ie = iv->iv_event;
 	sx_xunlock(&intr_table_lock);
 	if (ic == NULL || ie == NULL)
 		return (EINVAL);
 	error = intr_event_add_handler(ie, name, filt, handler, arg,
 	    intr_priority(flags), flags, cookiep);
 	if (error != 0)
 		return (error);
 	sx_xlock(&intr_table_lock);
 	/* Disable the interrupt while we fiddle with it. */
 	ic->ic_disable(iv);
 	iv->iv_refcnt++;
 	if (iv->iv_refcnt == 1)
 		intr_setup((flags & INTR_BRIDGE) != 0 ? PIL_BRIDGE :
 		    filt != NULL ? PIL_FILTER : PIL_ITHREAD, intr_fast,
 		    vec, intr_execute_handlers, iv);
 	else if (filt != NULL) {
 		/*
 		 * Check if we need to upgrade from PIL_ITHREAD to PIL_FILTER.
 		 * Given that apart from the on-board SCCs and UARTs shared
 		 * interrupts are rather uncommon on sparc64 this sould be
 		 * pretty rare in practice.
 		 */
 		filter = 0;
 		TAILQ_FOREACH(ih, &ie->ie_handlers, ih_next) {
 			if (ih->ih_filter != NULL && ih->ih_filter != filt) {
 				filter = 1;
 				break;
 			}
 		}
 		if (filter == 0)
 			intr_setup(PIL_FILTER, intr_fast, vec,
 			    intr_execute_handlers, iv);
 	}
 	intr_stray_count[vec] = 0;
 	intrcnt_updatename(vec, ie->ie_fullname, 0);
 #ifdef SMP
 	if (assign_cpu)
 		intr_assign_next_cpu(iv);
 #endif
 	ic->ic_enable(iv);
 	/* Ensure the interrupt is cleared, it might have triggered before. */
 	if (ic->ic_clear != NULL)
 		ic->ic_clear(iv);
 	sx_xunlock(&intr_table_lock);
 	return (0);
 }
 
 int
 inthand_remove(int vec, void *cookie)
 {
 	struct intr_vector *iv;
 	int error;
 
 	if (vec < 0 || vec >= IV_MAX)
 		return (EINVAL);
 	error = intr_event_remove_handler(cookie);
 	if (error == 0) {
 		/*
 		 * XXX: maybe this should be done regardless of whether
 		 * intr_event_remove_handler() succeeded?
 		 */
 		sx_xlock(&intr_table_lock);
 		iv = &intr_vectors[vec];
 		iv->iv_refcnt--;
 		if (iv->iv_refcnt == 0) {
 			/*
 			 * Don't disable the interrupt for now, so that
 			 * stray interrupts get detected...
 			 */
 			intr_setup(PIL_LOW, intr_fast, vec,
 			    intr_stray_vector, iv);
 		}
 		sx_xunlock(&intr_table_lock);
 	}
 	return (error);
 }
 
 /* Add a description to an active interrupt handler. */
 int
 intr_describe(int vec, void *ih, const char *descr)
 {
 	struct intr_vector *iv;
 	int error;
 
 	if (vec < 0 || vec >= IV_MAX)
 		return (EINVAL);
 	sx_xlock(&intr_table_lock);
 	iv = &intr_vectors[vec];
 	if (iv == NULL) {
 		sx_xunlock(&intr_table_lock);
 		return (EINVAL);
 	}
 	error = intr_event_describe_handler(iv->iv_event, ih, descr);
 	if (error) {
 		sx_xunlock(&intr_table_lock);
 		return (error);
 	}
 	intrcnt_updatename(vec, iv->iv_event->ie_fullname, 0);
 	sx_xunlock(&intr_table_lock);
 	return (error);
 }
 
 #ifdef SMP
 /*
  * Support for balancing interrupt sources across CPUs.  For now we just
  * allocate CPUs round-robin.
  */
 
-/* The BSP is always a valid target. */
-static cpumask_t intr_cpus = (1 << 0);
+static cpuset_t intr_cpus;
 static int current_cpu;
 
 static void
 intr_assign_next_cpu(struct intr_vector *iv)
 {
 	struct pcpu *pc;
 
 	sx_assert(&intr_table_lock, SA_XLOCKED);
 
 	/*
 	 * Assign this source to a CPU in a round-robin fashion.
 	 */
 	pc = pcpu_find(current_cpu);
 	if (pc == NULL)
 		return;
 	iv->iv_mid = pc->pc_mid;
 	iv->iv_ic->ic_assign(iv);
 	do {
 		current_cpu++;
 		if (current_cpu > mp_maxid)
 			current_cpu = 0;
-	} while (!(intr_cpus & (1 << current_cpu)));
+	} while (!CPU_ISSET(current_cpu, &intr_cpus));
 }
 
 /* Attempt to bind the specified IRQ to the specified CPU. */
 int
 intr_bind(int vec, u_char cpu)
 {
 	struct intr_vector *iv;
 	int error;
 
 	if (vec < 0 || vec >= IV_MAX)
 		return (EINVAL);
 	sx_xlock(&intr_table_lock);
 	iv = &intr_vectors[vec];
 	if (iv == NULL) {
 		sx_xunlock(&intr_table_lock);
 		return (EINVAL);
 	}
 	error = intr_event_bind(iv->iv_event, cpu);
 	sx_xunlock(&intr_table_lock);
 	return (error);
 }
 
 /*
  * Add a CPU to our mask of valid CPUs that can be destinations of
  * interrupts.
  */
 void
 intr_add_cpu(u_int cpu)
 {
 
 	if (cpu >= MAXCPU)
 		panic("%s: Invalid CPU ID", __func__);
 	if (bootverbose)
 		printf("INTR: Adding CPU %d as a target\n", cpu);
 
-	intr_cpus |= (1 << cpu);
+	CPU_SET(cpu, &intr_cpus);
 }
 
 /*
  * Distribute all the interrupt sources among the available CPUs once the
  * APs have been launched.
  */
 static void
 intr_shuffle_irqs(void *arg __unused)
 {
 	struct pcpu *pc;
 	struct intr_vector *iv;
 	int i;
+
+	/* The BSP is always a valid target. */
+	CPU_SET(0, &intr_cpus);
 
 	/* Don't bother on UP. */
 	if (mp_ncpus == 1)
 		return;
 
 	sx_xlock(&intr_table_lock);
 	assign_cpu = 1;
 	for (i = 0; i < IV_MAX; i++) {
 		iv = &intr_vectors[i];
 		if (iv != NULL && iv->iv_refcnt > 0) {
 			/*
 			 * If this event is already bound to a CPU,
 			 * then assign the source to that CPU instead
 			 * of picking one via round-robin.
 			 */
 			if (iv->iv_event->ie_cpu != NOCPU &&
 			    (pc = pcpu_find(iv->iv_event->ie_cpu)) != NULL) {
 				iv->iv_mid = pc->pc_mid;
 				iv->iv_ic->ic_assign(iv);
 			} else
 				intr_assign_next_cpu(iv);
 		}
 	}
 	sx_xunlock(&intr_table_lock);
 }
 SYSINIT(intr_shuffle_irqs, SI_SUB_SMP, SI_ORDER_SECOND, intr_shuffle_irqs,
     NULL);
 #endif
Index: head/sys/sparc64/sparc64/mp_exception.S
===================================================================
--- head/sys/sparc64/sparc64/mp_exception.S	(revision 222812)
+++ head/sys/sparc64/sparc64/mp_exception.S	(revision 222813)
@@ -1,300 +1,310 @@
 /*-
  * Copyright (c) 2002 Jake Burkholder.
  * All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
  * 1. Redistributions of source code must retain the above copyright
  *    notice, this list of conditions and the following disclaimer.
  * 2. Redistributions in binary form must reproduce the above copyright
  *    notice, this list of conditions and the following disclaimer in the
  *    documentation and/or other materials provided with the distribution.
  *
  * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
  * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
  * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
  * ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
  * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
  * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
  * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
  * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
  * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
  * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
  * SUCH DAMAGE.
  */
 
 #include <machine/asm.h>
 __FBSDID("$FreeBSD$");
 
 #include <machine/asi.h>
 #include <machine/asmacros.h>
 #include <machine/cache.h>
 #include <machine/ktr.h>
 #include <machine/pstate.h>
 
 #include "assym.s"
 
 	.register	%g2, #ignore
 	.register	%g3, #ignore
 
-#define	IPI_DONE(r1, r2, r3, r4) \
-	lduw	[PCPU(CPUMASK)], r4 ;  \
-	ATOMIC_CLEAR_INT(r1, r2, r3, r4)
+#define	IPI_DONE(r1, r2, r3, r4, r5)					\
+	lduw	[PCPU(CPUID)], r2 ;					\
+	mov	_NCPUBITS, r3 ;						\
+	mov	%g0, %y ;						\
+	udiv	r2, r3, r4 ;						\
+	srl	r4, 0, r5 ;						\
+	sllx	r5, PTR_SHIFT, r5 ;					\
+	add	r1, r5, r1 ;						\
+	smul	r4, r3, r3 ;						\
+	sub	r2, r3, r3 ;						\
+	mov	1, r4 ;							\
+	sllx	r4, r3, r4 ;						\
+	ATOMIC_CLEAR_LONG(r1, r2, r3, r4)
 
 /*
  * Invalidate a physical page in the data cache.  For UltraSPARC I and II.
  */
 ENTRY(tl_ipi_spitfire_dcache_page_inval)
 #if KTR_COMPILE & KTR_SMP
 	CATR(KTR_SMP, "tl_ipi_spitfire_dcache_page_inval: pa=%#lx"
 	    , %g1, %g2, %g3, 7, 8, 9)
 	ldx	[%g5 + ICA_PA], %g2
 	stx	%g2, [%g1 + KTR_PARM1]
 9:
 #endif
 
 	ldx	[%g5 + ICA_PA], %g6
 	srlx	%g6, PAGE_SHIFT - DC_TAG_SHIFT, %g6
 
 	lduw	[PCPU(CACHE) + DC_SIZE], %g3
 	lduw	[PCPU(CACHE) + DC_LINESIZE], %g4
 	sub	%g3, %g4, %g2
 
 1:	ldxa	[%g2] ASI_DCACHE_TAG, %g1
 	srlx	%g1, DC_VALID_SHIFT, %g3
 	andcc	%g3, DC_VALID_MASK, %g0
 	bz,pt	%xcc, 2f
 	 set	DC_TAG_MASK, %g3
 	sllx	%g3, DC_TAG_SHIFT, %g3
 	and	%g1, %g3, %g1
 	cmp	%g1, %g6
 	bne,a,pt %xcc, 2f
 	 nop
 	stxa	%g1, [%g2] ASI_DCACHE_TAG
 	membar	#Sync
 
 2:	brgz,pt	%g2, 1b
 	 sub	%g2, %g4, %g2
 
-	IPI_DONE(%g5, %g1, %g2, %g3)
+	IPI_DONE(%g5, %g1, %g2, %g3, %g4)
 	retry
 END(tl_ipi_spitfire_dcache_page_inval)
 
 /*
  * Invalidate a physical page in the instruction cache.  For UltraSPARC I and
  * II.
  */
 ENTRY(tl_ipi_spitfire_icache_page_inval)
 #if KTR_COMPILE & KTR_SMP
 	CATR(KTR_SMP, "tl_ipi_spitfire_icache_page_inval: pa=%#lx"
 	    , %g1, %g2, %g3, 7, 8, 9)
 	ldx	[%g5 + ICA_PA], %g2
 	stx	%g2, [%g1 + KTR_PARM1]
 9:
 #endif
 
 	ldx	[%g5 + ICA_PA], %g6
 	srlx	%g6, PAGE_SHIFT - IC_TAG_SHIFT, %g6
 
 	lduw	[PCPU(CACHE) + IC_SIZE], %g3
 	lduw	[PCPU(CACHE) + IC_LINESIZE], %g4
 	sub	%g3, %g4, %g2
 
 1:	ldda	[%g2] ASI_ICACHE_TAG, %g0 /*, %g1 */
 	srlx	%g1, IC_VALID_SHIFT, %g3
 	andcc	%g3, IC_VALID_MASK, %g0
 	bz,pt	%xcc, 2f
 	 set	IC_TAG_MASK, %g3
 	sllx	%g3, IC_TAG_SHIFT, %g3
 	and	%g1, %g3, %g1
 	cmp	%g1, %g6
 	bne,a,pt %xcc, 2f
 	 nop
 	stxa	%g1, [%g2] ASI_ICACHE_TAG
 	membar	#Sync
 
 2:	brgz,pt	%g2, 1b
 	 sub	%g2, %g4, %g2
 
-	IPI_DONE(%g5, %g1, %g2, %g3)
+	IPI_DONE(%g5, %g1, %g2, %g3, %g4)
 	retry
 END(tl_ipi_spitfire_icache_page_inval)
 
 /*
  * Invalidate a physical page in the data cache.  For UltraSPARC III.
  */
 ENTRY(tl_ipi_cheetah_dcache_page_inval)
 #if KTR_COMPILE & KTR_SMP
 	CATR(KTR_SMP, "tl_ipi_cheetah_dcache_page_inval: pa=%#lx"
 	    , %g1, %g2, %g3, 7, 8, 9)
 	ldx	[%g5 + ICA_PA], %g2
 	stx	%g2, [%g1 + KTR_PARM1]
 9:
 #endif
 
 	ldx	[%g5 + ICA_PA], %g1
 
 	set	PAGE_SIZE, %g2
 	add	%g1, %g2, %g3
 
 	lduw	[PCPU(CACHE) + DC_LINESIZE], %g2
 
 1:	stxa	%g0, [%g1] ASI_DCACHE_INVALIDATE
 	membar	#Sync
 
 	add	%g1, %g2, %g1
 	cmp	%g1, %g3
 	blt,a,pt %xcc, 1b
 	 nop
 
-	IPI_DONE(%g5, %g1, %g2, %g3)
+	IPI_DONE(%g5, %g1, %g2, %g3, %g4)
 	retry
 END(tl_ipi_cheetah_dcache_page_inval)
 
 /*
  * Trigger a softint at the desired level.
  */
 ENTRY(tl_ipi_level)
 #if KTR_COMPILE & KTR_SMP
 	CATR(KTR_SMP, "tl_ipi_level: cpuid=%d mid=%d d1=%#lx d2=%#lx"
 	    , %g1, %g2, %g3, 7, 8, 9)
 	lduw	[PCPU(CPUID)], %g2
 	stx	%g2, [%g1 + KTR_PARM1]
 	lduw	[PCPU(MID)], %g2
 	stx	%g2, [%g1 + KTR_PARM2]
 	stx	%g4, [%g1 + KTR_PARM3]
 	stx	%g5, [%g1 + KTR_PARM4]
 9:
 #endif
 
 	mov	1, %g1
 	sllx	%g1, %g5, %g1
 	wr	%g1, 0, %set_softint
 	retry
 END(tl_ipi_level)
 
 /*
  * Demap a page from the dtlb and/or itlb.
  */
 ENTRY(tl_ipi_tlb_page_demap)
 #if KTR_COMPILE & KTR_SMP
 	CATR(KTR_SMP, "ipi_tlb_page_demap: pm=%p va=%#lx"
 	    , %g1, %g2, %g3, 7, 8, 9)
 	ldx	[%g5 + ITA_PMAP], %g2
 	stx	%g2, [%g1 + KTR_PARM1]
 	ldx	[%g5 + ITA_VA], %g2
 	stx	%g2, [%g1 + KTR_PARM2]
 9:
 #endif
 
 	ldx	[%g5 + ITA_PMAP], %g1
 
 	SET(kernel_pmap_store, %g3, %g2)
 	mov	TLB_DEMAP_NUCLEUS | TLB_DEMAP_PAGE, %g3
 
 	cmp	%g1, %g2
 	movne	%xcc, TLB_DEMAP_PRIMARY | TLB_DEMAP_PAGE, %g3
 
 	ldx	[%g5 + ITA_VA], %g2
 	or	%g2, %g3, %g2
 
 	sethi	%hi(KERNBASE), %g3
 	stxa	%g0, [%g2] ASI_DMMU_DEMAP
 	stxa	%g0, [%g2] ASI_IMMU_DEMAP
 	flush	%g3
 
-	IPI_DONE(%g5, %g1, %g2, %g3)
+	IPI_DONE(%g5, %g1, %g2, %g3, %g4)
 	retry
 END(tl_ipi_tlb_page_demap)
 
 /*
  * Demap a range of pages from the dtlb and itlb.
  */
 ENTRY(tl_ipi_tlb_range_demap)
 #if KTR_COMPILE & KTR_SMP
 	CATR(KTR_SMP, "ipi_tlb_range_demap: pm=%p start=%#lx end=%#lx"
 	    , %g1, %g2, %g3, 7, 8, 9)
 	ldx	[%g5 + ITA_PMAP], %g2
 	stx	%g2, [%g1 + KTR_PARM1]
 	ldx	[%g5 + ITA_START], %g2
 	stx	%g2, [%g1 + KTR_PARM2]
 	ldx	[%g5 + ITA_END], %g2
 	stx	%g2, [%g1 + KTR_PARM3]
 9:
 #endif
 
 	ldx	[%g5 + ITA_PMAP], %g1
 
 	SET(kernel_pmap_store, %g3, %g2)
 	mov	TLB_DEMAP_NUCLEUS | TLB_DEMAP_PAGE, %g3
 
 	cmp	%g1, %g2
 	movne	%xcc, TLB_DEMAP_PRIMARY | TLB_DEMAP_PAGE, %g3
 
 	ldx	[%g5 + ITA_START], %g1
 	ldx	[%g5 + ITA_END], %g2
 
 1:	or	%g1, %g3, %g4
 	sethi	%hi(KERNBASE), %g6
 	stxa	%g0, [%g4] ASI_DMMU_DEMAP
 	stxa	%g0, [%g4] ASI_IMMU_DEMAP
 	flush	%g6
 
 	set	PAGE_SIZE, %g6
 	add	%g1, %g6, %g1
 	cmp	%g1, %g2
 	blt,a,pt %xcc, 1b
 	 nop
 
-	IPI_DONE(%g5, %g1, %g2, %g3)
+	IPI_DONE(%g5, %g1, %g2, %g3, %g4)
 	retry
 END(tl_ipi_tlb_range_demap)
 
 /*
  * Demap the primary context from the dtlb and itlb.
  */
 ENTRY(tl_ipi_tlb_context_demap)
 #if KTR_COMPILE & KTR_SMP
 	CATR(KTR_SMP, "tl_ipi_tlb_context_demap: pm=%p va=%#lx"
 	    , %g1, %g2, %g3, 7, 8, 9)
 	ldx	[%g5 + ITA_PMAP], %g2
 	stx	%g2, [%g1 + KTR_PARM1]
 	ldx	[%g5 + ITA_VA], %g2
 	stx	%g2, [%g1 + KTR_PARM2]
 9:
 #endif
 
 	mov	TLB_DEMAP_PRIMARY | TLB_DEMAP_CONTEXT, %g1
 	sethi	%hi(KERNBASE), %g3
 	stxa	%g0, [%g1] ASI_DMMU_DEMAP
 	stxa	%g0, [%g1] ASI_IMMU_DEMAP
 	flush	%g3
 
-	IPI_DONE(%g5, %g1, %g2, %g3)
+	IPI_DONE(%g5, %g1, %g2, %g3, %g4)
 	retry
 END(tl_ipi_tlb_context_demap)
 
 /*
  * Read %stick.
  */
 ENTRY(tl_ipi_stick_rd)
 	ldx	[%g5 + IRA_VAL], %g1
 	rd	%asr24, %g2
 	stx	%g2, [%g1]
 
-	IPI_DONE(%g5, %g1, %g2, %g3)
+	IPI_DONE(%g5, %g1, %g2, %g3, %g4)
 	retry
 END(tl_ipi_stick_rd)
 
 /*
  * Read %tick.
  */
 ENTRY(tl_ipi_tick_rd)
 	ldx	[%g5 + IRA_VAL], %g1
 	rd	%tick, %g2
 	stx	%g2, [%g1]
 
-	IPI_DONE(%g5, %g1, %g2, %g3)
+	IPI_DONE(%g5, %g1, %g2, %g3, %g4)
 	retry
 END(tl_ipi_tick_rd)
Index: head/sys/sparc64/sparc64/mp_machdep.c
===================================================================
--- head/sys/sparc64/sparc64/mp_machdep.c	(revision 222812)
+++ head/sys/sparc64/sparc64/mp_machdep.c	(revision 222813)
@@ -1,828 +1,848 @@
 /*-
  * Copyright (c) 1997 Berkeley Software Design, Inc. All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
  * 1. Redistributions of source code must retain the above copyright
  *    notice, this list of conditions and the following disclaimer.
  * 2. Redistributions in binary form must reproduce the above copyright
  *    notice, this list of conditions and the following disclaimer in the
  *    documentation and/or other materials provided with the distribution.
  * 3. Berkeley Software Design Inc's name may not be used to endorse or
  *    promote products derived from this software without specific prior
  *    written permission.
  *
  * THIS SOFTWARE IS PROVIDED BY BERKELEY SOFTWARE DESIGN INC ``AS IS'' AND
  * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
  * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
  * ARE DISCLAIMED.  IN NO EVENT SHALL BERKELEY SOFTWARE DESIGN INC BE LIABLE
  * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
  * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
  * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
  * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
  * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
  * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
  * SUCH DAMAGE.
  *
  * from BSDI: locore.s,v 1.36.2.15 1999/08/23 22:34:41 cp Exp
  */
 /*-
  * Copyright (c) 2002 Jake Burkholder.
  * Copyright (c) 2007 - 2010 Marius Strobl <marius@FreeBSD.org>
  * All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
  * 1. Redistributions of source code must retain the above copyright
  *    notice, this list of conditions and the following disclaimer.
  * 2. Redistributions in binary form must reproduce the above copyright
  *    notice, this list of conditions and the following disclaimer in the
  *    documentation and/or other materials provided with the distribution.
  *
  * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
  * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
  * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
  * ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
  * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
  * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
  * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
  * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
  * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
  * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
  * SUCH DAMAGE.
  */
 
 #include <sys/cdefs.h>
 __FBSDID("$FreeBSD$");
 
 #include <sys/param.h>
 #include <sys/systm.h>
 #include <sys/lock.h>
 #include <sys/kdb.h>
 #include <sys/kernel.h>
 #include <sys/ktr.h>
 #include <sys/mutex.h>
 #include <sys/pcpu.h>
 #include <sys/proc.h>
 #include <sys/sched.h>
 #include <sys/smp.h>
 
 #include <vm/vm.h>
 #include <vm/vm_param.h>
 #include <vm/pmap.h>
 #include <vm/vm_kern.h>
 #include <vm/vm_extern.h>
 #include <vm/vm_map.h>
 
 #include <dev/ofw/openfirm.h>
 
 #include <machine/asi.h>
 #include <machine/atomic.h>
 #include <machine/bus.h>
 #include <machine/cpu.h>
 #include <machine/md_var.h>
 #include <machine/metadata.h>
 #include <machine/ofw_machdep.h>
 #include <machine/pcb.h>
 #include <machine/smp.h>
 #include <machine/tick.h>
 #include <machine/tlb.h>
 #include <machine/tsb.h>
 #include <machine/tte.h>
 #include <machine/ver.h>
 
 #define	SUNW_STARTCPU		"SUNW,start-cpu"
 #define	SUNW_STOPSELF		"SUNW,stop-self"
 
 static ih_func_t cpu_ipi_ast;
 static ih_func_t cpu_ipi_hardclock;
 static ih_func_t cpu_ipi_preempt;
 static ih_func_t cpu_ipi_stop;
 
 /*
  * Argument area used to pass data to non-boot processors as they start up.
  * This must be statically initialized with a known invalid CPU module ID,
  * since the other processors will use it before the boot CPU enters the
  * kernel.
  */
 struct	cpu_start_args cpu_start_args = { 0, -1, -1, 0, 0, 0 };
 struct	ipi_cache_args ipi_cache_args;
 struct	ipi_rd_args ipi_rd_args;
 struct	ipi_tlb_args ipi_tlb_args;
 struct	pcb stoppcbs[MAXCPU];
 
 struct	mtx ipi_mtx;
 
 cpu_ipi_selected_t *cpu_ipi_selected;
 cpu_ipi_single_t *cpu_ipi_single;
 
 static vm_offset_t mp_tramp;
 static u_int cpuid_to_mid[MAXCPU];
 static int isjbus;
-static volatile cpumask_t shutdown_cpus;
+static volatile cpuset_t shutdown_cpus;
 
 static void ap_count(phandle_t node, u_int mid, u_int cpu_impl);
 static void ap_start(phandle_t node, u_int mid, u_int cpu_impl);
 static void cpu_mp_unleash(void *v);
 static void foreach_ap(phandle_t node, void (*func)(phandle_t node,
     u_int mid, u_int cpu_impl));
 static void sun4u_startcpu(phandle_t cpu, void *func, u_long arg);
 
 static cpu_ipi_selected_t cheetah_ipi_selected;
 static cpu_ipi_single_t cheetah_ipi_single;
 static cpu_ipi_selected_t jalapeno_ipi_selected;
 static cpu_ipi_single_t jalapeno_ipi_single;
 static cpu_ipi_selected_t spitfire_ipi_selected;
 static cpu_ipi_single_t spitfire_ipi_single;
 
 SYSINIT(cpu_mp_unleash, SI_SUB_SMP, SI_ORDER_FIRST, cpu_mp_unleash, NULL);
 
 CTASSERT(MAXCPU <= IDR_CHEETAH_MAX_BN_PAIRS);
 CTASSERT(MAXCPU <= sizeof(u_int) * NBBY);
 CTASSERT(MAXCPU <= sizeof(int) * NBBY);
 
 void
 mp_init(u_int cpu_impl)
 {
 	struct tte *tp;
 	int i;
 
 	mp_tramp = (vm_offset_t)OF_claim(NULL, PAGE_SIZE, PAGE_SIZE);
 	if (mp_tramp == (vm_offset_t)-1)
 		panic("%s", __func__);
 	bcopy(mp_tramp_code, (void *)mp_tramp, mp_tramp_code_len);
 	*(vm_offset_t *)(mp_tramp + mp_tramp_tlb_slots) = kernel_tlb_slots;
 	*(vm_offset_t *)(mp_tramp + mp_tramp_func) = (vm_offset_t)mp_startup;
 	tp = (struct tte *)(mp_tramp + mp_tramp_code_len);
 	for (i = 0; i < kernel_tlb_slots; i++) {
 		tp[i].tte_vpn = TV_VPN(kernel_tlbs[i].te_va, TS_4M);
 		tp[i].tte_data = TD_V | TD_4M | TD_PA(kernel_tlbs[i].te_pa) |
 		    TD_L | TD_CP | TD_CV | TD_P | TD_W;
 	}
 	for (i = 0; i < PAGE_SIZE; i += sizeof(vm_offset_t))
 		flush(mp_tramp + i);
 
 	/*
 	 * On UP systems cpu_ipi_selected() can be called while
 	 * cpu_mp_start() wasn't so initialize these here.
 	 */
 	if (cpu_impl == CPU_IMPL_ULTRASPARCIIIi ||
 	    cpu_impl == CPU_IMPL_ULTRASPARCIIIip) {
 		isjbus = 1;
 		cpu_ipi_selected = jalapeno_ipi_selected;
 		cpu_ipi_single = jalapeno_ipi_single;
 	} else if (cpu_impl == CPU_IMPL_SPARC64V ||
 	    cpu_impl >= CPU_IMPL_ULTRASPARCIII) {
 		cpu_ipi_selected = cheetah_ipi_selected;
 		cpu_ipi_single = cheetah_ipi_single;
 	} else {
 		cpu_ipi_selected = spitfire_ipi_selected;
 		cpu_ipi_single = spitfire_ipi_single;
 	}
 }
 
 static void
 foreach_ap(phandle_t node, void (*func)(phandle_t node, u_int mid,
     u_int cpu_impl))
 {
 	char type[sizeof("cpu")];
 	phandle_t child;
 	u_int cpuid;
 	uint32_t cpu_impl;
 
 	/* There's no need to traverse the whole OFW tree twice. */
 	if (mp_maxid > 0 && mp_ncpus >= mp_maxid + 1)
 		return;
 
 	for (; node != 0; node = OF_peer(node)) {
 		child = OF_child(node);
 		if (child > 0)
 			foreach_ap(child, func);
 		else {
 			if (OF_getprop(node, "device_type", type,
 			    sizeof(type)) <= 0)
 				continue;
 			if (strcmp(type, "cpu") != 0)
 				continue;
 			if (OF_getprop(node, "implementation#", &cpu_impl,
 			    sizeof(cpu_impl)) <= 0)
 				panic("%s: couldn't determine CPU "
 				    "implementation", __func__);
 			if (OF_getprop(node, cpu_cpuid_prop(cpu_impl), &cpuid,
 			    sizeof(cpuid)) <= 0)
 				panic("%s: couldn't determine CPU module ID",
 				    __func__);
 			if (cpuid == PCPU_GET(mid))
 				continue;
 			(*func)(node, cpuid, cpu_impl);
 		}
 	}
 }
 
 /*
  * Probe for other CPUs.
  */
 void
 cpu_mp_setmaxid()
 {
 
-	all_cpus = 1 << curcpu;
+	CPU_SETOF(curcpu, &all_cpus);
 	mp_ncpus = 1;
 	mp_maxid = 0;
 
 	foreach_ap(OF_child(OF_peer(0)), ap_count);
 }
 
 static void
 ap_count(phandle_t node __unused, u_int mid __unused, u_int cpu_impl __unused)
 {
 
 	mp_maxid++;
 }
 
 int
 cpu_mp_probe(void)
 {
 
 	return (mp_maxid > 0);
 }
 
 struct cpu_group *
 cpu_topo(void)
 {
 
 	return (smp_topo_none());
 }
 
 static void
 sun4u_startcpu(phandle_t cpu, void *func, u_long arg)
 {
 	static struct {
 		cell_t	name;
 		cell_t	nargs;
 		cell_t	nreturns;
 		cell_t	cpu;
 		cell_t	func;
 		cell_t	arg;
 	} args = {
 		(cell_t)SUNW_STARTCPU,
 		3,
 	};
 
 	args.cpu = cpu;
 	args.func = (cell_t)func;
 	args.arg = (cell_t)arg;
 	ofw_entry(&args);
 }
 
 /*
  * Fire up any non-boot processors.
  */
 void
 cpu_mp_start(void)
 {
+	cpuset_t ocpus;
 
 	mtx_init(&ipi_mtx, "ipi", NULL, MTX_SPIN);
 
 	intr_setup(PIL_AST, cpu_ipi_ast, -1, NULL, NULL);
 	intr_setup(PIL_RENDEZVOUS, (ih_func_t *)smp_rendezvous_action,
 	    -1, NULL, NULL);
 	intr_setup(PIL_STOP, cpu_ipi_stop, -1, NULL, NULL);
 	intr_setup(PIL_PREEMPT, cpu_ipi_preempt, -1, NULL, NULL);
 	intr_setup(PIL_HARDCLOCK, cpu_ipi_hardclock, -1, NULL, NULL);
 
 	cpuid_to_mid[curcpu] = PCPU_GET(mid);
 
 	foreach_ap(OF_child(OF_peer(0)), ap_start);
 	KASSERT(!isjbus || mp_ncpus <= IDR_JALAPENO_MAX_BN_PAIRS,
 	    ("%s: can only IPI a maximum of %d JBus-CPUs",
 	    __func__, IDR_JALAPENO_MAX_BN_PAIRS));
-	PCPU_SET(other_cpus, all_cpus & ~(1 << curcpu));
+	ocpus = all_cpus;
+	CPU_CLR(curcpu, &ocpus);
+	PCPU_SET(other_cpus, ocpus);
 	smp_active = 1;
 }
 
 static void
 ap_start(phandle_t node, u_int mid, u_int cpu_impl)
 {
 	volatile struct cpu_start_args *csa;
 	struct pcpu *pc;
 	register_t s;
 	vm_offset_t va;
 	u_int cpuid;
 	uint32_t clock;
 
 	if (mp_ncpus > MAXCPU)
 		return;
 
 	if (OF_getprop(node, "clock-frequency", &clock, sizeof(clock)) <= 0)
 		panic("%s: couldn't determine CPU frequency", __func__);
 	if (clock != PCPU_GET(clock))
 		tick_et_use_stick = 1;
 
 	csa = &cpu_start_args;
 	csa->csa_state = 0;
 	sun4u_startcpu(node, (void *)mp_tramp, 0);
 	s = intr_disable();
 	while (csa->csa_state != CPU_TICKSYNC)
 		;
 	membar(StoreLoad);
 	csa->csa_tick = rd(tick);
 	if (cpu_impl == CPU_IMPL_SPARC64V ||
 	    cpu_impl >= CPU_IMPL_ULTRASPARCIII) {
 		while (csa->csa_state != CPU_STICKSYNC)
 			;
 		membar(StoreLoad);
 		csa->csa_stick = rdstick();
 	}
 	while (csa->csa_state != CPU_INIT)
 		;
 	csa->csa_tick = csa->csa_stick = 0;
 	intr_restore(s);
 
 	cpuid = mp_ncpus++;
 	cpuid_to_mid[cpuid] = mid;
 	cpu_identify(csa->csa_ver, clock, cpuid);
 
 	va = kmem_alloc(kernel_map, PCPU_PAGES * PAGE_SIZE);
 	pc = (struct pcpu *)(va + (PCPU_PAGES * PAGE_SIZE)) - 1;
 	pcpu_init(pc, cpuid, sizeof(*pc));
 	dpcpu_init((void *)kmem_alloc(kernel_map, DPCPU_SIZE), cpuid);
 	pc->pc_addr = va;
 	pc->pc_clock = clock;
 	pc->pc_impl = cpu_impl;
 	pc->pc_mid = mid;
 	pc->pc_node = node;
 
 	cache_init(pc);
 
-	all_cpus |= 1 << cpuid;
+	CPU_SET(cpuid, &all_cpus);
 	intr_add_cpu(cpuid);
 }
 
 void
 cpu_mp_announce(void)
 {
 
 }
 
 static void
 cpu_mp_unleash(void *v)
 {
 	volatile struct cpu_start_args *csa;
 	struct pcpu *pc;
 	register_t s;
 	vm_offset_t va;
 	vm_paddr_t pa;
 	u_int ctx_inc;
 	u_int ctx_min;
 	int i;
 
 	ctx_min = TLB_CTX_USER_MIN;
 	ctx_inc = (TLB_CTX_USER_MAX - 1) / mp_ncpus;
 	csa = &cpu_start_args;
 	csa->csa_count = mp_ncpus;
 	STAILQ_FOREACH(pc, &cpuhead, pc_allcpu) {
 		pc->pc_tlb_ctx = ctx_min;
 		pc->pc_tlb_ctx_min = ctx_min;
 		pc->pc_tlb_ctx_max = ctx_min + ctx_inc;
 		ctx_min += ctx_inc;
 
 		if (pc->pc_cpuid == curcpu)
 			continue;
 		KASSERT(pc->pc_idlethread != NULL,
 		    ("%s: idlethread", __func__));
 		pc->pc_curthread = pc->pc_idlethread;
 		pc->pc_curpcb = pc->pc_curthread->td_pcb;
 		for (i = 0; i < PCPU_PAGES; i++) {
 			va = pc->pc_addr + i * PAGE_SIZE;
 			pa = pmap_kextract(va);
 			if (pa == 0)
 				panic("%s: pmap_kextract", __func__);
 			csa->csa_ttes[i].tte_vpn = TV_VPN(va, TS_8K);
 			csa->csa_ttes[i].tte_data = TD_V | TD_8K | TD_PA(pa) |
 			    TD_L | TD_CP | TD_CV | TD_P | TD_W;
 		}
 		csa->csa_state = 0;
 		csa->csa_pcpu = pc->pc_addr;
 		csa->csa_mid = pc->pc_mid;
 		s = intr_disable();
 		while (csa->csa_state != CPU_BOOTSTRAP)
 			;
 		intr_restore(s);
 	}
 
 	membar(StoreLoad);
 	csa->csa_count = 0;
 	smp_started = 1;
 }
 
 void
 cpu_mp_bootstrap(struct pcpu *pc)
 {
+	cpuset_t ocpus;
 	volatile struct cpu_start_args *csa;
 
 	csa = &cpu_start_args;
 
 	/* Do CPU-specific initialization. */
 	if (pc->pc_impl == CPU_IMPL_SPARC64V ||
 	    pc->pc_impl >= CPU_IMPL_ULTRASPARCIII)
 		cheetah_init(pc->pc_impl);
 	/*
 	 * Enable the caches.  Note that this may include applying workarounds.
 	 */
 	cache_enable(pc->pc_impl);
 
 	/*
 	 * Clear (S)TICK timer(s) (including NPT) and ensure they are stopped.
 	 */
 	tick_clear(pc->pc_impl);
 	tick_stop(pc->pc_impl);
 
 	/* Set the kernel context. */
 	pmap_set_kctx();
 
 	/* Lock the kernel TSB in the TLB if necessary. */
 	if (tsb_kernel_ldd_phys == 0)
 		pmap_map_tsb();
 
 	/*
 	 * Flush all non-locked TLB entries possibly left over by the
 	 * firmware.
 	 */
 	tlb_flush_nonlocked();
 
 	/* Initialize global registers. */
 	cpu_setregs(pc);
 
 	/*
 	 * Enable interrupts.
 	 * Note that the PIL will be lowered indirectly via sched_throw(NULL)
 	 * when fake spinlock held by the idle thread eventually is released.
 	 */
 	wrpr(pstate, 0, PSTATE_KERNEL);
 
 	smp_cpus++;
 	KASSERT(curthread != NULL, ("%s: curthread", __func__));
-	PCPU_SET(other_cpus, all_cpus & ~(1 << curcpu));
+	ocpus = all_cpus;
+	CPU_CLR(curcpu, &ocpus);
+	PCPU_SET(other_cpus, ocpus);
 	printf("SMP: AP CPU #%d Launched!\n", curcpu);
 
 	csa->csa_count--;
 	membar(StoreLoad);
 	csa->csa_state = CPU_BOOTSTRAP;
 	while (csa->csa_count != 0)
 		;
 
 	/* Start per-CPU event timers. */
 	cpu_initclocks_ap();
 
 	/* Ok, now enter the scheduler. */
 	sched_throw(NULL);
 }
 
 void
 cpu_mp_shutdown(void)
 {
+	cpuset_t cpus;
 	int i;
 
 	critical_enter();
 	shutdown_cpus = PCPU_GET(other_cpus);
-	if (stopped_cpus != PCPU_GET(other_cpus))	/* XXX */
-		stop_cpus(stopped_cpus ^ PCPU_GET(other_cpus));
+	cpus = shutdown_cpus;
+
+	/* XXX: Stop all the CPUs which aren't already stopped. */
+	if (CPU_CMP(&stopped_cpus, &cpus)) {
+
+		/* pc_other_cpus is just a flat "on" mask without curcpu. */
+		CPU_NAND(&cpus, &stopped_cpus);
+		stop_cpus(cpus);
+	}
 	i = 0;
-	while (shutdown_cpus != 0) {
+	while (!CPU_EMPTY(&shutdown_cpus)) {
 		if (i++ > 100000) {
 			printf("timeout shutting down CPUs.\n");
 			break;
 		}
 	}
 	critical_exit();
 }
 
 static void
 cpu_ipi_ast(struct trapframe *tf)
 {
 
 }
 
 static void
 cpu_ipi_stop(struct trapframe *tf)
 {
+	cpuset_t tcmask;
 
 	CTR2(KTR_SMP, "%s: stopped %d", __func__, curcpu);
+	sched_pin();
 	savectx(&stoppcbs[curcpu]);
-	atomic_set_acq_int(&stopped_cpus, PCPU_GET(cpumask));
-	while ((started_cpus & PCPU_GET(cpumask)) == 0) {
-		if ((shutdown_cpus & PCPU_GET(cpumask)) != 0) {
-			atomic_clear_int(&shutdown_cpus, PCPU_GET(cpumask));
+	tcmask = PCPU_GET(cpumask);
+	CPU_OR_ATOMIC(&stopped_cpus, &tcmask);
+	while (!CPU_OVERLAP(&started_cpus, &tcmask)) {
+		if (CPU_OVERLAP(&shutdown_cpus, &tcmask)) {
+			CPU_NAND_ATOMIC(&shutdown_cpus, &tcmask);
 			(void)intr_disable();
 			for (;;)
 				;
 		}
 	}
-	atomic_clear_rel_int(&started_cpus, PCPU_GET(cpumask));
-	atomic_clear_rel_int(&stopped_cpus, PCPU_GET(cpumask));
+	CPU_NAND_ATOMIC(&started_cpus, &tcmask);
+	CPU_NAND_ATOMIC(&stopped_cpus, &tcmask);
+	sched_unpin();
 	CTR2(KTR_SMP, "%s: restarted %d", __func__, curcpu);
 }
 
 static void
 cpu_ipi_preempt(struct trapframe *tf)
 {
 
 	sched_preempt(curthread);
 }
 
 static void
 cpu_ipi_hardclock(struct trapframe *tf)
 {
 	struct trapframe *oldframe;
 	struct thread *td;
 
 	critical_enter();
 	td = curthread;
 	td->td_intr_nesting_level++;
 	oldframe = td->td_intr_frame;
 	td->td_intr_frame = tf;
 	hardclockintr();
 	td->td_intr_frame = oldframe;
 	td->td_intr_nesting_level--;
 	critical_exit();
 }
 
 static void
-spitfire_ipi_selected(u_int cpus, u_long d0, u_long d1, u_long d2)
+spitfire_ipi_selected(cpuset_t cpus, u_long d0, u_long d1, u_long d2)
 {
 	u_int cpu;
 
-	while (cpus) {
-		cpu = ffs(cpus) - 1;
-		cpus &= ~(1 << cpu);
+	while ((cpu = cpusetobj_ffs(&cpus)) != 0) {
+		cpu--;
+		CPU_CLR(cpu, &cpus);
 		spitfire_ipi_single(cpu, d0, d1, d2);
 	}
 }
 
 static void
 spitfire_ipi_single(u_int cpu, u_long d0, u_long d1, u_long d2)
 {
 	register_t s;
 	u_long ids;
 	u_int mid;
 	int i;
 
 	KASSERT(cpu != curcpu, ("%s: CPU can't IPI itself", __func__));
 	KASSERT((ldxa(0, ASI_INTR_DISPATCH_STATUS) & IDR_BUSY) == 0,
 	    ("%s: outstanding dispatch", __func__));
 	mid = cpuid_to_mid[cpu];
 	for (i = 0; i < IPI_RETRIES; i++) {
 		s = intr_disable();
 		stxa(AA_SDB_INTR_D0, ASI_SDB_INTR_W, d0);
 		stxa(AA_SDB_INTR_D1, ASI_SDB_INTR_W, d1);
 		stxa(AA_SDB_INTR_D2, ASI_SDB_INTR_W, d2);
 		membar(Sync);
 		stxa(AA_INTR_SEND | (mid << IDC_ITID_SHIFT),
 		    ASI_SDB_INTR_W, 0);
 		/*
 		 * Workaround for SpitFire erratum #54; do a dummy read
 		 * from a SDB internal register before the MEMBAR #Sync
 		 * for the write to ASI_SDB_INTR_W (requiring another
 		 * MEMBAR #Sync in order to make sure the write has
 		 * occurred before the load).
 		 */
 		membar(Sync);
 		(void)ldxa(AA_SDB_CNTL_HIGH, ASI_SDB_CONTROL_R);
 		membar(Sync);
 		while (((ids = ldxa(0, ASI_INTR_DISPATCH_STATUS)) &
 		    IDR_BUSY) != 0)
 			;
 		intr_restore(s);
 		if ((ids & (IDR_BUSY | IDR_NACK)) == 0)
 			return;
 		/*
 		 * Leave interrupts enabled for a bit before retrying
 		 * in order to avoid deadlocks if the other CPU is also
 		 * trying to send an IPI.
 		 */
 		DELAY(2);
 	}
 	if (kdb_active != 0 || panicstr != NULL)
 		printf("%s: couldn't send IPI to module 0x%u\n",
 		    __func__, mid);
 	else
 		panic("%s: couldn't send IPI to module 0x%u",
 		    __func__, mid);
 }
 
 static void
 cheetah_ipi_single(u_int cpu, u_long d0, u_long d1, u_long d2)
 {
 	register_t s;
 	u_long ids;
 	u_int mid;
 	int i;
 
 	KASSERT(cpu != curcpu, ("%s: CPU can't IPI itself", __func__));
 	KASSERT((ldxa(0, ASI_INTR_DISPATCH_STATUS) &
 	    IDR_CHEETAH_ALL_BUSY) == 0,
 	    ("%s: outstanding dispatch", __func__));
 	mid = cpuid_to_mid[cpu];
 	for (i = 0; i < IPI_RETRIES; i++) {
 		s = intr_disable();
 		stxa(AA_SDB_INTR_D0, ASI_SDB_INTR_W, d0);
 		stxa(AA_SDB_INTR_D1, ASI_SDB_INTR_W, d1);
 		stxa(AA_SDB_INTR_D2, ASI_SDB_INTR_W, d2);
 		membar(Sync);
 		stxa(AA_INTR_SEND | (mid << IDC_ITID_SHIFT),
 		    ASI_SDB_INTR_W, 0);
 		membar(Sync);
 		while (((ids = ldxa(0, ASI_INTR_DISPATCH_STATUS)) &
 		    IDR_BUSY) != 0)
 			;
 		intr_restore(s);
 		if ((ids & (IDR_BUSY | IDR_NACK)) == 0)
 			return;
 		/*
 		 * Leave interrupts enabled for a bit before retrying
 		 * in order to avoid deadlocks if the other CPU is also
 		 * trying to send an IPI.
 		 */
 		DELAY(2);
 	}
 	if (kdb_active != 0 || panicstr != NULL)
 		printf("%s: couldn't send IPI to module 0x%u\n",
 		    __func__, mid);
 	else
 		panic("%s: couldn't send IPI to module 0x%u",
 		    __func__, mid);
 }
 
 static void
-cheetah_ipi_selected(u_int cpus, u_long d0, u_long d1, u_long d2)
+cheetah_ipi_selected(cpuset_t cpus, u_long d0, u_long d1, u_long d2)
 {
+	char pbuf[CPUSETBUFSIZ];
 	register_t s;
 	u_long ids;
 	u_int bnp;
 	u_int cpu;
 	int i;
 
-	KASSERT((cpus & (1 << curcpu)) == 0,
-	    ("%s: CPU can't IPI itself", __func__));
+	KASSERT(!CPU_ISSET(curcpu, &cpus), ("%s: CPU can't IPI itself",
+	    __func__));
 	KASSERT((ldxa(0, ASI_INTR_DISPATCH_STATUS) &
 	    IDR_CHEETAH_ALL_BUSY) == 0,
 	    ("%s: outstanding dispatch", __func__));
-	if (cpus == 0)
+	if (CPU_EMPTY(&cpus))
 		return;
 	ids = 0;
 	for (i = 0; i < IPI_RETRIES * mp_ncpus; i++) {
 		s = intr_disable();
 		stxa(AA_SDB_INTR_D0, ASI_SDB_INTR_W, d0);
 		stxa(AA_SDB_INTR_D1, ASI_SDB_INTR_W, d1);
 		stxa(AA_SDB_INTR_D2, ASI_SDB_INTR_W, d2);
 		membar(Sync);
 		bnp = 0;
 		for (cpu = 0; cpu < mp_ncpus; cpu++) {
-			if ((cpus & (1 << cpu)) != 0) {
+			if (CPU_ISSET(cpu, &cpus)) {
 				stxa(AA_INTR_SEND | (cpuid_to_mid[cpu] <<
 				    IDC_ITID_SHIFT) | bnp << IDC_BN_SHIFT,
 				    ASI_SDB_INTR_W, 0);
 				membar(Sync);
 				bnp++;
 			}
 		}
 		while (((ids = ldxa(0, ASI_INTR_DISPATCH_STATUS)) &
 		    IDR_CHEETAH_ALL_BUSY) != 0)
 			;
 		intr_restore(s);
 		if ((ids &
 		    (IDR_CHEETAH_ALL_BUSY | IDR_CHEETAH_ALL_NACK)) == 0)
 			return;
 		bnp = 0;
 		for (cpu = 0; cpu < mp_ncpus; cpu++) {
-			if ((cpus & (1 << cpu)) != 0) {
+			if (CPU_ISSET(cpu, &cpus)) {
 				if ((ids & (IDR_NACK << (2 * bnp))) == 0)
-					cpus &= ~(1 << cpu);
+					CPU_CLR(cpu, &cpus);
 				bnp++;
 			}
 		}
 		/*
 		 * On at least Fire V880 we may receive IDR_NACKs for
 		 * CPUs we actually haven't tried to send an IPI to,
 		 * but which apparently can be safely ignored.
 		 */
-		if (cpus == 0)
+		if (CPU_EMPTY(&cpus))
 			return;
 		/*
 		 * Leave interrupts enabled for a bit before retrying
 		 * in order to avoid deadlocks if the other CPUs are
 		 * also trying to send IPIs.
 		 */
 		DELAY(2 * mp_ncpus);
 	}
 	if (kdb_active != 0 || panicstr != NULL)
-		printf("%s: couldn't send IPI (cpus=0x%u ids=0x%lu)\n",
-		    __func__, cpus, ids);
+		printf("%s: couldn't send IPI (cpus=%s ids=%#lx)\n",
+		    __func__, cpusetobj_strprint(pbuf, &cpus), ids);
 	else
-		panic("%s: couldn't send IPI (cpus=0x%u ids=0x%lu)",
-		    __func__, cpus, ids);
+		panic("%s: couldn't send IPI (cpus=%s ids=%#lx)",
+		    __func__, cpusetobj_strprint(pbuf, &cpus), ids);
 }
 
 static void
 jalapeno_ipi_single(u_int cpu, u_long d0, u_long d1, u_long d2)
 {
 	register_t s;
 	u_long ids;
 	u_int busy, busynack, mid;
 	int i;
 
 	KASSERT(cpu != curcpu, ("%s: CPU can't IPI itself", __func__));
 	KASSERT((ldxa(0, ASI_INTR_DISPATCH_STATUS) &
 	    IDR_CHEETAH_ALL_BUSY) == 0,
 	    ("%s: outstanding dispatch", __func__));
 	mid = cpuid_to_mid[cpu];
 	busy = IDR_BUSY << (2 * mid);
 	busynack = (IDR_BUSY | IDR_NACK) << (2 * mid);
 	for (i = 0; i < IPI_RETRIES; i++) {
 		s = intr_disable();
 		stxa(AA_SDB_INTR_D0, ASI_SDB_INTR_W, d0);
 		stxa(AA_SDB_INTR_D1, ASI_SDB_INTR_W, d1);
 		stxa(AA_SDB_INTR_D2, ASI_SDB_INTR_W, d2);
 		membar(Sync);
 		stxa(AA_INTR_SEND | (mid << IDC_ITID_SHIFT),
 		    ASI_SDB_INTR_W, 0);
 		membar(Sync);
 		while (((ids = ldxa(0, ASI_INTR_DISPATCH_STATUS)) &
 		    busy) != 0)
 			;
 		intr_restore(s);
 		if ((ids & busynack) == 0)
 			return;
 		/*
 		 * Leave interrupts enabled for a bit before retrying
 		 * in order to avoid deadlocks if the other CPU is also
 		 * trying to send an IPI.
 		 */
 		DELAY(2);
 	}
 	if (kdb_active != 0 || panicstr != NULL)
 		printf("%s: couldn't send IPI to module 0x%u\n",
 		    __func__, mid);
 	else
 		panic("%s: couldn't send IPI to module 0x%u",
 		    __func__, mid);
 }
 
 static void
-jalapeno_ipi_selected(u_int cpus, u_long d0, u_long d1, u_long d2)
+jalapeno_ipi_selected(cpuset_t cpus, u_long d0, u_long d1, u_long d2)
 {
+	char pbuf[CPUSETBUFSIZ];
 	register_t s;
 	u_long ids;
 	u_int cpu;
 	int i;
 
-	KASSERT((cpus & (1 << curcpu)) == 0,
-	    ("%s: CPU can't IPI itself", __func__));
+	KASSERT(!CPU_ISSET(curcpu, &cpus), ("%s: CPU can't IPI itself",
+	    __func__));
 	KASSERT((ldxa(0, ASI_INTR_DISPATCH_STATUS) &
 	    IDR_CHEETAH_ALL_BUSY) == 0,
 	    ("%s: outstanding dispatch", __func__));
-	if (cpus == 0)
+	if (CPU_EMPTY(&cpus))
 		return;
 	ids = 0;
 	for (i = 0; i < IPI_RETRIES * mp_ncpus; i++) {
 		s = intr_disable();
 		stxa(AA_SDB_INTR_D0, ASI_SDB_INTR_W, d0);
 		stxa(AA_SDB_INTR_D1, ASI_SDB_INTR_W, d1);
 		stxa(AA_SDB_INTR_D2, ASI_SDB_INTR_W, d2);
 		membar(Sync);
 		for (cpu = 0; cpu < mp_ncpus; cpu++) {
-			if ((cpus & (1 << cpu)) != 0) {
+			if (CPU_ISSET(cpu, &cpus)) {
 				stxa(AA_INTR_SEND | (cpuid_to_mid[cpu] <<
 				    IDC_ITID_SHIFT), ASI_SDB_INTR_W, 0);
 				membar(Sync);
 			}
 		}
 		while (((ids = ldxa(0, ASI_INTR_DISPATCH_STATUS)) &
 		    IDR_CHEETAH_ALL_BUSY) != 0)
 			;
 		intr_restore(s);
 		if ((ids &
 		    (IDR_CHEETAH_ALL_BUSY | IDR_CHEETAH_ALL_NACK)) == 0)
 			return;
 		for (cpu = 0; cpu < mp_ncpus; cpu++)
-			if ((cpus & (1 << cpu)) != 0)
+			if (CPU_ISSET(cpu, &cpus))
 				if ((ids & (IDR_NACK <<
 				    (2 * cpuid_to_mid[cpu]))) == 0)
-					cpus &= ~(1 << cpu);
+					CPU_CLR(cpu, &cpus);
 		/*
 		 * Leave interrupts enabled for a bit before retrying
 		 * in order to avoid deadlocks if the other CPUs are
 		 * also trying to send IPIs.
 		 */
 		DELAY(2 * mp_ncpus);
 	}
 	if (kdb_active != 0 || panicstr != NULL)
-		printf("%s: couldn't send IPI (cpus=0x%u ids=0x%lu)\n",
-		    __func__, cpus, ids);
+		printf("%s: couldn't send IPI (cpus=%s ids=%#lx)\n",
+		    __func__, cpusetobj_strprint(pbuf, &cpus), ids);
 	else
-		panic("%s: couldn't send IPI (cpus=0x%u ids=0x%lu)",
-		    __func__, cpus, ids);
+		panic("%s: couldn't send IPI (cpus=%s ids=%#lx)",
+		    __func__, cpusetobj_strprint(pbuf, &cpus), ids);
 }
Index: head/sys/sparc64/sparc64/pmap.c
===================================================================
--- head/sys/sparc64/sparc64/pmap.c	(revision 222812)
+++ head/sys/sparc64/sparc64/pmap.c	(revision 222813)
@@ -1,2260 +1,2260 @@
 /*-
  * Copyright (c) 1991 Regents of the University of California.
  * All rights reserved.
  * Copyright (c) 1994 John S. Dyson
  * All rights reserved.
  * Copyright (c) 1994 David Greenman
  * All rights reserved.
  *
  * This code is derived from software contributed to Berkeley by
  * the Systems Programming Group of the University of Utah Computer
  * Science Department and William Jolitz of UUNET Technologies Inc.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
  * 1. Redistributions of source code must retain the above copyright
  *    notice, this list of conditions and the following disclaimer.
  * 2. Redistributions in binary form must reproduce the above copyright
  *    notice, this list of conditions and the following disclaimer in the
  *    documentation and/or other materials provided with the distribution.
  * 4. Neither the name of the University nor the names of its contributors
  *    may be used to endorse or promote products derived from this software
  *    without specific prior written permission.
  *
  * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
  * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
  * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
  * ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
  * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
  * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
  * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
  * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
  * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
  * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
  * SUCH DAMAGE.
  *
  *      from:   @(#)pmap.c      7.7 (Berkeley)  5/12/91
  */
 
 #include <sys/cdefs.h>
 __FBSDID("$FreeBSD$");
 
 /*
  * Manages physical address maps.
  *
  * In addition to hardware address maps, this module is called upon to
  * provide software-use-only maps which may or may not be stored in the
  * same form as hardware maps.  These pseudo-maps are used to store
  * intermediate results from copy operations to and from address spaces.
  *
  * Since the information managed by this module is also stored by the
  * logical address mapping module, this module may throw away valid virtual
  * to physical mappings at almost any time.  However, invalidations of
  * mappings must be done as requested.
  *
  * In order to cope with hardware architectures which make virtual to
  * physical map invalidates expensive, this module may delay invalidate
  * reduced protection operations until such time as they are actually
  * necessary.  This module is given full information as to which processors
  * are currently using which maps, and to when physical maps must be made
  * correct.
  */
 
 #include "opt_kstack_pages.h"
 #include "opt_pmap.h"
 
 #include <sys/param.h>
 #include <sys/kernel.h>
 #include <sys/ktr.h>
 #include <sys/lock.h>
 #include <sys/msgbuf.h>
 #include <sys/mutex.h>
 #include <sys/proc.h>
 #include <sys/smp.h>
 #include <sys/sysctl.h>
 #include <sys/systm.h>
 #include <sys/vmmeter.h>
 
 #include <dev/ofw/openfirm.h>
 
 #include <vm/vm.h>
 #include <vm/vm_param.h>
 #include <vm/vm_kern.h>
 #include <vm/vm_page.h>
 #include <vm/vm_map.h>
 #include <vm/vm_object.h>
 #include <vm/vm_extern.h>
 #include <vm/vm_pageout.h>
 #include <vm/vm_pager.h>
 
 #include <machine/cache.h>
 #include <machine/frame.h>
 #include <machine/instr.h>
 #include <machine/md_var.h>
 #include <machine/metadata.h>
 #include <machine/ofw_mem.h>
 #include <machine/smp.h>
 #include <machine/tlb.h>
 #include <machine/tte.h>
 #include <machine/tsb.h>
 #include <machine/ver.h>
 
 #define	PMAP_DEBUG
 
 #ifndef	PMAP_SHPGPERPROC
 #define	PMAP_SHPGPERPROC	200
 #endif
 
 /* XXX */
 #include "opt_sched.h"
 #ifndef SCHED_4BSD
 #error "sparc64 only works with SCHED_4BSD which uses a global scheduler lock."
 #endif
 extern struct mtx sched_lock;
 
 /*
  * Virtual address of message buffer
  */
 struct msgbuf *msgbufp;
 
 /*
  * Map of physical memory reagions
  */
 vm_paddr_t phys_avail[128];
 static struct ofw_mem_region mra[128];
 struct ofw_mem_region sparc64_memreg[128];
 int sparc64_nmemreg;
 static struct ofw_map translations[128];
 static int translations_size;
 
 static vm_offset_t pmap_idle_map;
 static vm_offset_t pmap_temp_map_1;
 static vm_offset_t pmap_temp_map_2;
 
 /*
  * First and last available kernel virtual addresses
  */
 vm_offset_t virtual_avail;
 vm_offset_t virtual_end;
 vm_offset_t kernel_vm_end;
 
 vm_offset_t vm_max_kernel_address;
 
 /*
  * Kernel pmap
  */
 struct pmap kernel_pmap_store;
 
 /*
  * Allocate physical memory for use in pmap_bootstrap.
  */
 static vm_paddr_t pmap_bootstrap_alloc(vm_size_t size, uint32_t colors);
 
 static void pmap_bootstrap_set_tte(struct tte *tp, u_long vpn, u_long data);
 
 /*
  * Map the given physical page at the specified virtual address in the
  * target pmap with the protection requested.  If specified the page
  * will be wired down.
  *
  * The page queues and pmap must be locked.
  */
 static void pmap_enter_locked(pmap_t pm, vm_offset_t va, vm_page_t m,
     vm_prot_t prot, boolean_t wired);
 
 extern int tl1_dmmu_miss_direct_patch_tsb_phys_1[];
 extern int tl1_dmmu_miss_direct_patch_tsb_phys_end_1[];
 extern int tl1_dmmu_miss_patch_asi_1[];
 extern int tl1_dmmu_miss_patch_quad_ldd_1[];
 extern int tl1_dmmu_miss_patch_tsb_1[];
 extern int tl1_dmmu_miss_patch_tsb_2[];
 extern int tl1_dmmu_miss_patch_tsb_mask_1[];
 extern int tl1_dmmu_miss_patch_tsb_mask_2[];
 extern int tl1_dmmu_prot_patch_asi_1[];
 extern int tl1_dmmu_prot_patch_quad_ldd_1[];
 extern int tl1_dmmu_prot_patch_tsb_1[];
 extern int tl1_dmmu_prot_patch_tsb_2[];
 extern int tl1_dmmu_prot_patch_tsb_mask_1[];
 extern int tl1_dmmu_prot_patch_tsb_mask_2[];
 extern int tl1_immu_miss_patch_asi_1[];
 extern int tl1_immu_miss_patch_quad_ldd_1[];
 extern int tl1_immu_miss_patch_tsb_1[];
 extern int tl1_immu_miss_patch_tsb_2[];
 extern int tl1_immu_miss_patch_tsb_mask_1[];
 extern int tl1_immu_miss_patch_tsb_mask_2[];
 
 /*
  * If user pmap is processed with pmap_remove and with pmap_remove and the
  * resident count drops to 0, there are no more pages to remove, so we
  * need not continue.
  */
 #define	PMAP_REMOVE_DONE(pm) \
 	((pm) != kernel_pmap && (pm)->pm_stats.resident_count == 0)
 
 /*
  * The threshold (in bytes) above which tsb_foreach() is used in pmap_remove()
  * and pmap_protect() instead of trying each virtual address.
  */
 #define	PMAP_TSB_THRESH	((TSB_SIZE / 2) * PAGE_SIZE)
 
 SYSCTL_NODE(_debug, OID_AUTO, pmap_stats, CTLFLAG_RD, 0, "");
 
 PMAP_STATS_VAR(pmap_nenter);
 PMAP_STATS_VAR(pmap_nenter_update);
 PMAP_STATS_VAR(pmap_nenter_replace);
 PMAP_STATS_VAR(pmap_nenter_new);
 PMAP_STATS_VAR(pmap_nkenter);
 PMAP_STATS_VAR(pmap_nkenter_oc);
 PMAP_STATS_VAR(pmap_nkenter_stupid);
 PMAP_STATS_VAR(pmap_nkremove);
 PMAP_STATS_VAR(pmap_nqenter);
 PMAP_STATS_VAR(pmap_nqremove);
 PMAP_STATS_VAR(pmap_ncache_enter);
 PMAP_STATS_VAR(pmap_ncache_enter_c);
 PMAP_STATS_VAR(pmap_ncache_enter_oc);
 PMAP_STATS_VAR(pmap_ncache_enter_cc);
 PMAP_STATS_VAR(pmap_ncache_enter_coc);
 PMAP_STATS_VAR(pmap_ncache_enter_nc);
 PMAP_STATS_VAR(pmap_ncache_enter_cnc);
 PMAP_STATS_VAR(pmap_ncache_remove);
 PMAP_STATS_VAR(pmap_ncache_remove_c);
 PMAP_STATS_VAR(pmap_ncache_remove_oc);
 PMAP_STATS_VAR(pmap_ncache_remove_cc);
 PMAP_STATS_VAR(pmap_ncache_remove_coc);
 PMAP_STATS_VAR(pmap_ncache_remove_nc);
 PMAP_STATS_VAR(pmap_nzero_page);
 PMAP_STATS_VAR(pmap_nzero_page_c);
 PMAP_STATS_VAR(pmap_nzero_page_oc);
 PMAP_STATS_VAR(pmap_nzero_page_nc);
 PMAP_STATS_VAR(pmap_nzero_page_area);
 PMAP_STATS_VAR(pmap_nzero_page_area_c);
 PMAP_STATS_VAR(pmap_nzero_page_area_oc);
 PMAP_STATS_VAR(pmap_nzero_page_area_nc);
 PMAP_STATS_VAR(pmap_nzero_page_idle);
 PMAP_STATS_VAR(pmap_nzero_page_idle_c);
 PMAP_STATS_VAR(pmap_nzero_page_idle_oc);
 PMAP_STATS_VAR(pmap_nzero_page_idle_nc);
 PMAP_STATS_VAR(pmap_ncopy_page);
 PMAP_STATS_VAR(pmap_ncopy_page_c);
 PMAP_STATS_VAR(pmap_ncopy_page_oc);
 PMAP_STATS_VAR(pmap_ncopy_page_nc);
 PMAP_STATS_VAR(pmap_ncopy_page_dc);
 PMAP_STATS_VAR(pmap_ncopy_page_doc);
 PMAP_STATS_VAR(pmap_ncopy_page_sc);
 PMAP_STATS_VAR(pmap_ncopy_page_soc);
 
 PMAP_STATS_VAR(pmap_nnew_thread);
 PMAP_STATS_VAR(pmap_nnew_thread_oc);
 
 static inline u_long dtlb_get_data(u_int slot);
 
 /*
  * Quick sort callout for comparing memory regions
  */
 static int mr_cmp(const void *a, const void *b);
 static int om_cmp(const void *a, const void *b);
 
 static int
 mr_cmp(const void *a, const void *b)
 {
 	const struct ofw_mem_region *mra;
 	const struct ofw_mem_region *mrb;
 
 	mra = a;
 	mrb = b;
 	if (mra->mr_start < mrb->mr_start)
 		return (-1);
 	else if (mra->mr_start > mrb->mr_start)
 		return (1);
 	else
 		return (0);
 }
 
 static int
 om_cmp(const void *a, const void *b)
 {
 	const struct ofw_map *oma;
 	const struct ofw_map *omb;
 
 	oma = a;
 	omb = b;
 	if (oma->om_start < omb->om_start)
 		return (-1);
 	else if (oma->om_start > omb->om_start)
 		return (1);
 	else
 		return (0);
 }
 
 static inline u_long
 dtlb_get_data(u_int slot)
 {
 
 	/*
 	 * We read ASI_DTLB_DATA_ACCESS_REG twice in order to work
 	 * around errata of USIII and beyond.
 	 */
 	(void)ldxa(TLB_DAR_SLOT(slot), ASI_DTLB_DATA_ACCESS_REG);
 	return (ldxa(TLB_DAR_SLOT(slot), ASI_DTLB_DATA_ACCESS_REG));
 }
 
 /*
  * Bootstrap the system enough to run with virtual memory.
  */
 void
 pmap_bootstrap(u_int cpu_impl)
 {
 	struct pmap *pm;
 	struct tte *tp;
 	vm_offset_t off;
 	vm_offset_t va;
 	vm_paddr_t pa;
 	vm_size_t physsz;
 	vm_size_t virtsz;
 	u_long data;
 	u_long vpn;
 	phandle_t pmem;
 	phandle_t vmem;
 	u_int dtlb_slots_avail;
 	int i;
 	int j;
 	int sz;
 	uint32_t asi;
 	uint32_t colors;
 	uint32_t ldd;
 
 	/*
 	 * Set the kernel context.
 	 */
 	pmap_set_kctx();
 
 	colors = dcache_color_ignore != 0 ? 1 : DCACHE_COLORS;
 
 	/*
 	 * Find out what physical memory is available from the PROM and
 	 * initialize the phys_avail array.  This must be done before
 	 * pmap_bootstrap_alloc is called.
 	 */
 	if ((pmem = OF_finddevice("/memory")) == -1)
 		panic("pmap_bootstrap: finddevice /memory");
 	if ((sz = OF_getproplen(pmem, "available")) == -1)
 		panic("pmap_bootstrap: getproplen /memory/available");
 	if (sizeof(phys_avail) < sz)
 		panic("pmap_bootstrap: phys_avail too small");
 	if (sizeof(mra) < sz)
 		panic("pmap_bootstrap: mra too small");
 	bzero(mra, sz);
 	if (OF_getprop(pmem, "available", mra, sz) == -1)
 		panic("pmap_bootstrap: getprop /memory/available");
 	sz /= sizeof(*mra);
 	CTR0(KTR_PMAP, "pmap_bootstrap: physical memory");
 	qsort(mra, sz, sizeof (*mra), mr_cmp);
 	physsz = 0;
 	getenv_quad("hw.physmem", &physmem);
 	physmem = btoc(physmem);
 	for (i = 0, j = 0; i < sz; i++, j += 2) {
 		CTR2(KTR_PMAP, "start=%#lx size=%#lx", mra[i].mr_start,
 		    mra[i].mr_size);
 		if (physmem != 0 && btoc(physsz + mra[i].mr_size) >= physmem) {
 			if (btoc(physsz) < physmem) {
 				phys_avail[j] = mra[i].mr_start;
 				phys_avail[j + 1] = mra[i].mr_start +
 				    (ctob(physmem) - physsz);
 				physsz = ctob(physmem);
 			}
 			break;
 		}
 		phys_avail[j] = mra[i].mr_start;
 		phys_avail[j + 1] = mra[i].mr_start + mra[i].mr_size;
 		physsz += mra[i].mr_size;
 	}
 	physmem = btoc(physsz);
 
 	/*
 	 * Calculate the size of kernel virtual memory, and the size and mask
 	 * for the kernel TSB based on the phsyical memory size but limited
 	 * by the amount of dTLB slots available for locked entries if we have
 	 * to lock the TSB in the TLB (given that for spitfire-class CPUs all
 	 * of the dt64 slots can hold locked entries but there is no large
 	 * dTLB for unlocked ones, we don't use more than half of it for the
 	 * TSB).
 	 * Note that for reasons unknown OpenSolaris doesn't take advantage of
 	 * ASI_ATOMIC_QUAD_LDD_PHYS on UltraSPARC-III.  However, given that no
 	 * public documentation is available for these, the latter just might
 	 * not support it, yet.
 	 */
 	virtsz = roundup(physsz, PAGE_SIZE_4M << (PAGE_SHIFT - TTE_SHIFT));
 	if (cpu_impl == CPU_IMPL_SPARC64V ||
 	    cpu_impl >= CPU_IMPL_ULTRASPARCIIIp)
 		tsb_kernel_ldd_phys = 1;
 	else {
 		dtlb_slots_avail = 0;
 		for (i = 0; i < dtlb_slots; i++) {
 			data = dtlb_get_data(i);
 			if ((data & (TD_V | TD_L)) != (TD_V | TD_L))
 				dtlb_slots_avail++;
 		}
 #ifdef SMP
 		dtlb_slots_avail -= PCPU_PAGES;
 #endif
 		if (cpu_impl >= CPU_IMPL_ULTRASPARCI &&
 		    cpu_impl < CPU_IMPL_ULTRASPARCIII)
 			dtlb_slots_avail /= 2;
 		virtsz = MIN(virtsz, (dtlb_slots_avail * PAGE_SIZE_4M) <<
 		    (PAGE_SHIFT - TTE_SHIFT));
 	}
 	vm_max_kernel_address = VM_MIN_KERNEL_ADDRESS + virtsz;
 	tsb_kernel_size = virtsz >> (PAGE_SHIFT - TTE_SHIFT);
 	tsb_kernel_mask = (tsb_kernel_size >> TTE_SHIFT) - 1;
 
 	/*
 	 * Allocate the kernel TSB and lock it in the TLB if necessary.
 	 */
 	pa = pmap_bootstrap_alloc(tsb_kernel_size, colors);
 	if (pa & PAGE_MASK_4M)
 		panic("pmap_bootstrap: TSB unaligned\n");
 	tsb_kernel_phys = pa;
 	if (tsb_kernel_ldd_phys == 0) {
 		tsb_kernel =
 		    (struct tte *)(VM_MIN_KERNEL_ADDRESS - tsb_kernel_size);
 		pmap_map_tsb();
 		bzero(tsb_kernel, tsb_kernel_size);
 	} else {
 		tsb_kernel =
 		    (struct tte *)TLB_PHYS_TO_DIRECT(tsb_kernel_phys);
 		aszero(ASI_PHYS_USE_EC, tsb_kernel_phys, tsb_kernel_size);
 	}
 
 	/*
 	 * Allocate and map the dynamic per-CPU area for the BSP.
 	 */
 	pa = pmap_bootstrap_alloc(DPCPU_SIZE, colors);
 	dpcpu0 = (void *)TLB_PHYS_TO_DIRECT(pa);
 
 	/*
 	 * Allocate and map the message buffer.
 	 */
 	pa = pmap_bootstrap_alloc(msgbufsize, colors);
 	msgbufp = (struct msgbuf *)TLB_PHYS_TO_DIRECT(pa);
 
 	/*
 	 * Patch the TSB addresses and mask as well as the ASIs used to load
 	 * it into the trap table.
 	 */
 
 #define	LDDA_R_I_R(rd, imm_asi, rs1, rs2)				\
 	(EIF_OP(IOP_LDST) | EIF_F3_RD(rd) | EIF_F3_OP3(INS3_LDDA) |	\
 	    EIF_F3_RS1(rs1) | EIF_F3_I(0) | EIF_F3_IMM_ASI(imm_asi) |	\
 	    EIF_F3_RS2(rs2))
 #define	OR_R_I_R(rd, imm13, rs1)					\
 	(EIF_OP(IOP_MISC) | EIF_F3_RD(rd) | EIF_F3_OP3(INS2_OR) |	\
 	    EIF_F3_RS1(rs1) | EIF_F3_I(1) | EIF_IMM(imm13, 13))
 #define	SETHI(rd, imm22)						\
 	(EIF_OP(IOP_FORM2) | EIF_F2_RD(rd) | EIF_F2_OP2(INS0_SETHI) |	\
 	    EIF_IMM((imm22) >> 10, 22))
 #define	WR_R_I(rd, imm13, rs1)						\
 	(EIF_OP(IOP_MISC) | EIF_F3_RD(rd) | EIF_F3_OP3(INS2_WR) |	\
 	    EIF_F3_RS1(rs1) | EIF_F3_I(1) | EIF_IMM(imm13, 13))
 
 #define	PATCH_ASI(addr, asi) do {					\
 	if (addr[0] != WR_R_I(IF_F3_RD(addr[0]), 0x0,			\
 	    IF_F3_RS1(addr[0])))					\
 		panic("%s: patched instructions have changed",		\
 		    __func__);						\
 	addr[0] |= EIF_IMM((asi), 13);					\
 	flush(addr);							\
 } while (0)
 
 #define	PATCH_LDD(addr, asi) do {					\
 	if (addr[0] != LDDA_R_I_R(IF_F3_RD(addr[0]), 0x0,		\
 	    IF_F3_RS1(addr[0]), IF_F3_RS2(addr[0])))			\
 		panic("%s: patched instructions have changed",		\
 		    __func__);						\
 	addr[0] |= EIF_F3_IMM_ASI(asi);					\
 	flush(addr);							\
 } while (0)
 
 #define	PATCH_TSB(addr, val) do {					\
 	if (addr[0] != SETHI(IF_F2_RD(addr[0]), 0x0) ||			\
 	    addr[1] != OR_R_I_R(IF_F3_RD(addr[1]), 0x0,			\
 	    IF_F3_RS1(addr[1]))	||					\
 	    addr[3] != SETHI(IF_F2_RD(addr[3]), 0x0))			\
 		panic("%s: patched instructions have changed",		\
 		    __func__);						\
 	addr[0] |= EIF_IMM((val) >> 42, 22);				\
 	addr[1] |= EIF_IMM((val) >> 32, 10);				\
 	addr[3] |= EIF_IMM((val) >> 10, 22);				\
 	flush(addr);							\
 	flush(addr + 1);						\
 	flush(addr + 3);						\
 } while (0)
 
 #define	PATCH_TSB_MASK(addr, val) do {					\
 	if (addr[0] != SETHI(IF_F2_RD(addr[0]), 0x0) ||			\
 	    addr[1] != OR_R_I_R(IF_F3_RD(addr[1]), 0x0,			\
 	    IF_F3_RS1(addr[1])))					\
 		panic("%s: patched instructions have changed",		\
 		    __func__);						\
 	addr[0] |= EIF_IMM((val) >> 10, 22);				\
 	addr[1] |= EIF_IMM((val), 10);					\
 	flush(addr);							\
 	flush(addr + 1);						\
 } while (0)
 
 	if (tsb_kernel_ldd_phys == 0) {
 		asi = ASI_N;
 		ldd = ASI_NUCLEUS_QUAD_LDD;
 		off = (vm_offset_t)tsb_kernel;
 	} else {
 		asi = ASI_PHYS_USE_EC;
 		ldd = ASI_ATOMIC_QUAD_LDD_PHYS;
 		off = (vm_offset_t)tsb_kernel_phys;
 	}
 	PATCH_TSB(tl1_dmmu_miss_direct_patch_tsb_phys_1, tsb_kernel_phys);
 	PATCH_TSB(tl1_dmmu_miss_direct_patch_tsb_phys_end_1,
 	    tsb_kernel_phys + tsb_kernel_size - 1);
 	PATCH_ASI(tl1_dmmu_miss_patch_asi_1, asi);
 	PATCH_LDD(tl1_dmmu_miss_patch_quad_ldd_1, ldd);
 	PATCH_TSB(tl1_dmmu_miss_patch_tsb_1, off);
 	PATCH_TSB(tl1_dmmu_miss_patch_tsb_2, off);
 	PATCH_TSB_MASK(tl1_dmmu_miss_patch_tsb_mask_1, tsb_kernel_mask);
 	PATCH_TSB_MASK(tl1_dmmu_miss_patch_tsb_mask_2, tsb_kernel_mask);
 	PATCH_ASI(tl1_dmmu_prot_patch_asi_1, asi);
 	PATCH_LDD(tl1_dmmu_prot_patch_quad_ldd_1, ldd);
 	PATCH_TSB(tl1_dmmu_prot_patch_tsb_1, off);
 	PATCH_TSB(tl1_dmmu_prot_patch_tsb_2, off);
 	PATCH_TSB_MASK(tl1_dmmu_prot_patch_tsb_mask_1, tsb_kernel_mask);
 	PATCH_TSB_MASK(tl1_dmmu_prot_patch_tsb_mask_2, tsb_kernel_mask);
 	PATCH_ASI(tl1_immu_miss_patch_asi_1, asi);
 	PATCH_LDD(tl1_immu_miss_patch_quad_ldd_1, ldd);
 	PATCH_TSB(tl1_immu_miss_patch_tsb_1, off);
 	PATCH_TSB(tl1_immu_miss_patch_tsb_2, off);
 	PATCH_TSB_MASK(tl1_immu_miss_patch_tsb_mask_1, tsb_kernel_mask);
 	PATCH_TSB_MASK(tl1_immu_miss_patch_tsb_mask_2, tsb_kernel_mask);
 
 	/*
 	 * Enter fake 8k pages for the 4MB kernel pages, so that
 	 * pmap_kextract() will work for them.
 	 */
 	for (i = 0; i < kernel_tlb_slots; i++) {
 		pa = kernel_tlbs[i].te_pa;
 		va = kernel_tlbs[i].te_va;
 		for (off = 0; off < PAGE_SIZE_4M; off += PAGE_SIZE) {
 			tp = tsb_kvtotte(va + off);
 			vpn = TV_VPN(va + off, TS_8K);
 			data = TD_V | TD_8K | TD_PA(pa + off) | TD_REF |
 			    TD_SW | TD_CP | TD_CV | TD_P | TD_W;
 			pmap_bootstrap_set_tte(tp, vpn, data);
 		}
 	}
 
 	/*
 	 * Set the start and end of KVA.  The kernel is loaded starting
 	 * at the first available 4MB super page, so we advance to the
 	 * end of the last one used for it.
 	 */
 	virtual_avail = KERNBASE + kernel_tlb_slots * PAGE_SIZE_4M;
 	virtual_end = vm_max_kernel_address;
 	kernel_vm_end = vm_max_kernel_address;
 
 	/*
 	 * Allocate kva space for temporary mappings.
 	 */
 	pmap_idle_map = virtual_avail;
 	virtual_avail += PAGE_SIZE * colors;
 	pmap_temp_map_1 = virtual_avail;
 	virtual_avail += PAGE_SIZE * colors;
 	pmap_temp_map_2 = virtual_avail;
 	virtual_avail += PAGE_SIZE * colors;
 
 	/*
 	 * Allocate a kernel stack with guard page for thread0 and map it
 	 * into the kernel TSB.  We must ensure that the virtual address is
 	 * colored properly for corresponding CPUs, since we're allocating
 	 * from phys_avail so the memory won't have an associated vm_page_t.
 	 */
 	pa = pmap_bootstrap_alloc(KSTACK_PAGES * PAGE_SIZE, colors);
 	kstack0_phys = pa;
 	virtual_avail += roundup(KSTACK_GUARD_PAGES, colors) * PAGE_SIZE;
 	kstack0 = virtual_avail;
 	virtual_avail += roundup(KSTACK_PAGES, colors) * PAGE_SIZE;
 	if (dcache_color_ignore == 0)
 		KASSERT(DCACHE_COLOR(kstack0) == DCACHE_COLOR(kstack0_phys),
 		    ("pmap_bootstrap: kstack0 miscolored"));
 	for (i = 0; i < KSTACK_PAGES; i++) {
 		pa = kstack0_phys + i * PAGE_SIZE;
 		va = kstack0 + i * PAGE_SIZE;
 		tp = tsb_kvtotte(va);
 		vpn = TV_VPN(va, TS_8K);
 		data = TD_V | TD_8K | TD_PA(pa) | TD_REF | TD_SW | TD_CP |
 		    TD_CV | TD_P | TD_W;
 		pmap_bootstrap_set_tte(tp, vpn, data);
 	}
 
 	/*
 	 * Calculate the last available physical address.
 	 */
 	for (i = 0; phys_avail[i + 2] != 0; i += 2)
 		;
 	Maxmem = sparc64_btop(phys_avail[i + 1]);
 
 	/*
 	 * Add the PROM mappings to the kernel TSB.
 	 */
 	if ((vmem = OF_finddevice("/virtual-memory")) == -1)
 		panic("pmap_bootstrap: finddevice /virtual-memory");
 	if ((sz = OF_getproplen(vmem, "translations")) == -1)
 		panic("pmap_bootstrap: getproplen translations");
 	if (sizeof(translations) < sz)
 		panic("pmap_bootstrap: translations too small");
 	bzero(translations, sz);
 	if (OF_getprop(vmem, "translations", translations, sz) == -1)
 		panic("pmap_bootstrap: getprop /virtual-memory/translations");
 	sz /= sizeof(*translations);
 	translations_size = sz;
 	CTR0(KTR_PMAP, "pmap_bootstrap: translations");
 	qsort(translations, sz, sizeof (*translations), om_cmp);
 	for (i = 0; i < sz; i++) {
 		CTR3(KTR_PMAP,
 		    "translation: start=%#lx size=%#lx tte=%#lx",
 		    translations[i].om_start, translations[i].om_size,
 		    translations[i].om_tte);
 		if ((translations[i].om_tte & TD_V) == 0)
 			continue;
 		if (translations[i].om_start < VM_MIN_PROM_ADDRESS ||
 		    translations[i].om_start > VM_MAX_PROM_ADDRESS)
 			continue;
 		for (off = 0; off < translations[i].om_size;
 		    off += PAGE_SIZE) {
 			va = translations[i].om_start + off;
 			tp = tsb_kvtotte(va);
 			vpn = TV_VPN(va, TS_8K);
 			data = ((translations[i].om_tte &
 			    ~((TD_SOFT2_MASK << TD_SOFT2_SHIFT) |
 			    (cpu_impl >= CPU_IMPL_ULTRASPARCI &&
 			    cpu_impl < CPU_IMPL_ULTRASPARCIII ?
 			    (TD_DIAG_SF_MASK << TD_DIAG_SF_SHIFT) :
 			    (TD_RSVD_CH_MASK << TD_RSVD_CH_SHIFT)) |
 			    (TD_SOFT_MASK << TD_SOFT_SHIFT))) | TD_EXEC) +
 			    off;
 			pmap_bootstrap_set_tte(tp, vpn, data);
 		}
 	}
 
 	/*
 	 * Get the available physical memory ranges from /memory/reg.  These
 	 * are only used for kernel dumps, but it may not be wise to do PROM
 	 * calls in that situation.
 	 */
 	if ((sz = OF_getproplen(pmem, "reg")) == -1)
 		panic("pmap_bootstrap: getproplen /memory/reg");
 	if (sizeof(sparc64_memreg) < sz)
 		panic("pmap_bootstrap: sparc64_memreg too small");
 	if (OF_getprop(pmem, "reg", sparc64_memreg, sz) == -1)
 		panic("pmap_bootstrap: getprop /memory/reg");
 	sparc64_nmemreg = sz / sizeof(*sparc64_memreg);
 
 	/*
 	 * Initialize the kernel pmap (which is statically allocated).
 	 * NOTE: PMAP_LOCK_INIT() is needed as part of the initialization
 	 * but sparc64 start up is not ready to initialize mutexes yet.
 	 * It is called in machdep.c.
 	 */
 	pm = kernel_pmap;
 	for (i = 0; i < MAXCPU; i++)
 		pm->pm_context[i] = TLB_CTX_KERNEL;
-	pm->pm_active = ~0;
+	CPU_FILL(&pm->pm_active);
 
 	/*
 	 * Flush all non-locked TLB entries possibly left over by the
 	 * firmware.
 	 */
 	tlb_flush_nonlocked();
 }
 
 /*
  * Map the 4MB kernel TSB pages.
  */
 void
 pmap_map_tsb(void)
 {
 	vm_offset_t va;
 	vm_paddr_t pa;
 	u_long data;
 	int i;
 
 	for (i = 0; i < tsb_kernel_size; i += PAGE_SIZE_4M) {
 		va = (vm_offset_t)tsb_kernel + i;
 		pa = tsb_kernel_phys + i;
 		data = TD_V | TD_4M | TD_PA(pa) | TD_L | TD_CP | TD_CV |
 		    TD_P | TD_W;
 		stxa(AA_DMMU_TAR, ASI_DMMU, TLB_TAR_VA(va) |
 		    TLB_TAR_CTX(TLB_CTX_KERNEL));
 		stxa_sync(0, ASI_DTLB_DATA_IN_REG, data);
 	}
 }
 
 /*
  * Set the secondary context to be the kernel context (needed for FP block
  * operations in the kernel).
  */
 void
 pmap_set_kctx(void)
 {
 
 	stxa(AA_DMMU_SCXR, ASI_DMMU, (ldxa(AA_DMMU_SCXR, ASI_DMMU) &
 	    TLB_CXR_PGSZ_MASK) | TLB_CTX_KERNEL);
 	flush(KERNBASE);
 }
 
 /*
  * Allocate a physical page of memory directly from the phys_avail map.
  * Can only be called from pmap_bootstrap before avail start and end are
  * calculated.
  */
 static vm_paddr_t
 pmap_bootstrap_alloc(vm_size_t size, uint32_t colors)
 {
 	vm_paddr_t pa;
 	int i;
 
 	size = roundup(size, PAGE_SIZE * colors);
 	for (i = 0; phys_avail[i + 1] != 0; i += 2) {
 		if (phys_avail[i + 1] - phys_avail[i] < size)
 			continue;
 		pa = phys_avail[i];
 		phys_avail[i] += size;
 		return (pa);
 	}
 	panic("pmap_bootstrap_alloc");
 }
 
 /*
  * Set a TTE.  This function is intended as a helper when tsb_kernel is
  * direct-mapped but we haven't taken over the trap table, yet, as it's the
  * case when we are taking advantage of ASI_ATOMIC_QUAD_LDD_PHYS to access
  * the kernel TSB.
  */
 void
 pmap_bootstrap_set_tte(struct tte *tp, u_long vpn, u_long data)
 {
 
 	if (tsb_kernel_ldd_phys == 0) {
 		tp->tte_vpn = vpn;
 		tp->tte_data = data;
 	} else {
 		stxa((vm_paddr_t)tp + offsetof(struct tte, tte_vpn),
 		    ASI_PHYS_USE_EC, vpn);
 		stxa((vm_paddr_t)tp + offsetof(struct tte, tte_data),
 		    ASI_PHYS_USE_EC, data);
 	}
 }
 
 /*
  * Initialize a vm_page's machine-dependent fields.
  */
 void
 pmap_page_init(vm_page_t m)
 {
 
 	TAILQ_INIT(&m->md.tte_list);
 	m->md.color = DCACHE_COLOR(VM_PAGE_TO_PHYS(m));
 	m->md.flags = 0;
 	m->md.pmap = NULL;
 }
 
 /*
  * Initialize the pmap module.
  */
 void
 pmap_init(void)
 {
 	vm_offset_t addr;
 	vm_size_t size;
 	int result;
 	int i;
 
 	for (i = 0; i < translations_size; i++) {
 		addr = translations[i].om_start;
 		size = translations[i].om_size;
 		if ((translations[i].om_tte & TD_V) == 0)
 			continue;
 		if (addr < VM_MIN_PROM_ADDRESS || addr > VM_MAX_PROM_ADDRESS)
 			continue;
 		result = vm_map_find(kernel_map, NULL, 0, &addr, size,
 		    VMFS_NO_SPACE, VM_PROT_ALL, VM_PROT_ALL, MAP_NOFAULT);
 		if (result != KERN_SUCCESS || addr != translations[i].om_start)
 			panic("pmap_init: vm_map_find");
 	}
 }
 
 /*
  * Extract the physical page address associated with the given
  * map/virtual_address pair.
  */
 vm_paddr_t
 pmap_extract(pmap_t pm, vm_offset_t va)
 {
 	struct tte *tp;
 	vm_paddr_t pa;
 
 	if (pm == kernel_pmap)
 		return (pmap_kextract(va));
 	PMAP_LOCK(pm);
 	tp = tsb_tte_lookup(pm, va);
 	if (tp == NULL)
 		pa = 0;
 	else
 		pa = TTE_GET_PA(tp) | (va & TTE_GET_PAGE_MASK(tp));
 	PMAP_UNLOCK(pm);
 	return (pa);
 }
 
 /*
  * Atomically extract and hold the physical page with the given
  * pmap and virtual address pair if that mapping permits the given
  * protection.
  */
 vm_page_t
 pmap_extract_and_hold(pmap_t pm, vm_offset_t va, vm_prot_t prot)
 {
 	struct tte *tp;
 	vm_page_t m;
 	vm_paddr_t pa;
 
 	m = NULL;
 	pa = 0;
 	PMAP_LOCK(pm);
 retry:
 	if (pm == kernel_pmap) {
 		if (va >= VM_MIN_DIRECT_ADDRESS) {
 			tp = NULL;
 			m = PHYS_TO_VM_PAGE(TLB_DIRECT_TO_PHYS(va));
 			(void)vm_page_pa_tryrelock(pm, TLB_DIRECT_TO_PHYS(va),
 			    &pa);
 			vm_page_hold(m);
 		} else {
 			tp = tsb_kvtotte(va);
 			if ((tp->tte_data & TD_V) == 0)
 				tp = NULL;
 		}
 	} else
 		tp = tsb_tte_lookup(pm, va);
 	if (tp != NULL && ((tp->tte_data & TD_SW) ||
 	    (prot & VM_PROT_WRITE) == 0)) {
 		if (vm_page_pa_tryrelock(pm, TTE_GET_PA(tp), &pa))
 			goto retry;
 		m = PHYS_TO_VM_PAGE(TTE_GET_PA(tp));
 		vm_page_hold(m);
 	}
 	PA_UNLOCK_COND(pa);
 	PMAP_UNLOCK(pm);
 	return (m);
 }
 
 /*
  * Extract the physical page address associated with the given kernel virtual
  * address.
  */
 vm_paddr_t
 pmap_kextract(vm_offset_t va)
 {
 	struct tte *tp;
 
 	if (va >= VM_MIN_DIRECT_ADDRESS)
 		return (TLB_DIRECT_TO_PHYS(va));
 	tp = tsb_kvtotte(va);
 	if ((tp->tte_data & TD_V) == 0)
 		return (0);
 	return (TTE_GET_PA(tp) | (va & TTE_GET_PAGE_MASK(tp)));
 }
 
 int
 pmap_cache_enter(vm_page_t m, vm_offset_t va)
 {
 	struct tte *tp;
 	int color;
 
 	mtx_assert(&vm_page_queue_mtx, MA_OWNED);
 	KASSERT((m->flags & PG_FICTITIOUS) == 0,
 	    ("pmap_cache_enter: fake page"));
 	PMAP_STATS_INC(pmap_ncache_enter);
 
 	if (dcache_color_ignore != 0)
 		return (1);
 
 	/*
 	 * Find the color for this virtual address and note the added mapping.
 	 */
 	color = DCACHE_COLOR(va);
 	m->md.colors[color]++;
 
 	/*
 	 * If all existing mappings have the same color, the mapping is
 	 * cacheable.
 	 */
 	if (m->md.color == color) {
 		KASSERT(m->md.colors[DCACHE_OTHER_COLOR(color)] == 0,
 		    ("pmap_cache_enter: cacheable, mappings of other color"));
 		if (m->md.color == DCACHE_COLOR(VM_PAGE_TO_PHYS(m)))
 			PMAP_STATS_INC(pmap_ncache_enter_c);
 		else
 			PMAP_STATS_INC(pmap_ncache_enter_oc);
 		return (1);
 	}
 
 	/*
 	 * If there are no mappings of the other color, and the page still has
 	 * the wrong color, this must be a new mapping.  Change the color to
 	 * match the new mapping, which is cacheable.  We must flush the page
 	 * from the cache now.
 	 */
 	if (m->md.colors[DCACHE_OTHER_COLOR(color)] == 0) {
 		KASSERT(m->md.colors[color] == 1,
 		    ("pmap_cache_enter: changing color, not new mapping"));
 		dcache_page_inval(VM_PAGE_TO_PHYS(m));
 		m->md.color = color;
 		if (m->md.color == DCACHE_COLOR(VM_PAGE_TO_PHYS(m)))
 			PMAP_STATS_INC(pmap_ncache_enter_cc);
 		else
 			PMAP_STATS_INC(pmap_ncache_enter_coc);
 		return (1);
 	}
 
 	/*
 	 * If the mapping is already non-cacheable, just return.
 	 */
 	if (m->md.color == -1) {
 		PMAP_STATS_INC(pmap_ncache_enter_nc);
 		return (0);
 	}
 
 	PMAP_STATS_INC(pmap_ncache_enter_cnc);
 
 	/*
 	 * Mark all mappings as uncacheable, flush any lines with the other
 	 * color out of the dcache, and set the color to none (-1).
 	 */
 	TAILQ_FOREACH(tp, &m->md.tte_list, tte_link) {
 		atomic_clear_long(&tp->tte_data, TD_CV);
 		tlb_page_demap(TTE_GET_PMAP(tp), TTE_GET_VA(tp));
 	}
 	dcache_page_inval(VM_PAGE_TO_PHYS(m));
 	m->md.color = -1;
 	return (0);
 }
 
 void
 pmap_cache_remove(vm_page_t m, vm_offset_t va)
 {
 	struct tte *tp;
 	int color;
 
 	mtx_assert(&vm_page_queue_mtx, MA_OWNED);
 	CTR3(KTR_PMAP, "pmap_cache_remove: m=%p va=%#lx c=%d", m, va,
 	    m->md.colors[DCACHE_COLOR(va)]);
 	KASSERT((m->flags & PG_FICTITIOUS) == 0,
 	    ("pmap_cache_remove: fake page"));
 	PMAP_STATS_INC(pmap_ncache_remove);
 
 	if (dcache_color_ignore != 0)
 		return;
 
 	KASSERT(m->md.colors[DCACHE_COLOR(va)] > 0,
 	    ("pmap_cache_remove: no mappings %d <= 0",
 	    m->md.colors[DCACHE_COLOR(va)]));
 
 	/*
 	 * Find the color for this virtual address and note the removal of
 	 * the mapping.
 	 */
 	color = DCACHE_COLOR(va);
 	m->md.colors[color]--;
 
 	/*
 	 * If the page is cacheable, just return and keep the same color, even
 	 * if there are no longer any mappings.
 	 */
 	if (m->md.color != -1) {
 		if (m->md.color == DCACHE_COLOR(VM_PAGE_TO_PHYS(m)))
 			PMAP_STATS_INC(pmap_ncache_remove_c);
 		else
 			PMAP_STATS_INC(pmap_ncache_remove_oc);
 		return;
 	}
 
 	KASSERT(m->md.colors[DCACHE_OTHER_COLOR(color)] != 0,
 	    ("pmap_cache_remove: uncacheable, no mappings of other color"));
 
 	/*
 	 * If the page is not cacheable (color is -1), and the number of
 	 * mappings for this color is not zero, just return.  There are
 	 * mappings of the other color still, so remain non-cacheable.
 	 */
 	if (m->md.colors[color] != 0) {
 		PMAP_STATS_INC(pmap_ncache_remove_nc);
 		return;
 	}
 
 	/*
 	 * The number of mappings for this color is now zero.  Recache the
 	 * other colored mappings, and change the page color to the other
 	 * color.  There should be no lines in the data cache for this page,
 	 * so flushing should not be needed.
 	 */
 	TAILQ_FOREACH(tp, &m->md.tte_list, tte_link) {
 		atomic_set_long(&tp->tte_data, TD_CV);
 		tlb_page_demap(TTE_GET_PMAP(tp), TTE_GET_VA(tp));
 	}
 	m->md.color = DCACHE_OTHER_COLOR(color);
 
 	if (m->md.color == DCACHE_COLOR(VM_PAGE_TO_PHYS(m)))
 		PMAP_STATS_INC(pmap_ncache_remove_cc);
 	else
 		PMAP_STATS_INC(pmap_ncache_remove_coc);
 }
 
 /*
  * Map a wired page into kernel virtual address space.
  */
 void
 pmap_kenter(vm_offset_t va, vm_page_t m)
 {
 	vm_offset_t ova;
 	struct tte *tp;
 	vm_page_t om;
 	u_long data;
 
 	mtx_assert(&vm_page_queue_mtx, MA_OWNED);
 	PMAP_STATS_INC(pmap_nkenter);
 	tp = tsb_kvtotte(va);
 	CTR4(KTR_PMAP, "pmap_kenter: va=%#lx pa=%#lx tp=%p data=%#lx",
 	    va, VM_PAGE_TO_PHYS(m), tp, tp->tte_data);
 	if (DCACHE_COLOR(VM_PAGE_TO_PHYS(m)) != DCACHE_COLOR(va)) {
 		CTR5(KTR_SPARE2,
 	"pmap_kenter: off color va=%#lx pa=%#lx o=%p ot=%d pi=%#lx",
 		    va, VM_PAGE_TO_PHYS(m), m->object,
 		    m->object ? m->object->type : -1,
 		    m->pindex);
 		PMAP_STATS_INC(pmap_nkenter_oc);
 	}
 	if ((tp->tte_data & TD_V) != 0) {
 		om = PHYS_TO_VM_PAGE(TTE_GET_PA(tp));
 		ova = TTE_GET_VA(tp);
 		if (m == om && va == ova) {
 			PMAP_STATS_INC(pmap_nkenter_stupid);
 			return;
 		}
 		TAILQ_REMOVE(&om->md.tte_list, tp, tte_link);
 		pmap_cache_remove(om, ova);
 		if (va != ova)
 			tlb_page_demap(kernel_pmap, ova);
 	}
 	data = TD_V | TD_8K | VM_PAGE_TO_PHYS(m) | TD_REF | TD_SW | TD_CP |
 	    TD_P | TD_W;
 	if (pmap_cache_enter(m, va) != 0)
 		data |= TD_CV;
 	tp->tte_vpn = TV_VPN(va, TS_8K);
 	tp->tte_data = data;
 	TAILQ_INSERT_TAIL(&m->md.tte_list, tp, tte_link);
 }
 
 /*
  * Map a wired page into kernel virtual address space.  This additionally
  * takes a flag argument which is or'ed to the TTE data.  This is used by
  * sparc64_bus_mem_map().
  * NOTE: if the mapping is non-cacheable, it's the caller's responsibility
  * to flush entries that might still be in the cache, if applicable.
  */
 void
 pmap_kenter_flags(vm_offset_t va, vm_paddr_t pa, u_long flags)
 {
 	struct tte *tp;
 
 	tp = tsb_kvtotte(va);
 	CTR4(KTR_PMAP, "pmap_kenter_flags: va=%#lx pa=%#lx tp=%p data=%#lx",
 	    va, pa, tp, tp->tte_data);
 	tp->tte_vpn = TV_VPN(va, TS_8K);
 	tp->tte_data = TD_V | TD_8K | TD_PA(pa) | TD_REF | TD_P | flags;
 }
 
 /*
  * Remove a wired page from kernel virtual address space.
  */
 void
 pmap_kremove(vm_offset_t va)
 {
 	struct tte *tp;
 	vm_page_t m;
 
 	mtx_assert(&vm_page_queue_mtx, MA_OWNED);
 	PMAP_STATS_INC(pmap_nkremove);
 	tp = tsb_kvtotte(va);
 	CTR3(KTR_PMAP, "pmap_kremove: va=%#lx tp=%p data=%#lx", va, tp,
 	    tp->tte_data);
 	if ((tp->tte_data & TD_V) == 0)
 		return;
 	m = PHYS_TO_VM_PAGE(TTE_GET_PA(tp));
 	TAILQ_REMOVE(&m->md.tte_list, tp, tte_link);
 	pmap_cache_remove(m, va);
 	TTE_ZERO(tp);
 }
 
 /*
  * Inverse of pmap_kenter_flags, used by bus_space_unmap().
  */
 void
 pmap_kremove_flags(vm_offset_t va)
 {
 	struct tte *tp;
 
 	tp = tsb_kvtotte(va);
 	CTR3(KTR_PMAP, "pmap_kremove_flags: va=%#lx tp=%p data=%#lx", va, tp,
 	    tp->tte_data);
 	TTE_ZERO(tp);
 }
 
 /*
  * Map a range of physical addresses into kernel virtual address space.
  *
  * The value passed in *virt is a suggested virtual address for the mapping.
  * Architectures which can support a direct-mapped physical to virtual region
  * can return the appropriate address within that region, leaving '*virt'
  * unchanged.
  */
 vm_offset_t
 pmap_map(vm_offset_t *virt, vm_paddr_t start, vm_paddr_t end, int prot)
 {
 
 	return (TLB_PHYS_TO_DIRECT(start));
 }
 
 /*
  * Map a list of wired pages into kernel virtual address space.  This is
  * intended for temporary mappings which do not need page modification or
  * references recorded.  Existing mappings in the region are overwritten.
  */
 void
 pmap_qenter(vm_offset_t sva, vm_page_t *m, int count)
 {
 	vm_offset_t va;
 	int locked;
 
 	PMAP_STATS_INC(pmap_nqenter);
 	va = sva;
 	if (!(locked = mtx_owned(&vm_page_queue_mtx)))
 		vm_page_lock_queues();
 	while (count-- > 0) {
 		pmap_kenter(va, *m);
 		va += PAGE_SIZE;
 		m++;
 	}
 	if (!locked)
 		vm_page_unlock_queues();
 	tlb_range_demap(kernel_pmap, sva, va);
 }
 
 /*
  * Remove page mappings from kernel virtual address space.  Intended for
  * temporary mappings entered by pmap_qenter.
  */
 void
 pmap_qremove(vm_offset_t sva, int count)
 {
 	vm_offset_t va;
 	int locked;
 
 	PMAP_STATS_INC(pmap_nqremove);
 	va = sva;
 	if (!(locked = mtx_owned(&vm_page_queue_mtx)))
 		vm_page_lock_queues();
 	while (count-- > 0) {
 		pmap_kremove(va);
 		va += PAGE_SIZE;
 	}
 	if (!locked)
 		vm_page_unlock_queues();
 	tlb_range_demap(kernel_pmap, sva, va);
 }
 
 /*
  * Initialize the pmap associated with process 0.
  */
 void
 pmap_pinit0(pmap_t pm)
 {
 	int i;
 
 	PMAP_LOCK_INIT(pm);
 	for (i = 0; i < MAXCPU; i++)
 		pm->pm_context[i] = TLB_CTX_KERNEL;
-	pm->pm_active = 0;
+	CPU_ZERO(&pm->pm_active);
 	pm->pm_tsb = NULL;
 	pm->pm_tsb_obj = NULL;
 	bzero(&pm->pm_stats, sizeof(pm->pm_stats));
 }
 
 /*
  * Initialize a preallocated and zeroed pmap structure, such as one in a
  * vmspace structure.
  */
 int
 pmap_pinit(pmap_t pm)
 {
 	vm_page_t ma[TSB_PAGES];
 	vm_page_t m;
 	int i;
 
 	PMAP_LOCK_INIT(pm);
 
 	/*
 	 * Allocate KVA space for the TSB.
 	 */
 	if (pm->pm_tsb == NULL) {
 		pm->pm_tsb = (struct tte *)kmem_alloc_nofault(kernel_map,
 		    TSB_BSIZE);
 		if (pm->pm_tsb == NULL) {
 			PMAP_LOCK_DESTROY(pm);
 			return (0);
 		}
 	}
 
 	/*
 	 * Allocate an object for it.
 	 */
 	if (pm->pm_tsb_obj == NULL)
 		pm->pm_tsb_obj = vm_object_allocate(OBJT_PHYS, TSB_PAGES);
 
 	mtx_lock_spin(&sched_lock);
 	for (i = 0; i < MAXCPU; i++)
 		pm->pm_context[i] = -1;
-	pm->pm_active = 0;
+	CPU_ZERO(&pm->pm_active);
 	mtx_unlock_spin(&sched_lock);
 
 	VM_OBJECT_LOCK(pm->pm_tsb_obj);
 	for (i = 0; i < TSB_PAGES; i++) {
 		m = vm_page_grab(pm->pm_tsb_obj, i, VM_ALLOC_NOBUSY |
 		    VM_ALLOC_RETRY | VM_ALLOC_WIRED | VM_ALLOC_ZERO);
 		m->valid = VM_PAGE_BITS_ALL;
 		m->md.pmap = pm;
 		ma[i] = m;
 	}
 	VM_OBJECT_UNLOCK(pm->pm_tsb_obj);
 	pmap_qenter((vm_offset_t)pm->pm_tsb, ma, TSB_PAGES);
 
 	bzero(&pm->pm_stats, sizeof(pm->pm_stats));
 	return (1);
 }
 
 /*
  * Release any resources held by the given physical map.
  * Called when a pmap initialized by pmap_pinit is being released.
  * Should only be called if the map contains no valid mappings.
  */
 void
 pmap_release(pmap_t pm)
 {
 	vm_object_t obj;
 	vm_page_t m;
 	struct pcpu *pc;
 
 	CTR2(KTR_PMAP, "pmap_release: ctx=%#x tsb=%p",
 	    pm->pm_context[curcpu], pm->pm_tsb);
 	KASSERT(pmap_resident_count(pm) == 0,
 	    ("pmap_release: resident pages %ld != 0",
 	    pmap_resident_count(pm)));
 
 	/*
 	 * After the pmap was freed, it might be reallocated to a new process.
 	 * When switching, this might lead us to wrongly assume that we need
 	 * not switch contexts because old and new pmap pointer are equal.
 	 * Therefore, make sure that this pmap is not referenced by any PCPU
 	 * pointer any more.  This could happen in two cases:
 	 * - A process that referenced the pmap is currently exiting on a CPU.
 	 *   However, it is guaranteed to not switch in any more after setting
 	 *   its state to PRS_ZOMBIE.
 	 * - A process that referenced this pmap ran on a CPU, but we switched
 	 *   to a kernel thread, leaving the pmap pointer unchanged.
 	 */
 	mtx_lock_spin(&sched_lock);
 	STAILQ_FOREACH(pc, &cpuhead, pc_allcpu)
 		if (pc->pc_pmap == pm)
 			pc->pc_pmap = NULL;
 	mtx_unlock_spin(&sched_lock);
 
 	obj = pm->pm_tsb_obj;
 	VM_OBJECT_LOCK(obj);
 	KASSERT(obj->ref_count == 1, ("pmap_release: tsbobj ref count != 1"));
 	while (!TAILQ_EMPTY(&obj->memq)) {
 		m = TAILQ_FIRST(&obj->memq);
 		m->md.pmap = NULL;
 		m->wire_count--;
 		atomic_subtract_int(&cnt.v_wire_count, 1);
 		vm_page_free_zero(m);
 	}
 	VM_OBJECT_UNLOCK(obj);
 	pmap_qremove((vm_offset_t)pm->pm_tsb, TSB_PAGES);
 	PMAP_LOCK_DESTROY(pm);
 }
 
 /*
  * Grow the number of kernel page table entries.  Unneeded.
  */
 void
 pmap_growkernel(vm_offset_t addr)
 {
 
 	panic("pmap_growkernel: can't grow kernel");
 }
 
 int
 pmap_remove_tte(struct pmap *pm, struct pmap *pm2, struct tte *tp,
     vm_offset_t va)
 {
 	vm_page_t m;
 	u_long data;
 
 	mtx_assert(&vm_page_queue_mtx, MA_OWNED);
 	data = atomic_readandclear_long(&tp->tte_data);
 	if ((data & TD_FAKE) == 0) {
 		m = PHYS_TO_VM_PAGE(TD_PA(data));
 		TAILQ_REMOVE(&m->md.tte_list, tp, tte_link);
 		if ((data & TD_WIRED) != 0)
 			pm->pm_stats.wired_count--;
 		if ((data & TD_PV) != 0) {
 			if ((data & TD_W) != 0)
 				vm_page_dirty(m);
 			if ((data & TD_REF) != 0)
 				vm_page_flag_set(m, PG_REFERENCED);
 			if (TAILQ_EMPTY(&m->md.tte_list))
 				vm_page_flag_clear(m, PG_WRITEABLE);
 			pm->pm_stats.resident_count--;
 		}
 		pmap_cache_remove(m, va);
 	}
 	TTE_ZERO(tp);
 	if (PMAP_REMOVE_DONE(pm))
 		return (0);
 	return (1);
 }
 
 /*
  * Remove the given range of addresses from the specified map.
  */
 void
 pmap_remove(pmap_t pm, vm_offset_t start, vm_offset_t end)
 {
 	struct tte *tp;
 	vm_offset_t va;
 
 	CTR3(KTR_PMAP, "pmap_remove: ctx=%#lx start=%#lx end=%#lx",
 	    pm->pm_context[curcpu], start, end);
 	if (PMAP_REMOVE_DONE(pm))
 		return;
 	vm_page_lock_queues();
 	PMAP_LOCK(pm);
 	if (end - start > PMAP_TSB_THRESH) {
 		tsb_foreach(pm, NULL, start, end, pmap_remove_tte);
 		tlb_context_demap(pm);
 	} else {
 		for (va = start; va < end; va += PAGE_SIZE)
 			if ((tp = tsb_tte_lookup(pm, va)) != NULL &&
 			    !pmap_remove_tte(pm, NULL, tp, va))
 				break;
 		tlb_range_demap(pm, start, end - 1);
 	}
 	PMAP_UNLOCK(pm);
 	vm_page_unlock_queues();
 }
 
 void
 pmap_remove_all(vm_page_t m)
 {
 	struct pmap *pm;
 	struct tte *tpn;
 	struct tte *tp;
 	vm_offset_t va;
 
 	vm_page_lock_queues();
 	for (tp = TAILQ_FIRST(&m->md.tte_list); tp != NULL; tp = tpn) {
 		tpn = TAILQ_NEXT(tp, tte_link);
 		if ((tp->tte_data & TD_PV) == 0)
 			continue;
 		pm = TTE_GET_PMAP(tp);
 		va = TTE_GET_VA(tp);
 		PMAP_LOCK(pm);
 		if ((tp->tte_data & TD_WIRED) != 0)
 			pm->pm_stats.wired_count--;
 		if ((tp->tte_data & TD_REF) != 0)
 			vm_page_flag_set(m, PG_REFERENCED);
 		if ((tp->tte_data & TD_W) != 0)
 			vm_page_dirty(m);
 		tp->tte_data &= ~TD_V;
 		tlb_page_demap(pm, va);
 		TAILQ_REMOVE(&m->md.tte_list, tp, tte_link);
 		pm->pm_stats.resident_count--;
 		pmap_cache_remove(m, va);
 		TTE_ZERO(tp);
 		PMAP_UNLOCK(pm);
 	}
 	vm_page_flag_clear(m, PG_WRITEABLE);
 	vm_page_unlock_queues();
 }
 
 int
 pmap_protect_tte(struct pmap *pm, struct pmap *pm2, struct tte *tp,
     vm_offset_t va)
 {
 	u_long data;
 	vm_page_t m;
 
 	data = atomic_clear_long(&tp->tte_data, TD_SW | TD_W);
 	if ((data & (TD_PV | TD_W)) == (TD_PV | TD_W)) {
 		m = PHYS_TO_VM_PAGE(TD_PA(data));
 		vm_page_dirty(m);
 	}
 	return (1);
 }
 
 /*
  * Set the physical protection on the specified range of this map as requested.
  */
 void
 pmap_protect(pmap_t pm, vm_offset_t sva, vm_offset_t eva, vm_prot_t prot)
 {
 	vm_offset_t va;
 	struct tte *tp;
 
 	CTR4(KTR_PMAP, "pmap_protect: ctx=%#lx sva=%#lx eva=%#lx prot=%#lx",
 	    pm->pm_context[curcpu], sva, eva, prot);
 
 	if ((prot & VM_PROT_READ) == VM_PROT_NONE) {
 		pmap_remove(pm, sva, eva);
 		return;
 	}
 
 	if (prot & VM_PROT_WRITE)
 		return;
 
 	vm_page_lock_queues();
 	PMAP_LOCK(pm);
 	if (eva - sva > PMAP_TSB_THRESH) {
 		tsb_foreach(pm, NULL, sva, eva, pmap_protect_tte);
 		tlb_context_demap(pm);
 	} else {
 		for (va = sva; va < eva; va += PAGE_SIZE)
 			if ((tp = tsb_tte_lookup(pm, va)) != NULL)
 				pmap_protect_tte(pm, NULL, tp, va);
 		tlb_range_demap(pm, sva, eva - 1);
 	}
 	PMAP_UNLOCK(pm);
 	vm_page_unlock_queues();
 }
 
 /*
  * Map the given physical page at the specified virtual address in the
  * target pmap with the protection requested.  If specified the page
  * will be wired down.
  */
 void
 pmap_enter(pmap_t pm, vm_offset_t va, vm_prot_t access, vm_page_t m,
     vm_prot_t prot, boolean_t wired)
 {
 
 	vm_page_lock_queues();
 	PMAP_LOCK(pm);
 	pmap_enter_locked(pm, va, m, prot, wired);
 	vm_page_unlock_queues();
 	PMAP_UNLOCK(pm);
 }
 
 /*
  * Map the given physical page at the specified virtual address in the
  * target pmap with the protection requested.  If specified the page
  * will be wired down.
  *
  * The page queues and pmap must be locked.
  */
 static void
 pmap_enter_locked(pmap_t pm, vm_offset_t va, vm_page_t m, vm_prot_t prot,
     boolean_t wired)
 {
 	struct tte *tp;
 	vm_paddr_t pa;
 	u_long data;
 	int i;
 
 	mtx_assert(&vm_page_queue_mtx, MA_OWNED);
 	PMAP_LOCK_ASSERT(pm, MA_OWNED);
 	KASSERT((m->flags & (PG_FICTITIOUS | PG_UNMANAGED)) != 0 ||
 	    (m->oflags & VPO_BUSY) != 0 || VM_OBJECT_LOCKED(m->object),
 	    ("pmap_enter_locked: page %p is not busy", m));
 	PMAP_STATS_INC(pmap_nenter);
 	pa = VM_PAGE_TO_PHYS(m);
 
 	/*
 	 * If this is a fake page from the device_pager, but it covers actual
 	 * physical memory, convert to the real backing page.
 	 */
 	if ((m->flags & PG_FICTITIOUS) != 0) {
 		for (i = 0; phys_avail[i + 1] != 0; i += 2) {
 			if (pa >= phys_avail[i] && pa <= phys_avail[i + 1]) {
 				m = PHYS_TO_VM_PAGE(pa);
 				break;
 			}
 		}
 	}
 
 	CTR6(KTR_PMAP,
 	    "pmap_enter_locked: ctx=%p m=%p va=%#lx pa=%#lx prot=%#x wired=%d",
 	    pm->pm_context[curcpu], m, va, pa, prot, wired);
 
 	/*
 	 * If there is an existing mapping, and the physical address has not
 	 * changed, must be protection or wiring change.
 	 */
 	if ((tp = tsb_tte_lookup(pm, va)) != NULL && TTE_GET_PA(tp) == pa) {
 		CTR0(KTR_PMAP, "pmap_enter_locked: update");
 		PMAP_STATS_INC(pmap_nenter_update);
 
 		/*
 		 * Wiring change, just update stats.
 		 */
 		if (wired) {
 			if ((tp->tte_data & TD_WIRED) == 0) {
 				tp->tte_data |= TD_WIRED;
 				pm->pm_stats.wired_count++;
 			}
 		} else {
 			if ((tp->tte_data & TD_WIRED) != 0) {
 				tp->tte_data &= ~TD_WIRED;
 				pm->pm_stats.wired_count--;
 			}
 		}
 
 		/*
 		 * Save the old bits and clear the ones we're interested in.
 		 */
 		data = tp->tte_data;
 		tp->tte_data &= ~(TD_EXEC | TD_SW | TD_W);
 
 		/*
 		 * If we're turning off write permissions, sense modify status.
 		 */
 		if ((prot & VM_PROT_WRITE) != 0) {
 			tp->tte_data |= TD_SW;
 			if (wired)
 				tp->tte_data |= TD_W;
 			if ((m->flags & (PG_FICTITIOUS | PG_UNMANAGED)) == 0)
 				vm_page_flag_set(m, PG_WRITEABLE);
 		} else if ((data & TD_W) != 0)
 			vm_page_dirty(m);
 
 		/*
 		 * If we're turning on execute permissions, flush the icache.
 		 */
 		if ((prot & VM_PROT_EXECUTE) != 0) {
 			if ((data & TD_EXEC) == 0)
 				icache_page_inval(pa);
 			tp->tte_data |= TD_EXEC;
 		}
 
 		/*
 		 * Delete the old mapping.
 		 */
 		tlb_page_demap(pm, TTE_GET_VA(tp));
 	} else {
 		/*
 		 * If there is an existing mapping, but it's for a different
 		 * physical address, delete the old mapping.
 		 */
 		if (tp != NULL) {
 			CTR0(KTR_PMAP, "pmap_enter_locked: replace");
 			PMAP_STATS_INC(pmap_nenter_replace);
 			pmap_remove_tte(pm, NULL, tp, va);
 			tlb_page_demap(pm, va);
 		} else {
 			CTR0(KTR_PMAP, "pmap_enter_locked: new");
 			PMAP_STATS_INC(pmap_nenter_new);
 		}
 
 		/*
 		 * Now set up the data and install the new mapping.
 		 */
 		data = TD_V | TD_8K | TD_PA(pa);
 		if (pm == kernel_pmap)
 			data |= TD_P;
 		if ((prot & VM_PROT_WRITE) != 0) {
 			data |= TD_SW;
 			if ((m->flags & (PG_FICTITIOUS | PG_UNMANAGED)) == 0)
 				vm_page_flag_set(m, PG_WRITEABLE);
 		}
 		if (prot & VM_PROT_EXECUTE) {
 			data |= TD_EXEC;
 			icache_page_inval(pa);
 		}
 
 		/*
 		 * If it's wired, update stats.  We also don't need reference or
 		 * modify tracking for wired mappings, so set the bits now.
 		 */
 		if (wired) {
 			pm->pm_stats.wired_count++;
 			data |= TD_REF | TD_WIRED;
 			if ((prot & VM_PROT_WRITE) != 0)
 				data |= TD_W;
 		}
 
 		tsb_tte_enter(pm, m, va, TS_8K, data);
 	}
 }
 
 /*
  * Maps a sequence of resident pages belonging to the same object.
  * The sequence begins with the given page m_start.  This page is
  * mapped at the given virtual address start.  Each subsequent page is
  * mapped at a virtual address that is offset from start by the same
  * amount as the page is offset from m_start within the object.  The
  * last page in the sequence is the page with the largest offset from
  * m_start that can be mapped at a virtual address less than the given
  * virtual address end.  Not every virtual page between start and end
  * is mapped; only those for which a resident page exists with the
  * corresponding offset from m_start are mapped.
  */
 void
 pmap_enter_object(pmap_t pm, vm_offset_t start, vm_offset_t end,
     vm_page_t m_start, vm_prot_t prot)
 {
 	vm_page_t m;
 	vm_pindex_t diff, psize;
 
 	psize = atop(end - start);
 	m = m_start;
 	vm_page_lock_queues();
 	PMAP_LOCK(pm);
 	while (m != NULL && (diff = m->pindex - m_start->pindex) < psize) {
 		pmap_enter_locked(pm, start + ptoa(diff), m, prot &
 		    (VM_PROT_READ | VM_PROT_EXECUTE), FALSE);
 		m = TAILQ_NEXT(m, listq);
 	}
 	vm_page_unlock_queues();
 	PMAP_UNLOCK(pm);
 }
 
 void
 pmap_enter_quick(pmap_t pm, vm_offset_t va, vm_page_t m, vm_prot_t prot)
 {
 
 	vm_page_lock_queues();
 	PMAP_LOCK(pm);
 	pmap_enter_locked(pm, va, m, prot & (VM_PROT_READ | VM_PROT_EXECUTE),
 	    FALSE);
 	vm_page_unlock_queues();
 	PMAP_UNLOCK(pm);
 }
 
 void
 pmap_object_init_pt(pmap_t pm, vm_offset_t addr, vm_object_t object,
     vm_pindex_t pindex, vm_size_t size)
 {
 
 	VM_OBJECT_LOCK_ASSERT(object, MA_OWNED);
 	KASSERT(object->type == OBJT_DEVICE || object->type == OBJT_SG,
 	    ("pmap_object_init_pt: non-device object"));
 }
 
 /*
  * Change the wiring attribute for a map/virtual-address pair.
  * The mapping must already exist in the pmap.
  */
 void
 pmap_change_wiring(pmap_t pm, vm_offset_t va, boolean_t wired)
 {
 	struct tte *tp;
 	u_long data;
 
 	PMAP_LOCK(pm);
 	if ((tp = tsb_tte_lookup(pm, va)) != NULL) {
 		if (wired) {
 			data = atomic_set_long(&tp->tte_data, TD_WIRED);
 			if ((data & TD_WIRED) == 0)
 				pm->pm_stats.wired_count++;
 		} else {
 			data = atomic_clear_long(&tp->tte_data, TD_WIRED);
 			if ((data & TD_WIRED) != 0)
 				pm->pm_stats.wired_count--;
 		}
 	}
 	PMAP_UNLOCK(pm);
 }
 
 static int
 pmap_copy_tte(pmap_t src_pmap, pmap_t dst_pmap, struct tte *tp,
     vm_offset_t va)
 {
 	vm_page_t m;
 	u_long data;
 
 	if ((tp->tte_data & TD_FAKE) != 0)
 		return (1);
 	if (tsb_tte_lookup(dst_pmap, va) == NULL) {
 		data = tp->tte_data &
 		    ~(TD_PV | TD_REF | TD_SW | TD_CV | TD_W);
 		m = PHYS_TO_VM_PAGE(TTE_GET_PA(tp));
 		tsb_tte_enter(dst_pmap, m, va, TS_8K, data);
 	}
 	return (1);
 }
 
 void
 pmap_copy(pmap_t dst_pmap, pmap_t src_pmap, vm_offset_t dst_addr,
     vm_size_t len, vm_offset_t src_addr)
 {
 	struct tte *tp;
 	vm_offset_t va;
 
 	if (dst_addr != src_addr)
 		return;
 	vm_page_lock_queues();
 	if (dst_pmap < src_pmap) {
 		PMAP_LOCK(dst_pmap);
 		PMAP_LOCK(src_pmap);
 	} else {
 		PMAP_LOCK(src_pmap);
 		PMAP_LOCK(dst_pmap);
 	}
 	if (len > PMAP_TSB_THRESH) {
 		tsb_foreach(src_pmap, dst_pmap, src_addr, src_addr + len,
 		    pmap_copy_tte);
 		tlb_context_demap(dst_pmap);
 	} else {
 		for (va = src_addr; va < src_addr + len; va += PAGE_SIZE)
 			if ((tp = tsb_tte_lookup(src_pmap, va)) != NULL)
 				pmap_copy_tte(src_pmap, dst_pmap, tp, va);
 		tlb_range_demap(dst_pmap, src_addr, src_addr + len - 1);
 	}
 	vm_page_unlock_queues();
 	PMAP_UNLOCK(src_pmap);
 	PMAP_UNLOCK(dst_pmap);
 }
 
 void
 pmap_zero_page(vm_page_t m)
 {
 	struct tte *tp;
 	vm_offset_t va;
 	vm_paddr_t pa;
 
 	KASSERT((m->flags & PG_FICTITIOUS) == 0,
 	    ("pmap_zero_page: fake page"));
 	PMAP_STATS_INC(pmap_nzero_page);
 	pa = VM_PAGE_TO_PHYS(m);
 	if (dcache_color_ignore != 0 || m->md.color == DCACHE_COLOR(pa)) {
 		PMAP_STATS_INC(pmap_nzero_page_c);
 		va = TLB_PHYS_TO_DIRECT(pa);
 		cpu_block_zero((void *)va, PAGE_SIZE);
 	} else if (m->md.color == -1) {
 		PMAP_STATS_INC(pmap_nzero_page_nc);
 		aszero(ASI_PHYS_USE_EC, pa, PAGE_SIZE);
 	} else {
 		PMAP_STATS_INC(pmap_nzero_page_oc);
 		PMAP_LOCK(kernel_pmap);
 		va = pmap_temp_map_1 + (m->md.color * PAGE_SIZE);
 		tp = tsb_kvtotte(va);
 		tp->tte_data = TD_V | TD_8K | TD_PA(pa) | TD_CP | TD_CV | TD_W;
 		tp->tte_vpn = TV_VPN(va, TS_8K);
 		cpu_block_zero((void *)va, PAGE_SIZE);
 		tlb_page_demap(kernel_pmap, va);
 		PMAP_UNLOCK(kernel_pmap);
 	}
 }
 
 void
 pmap_zero_page_area(vm_page_t m, int off, int size)
 {
 	struct tte *tp;
 	vm_offset_t va;
 	vm_paddr_t pa;
 
 	KASSERT((m->flags & PG_FICTITIOUS) == 0,
 	    ("pmap_zero_page_area: fake page"));
 	KASSERT(off + size <= PAGE_SIZE, ("pmap_zero_page_area: bad off/size"));
 	PMAP_STATS_INC(pmap_nzero_page_area);
 	pa = VM_PAGE_TO_PHYS(m);
 	if (dcache_color_ignore != 0 || m->md.color == DCACHE_COLOR(pa)) {
 		PMAP_STATS_INC(pmap_nzero_page_area_c);
 		va = TLB_PHYS_TO_DIRECT(pa);
 		bzero((void *)(va + off), size);
 	} else if (m->md.color == -1) {
 		PMAP_STATS_INC(pmap_nzero_page_area_nc);
 		aszero(ASI_PHYS_USE_EC, pa + off, size);
 	} else {
 		PMAP_STATS_INC(pmap_nzero_page_area_oc);
 		PMAP_LOCK(kernel_pmap);
 		va = pmap_temp_map_1 + (m->md.color * PAGE_SIZE);
 		tp = tsb_kvtotte(va);
 		tp->tte_data = TD_V | TD_8K | TD_PA(pa) | TD_CP | TD_CV | TD_W;
 		tp->tte_vpn = TV_VPN(va, TS_8K);
 		bzero((void *)(va + off), size);
 		tlb_page_demap(kernel_pmap, va);
 		PMAP_UNLOCK(kernel_pmap);
 	}
 }
 
 void
 pmap_zero_page_idle(vm_page_t m)
 {
 	struct tte *tp;
 	vm_offset_t va;
 	vm_paddr_t pa;
 
 	KASSERT((m->flags & PG_FICTITIOUS) == 0,
 	    ("pmap_zero_page_idle: fake page"));
 	PMAP_STATS_INC(pmap_nzero_page_idle);
 	pa = VM_PAGE_TO_PHYS(m);
 	if (dcache_color_ignore != 0 || m->md.color == DCACHE_COLOR(pa)) {
 		PMAP_STATS_INC(pmap_nzero_page_idle_c);
 		va = TLB_PHYS_TO_DIRECT(pa);
 		cpu_block_zero((void *)va, PAGE_SIZE);
 	} else if (m->md.color == -1) {
 		PMAP_STATS_INC(pmap_nzero_page_idle_nc);
 		aszero(ASI_PHYS_USE_EC, pa, PAGE_SIZE);
 	} else {
 		PMAP_STATS_INC(pmap_nzero_page_idle_oc);
 		va = pmap_idle_map + (m->md.color * PAGE_SIZE);
 		tp = tsb_kvtotte(va);
 		tp->tte_data = TD_V | TD_8K | TD_PA(pa) | TD_CP | TD_CV | TD_W;
 		tp->tte_vpn = TV_VPN(va, TS_8K);
 		cpu_block_zero((void *)va, PAGE_SIZE);
 		tlb_page_demap(kernel_pmap, va);
 	}
 }
 
 void
 pmap_copy_page(vm_page_t msrc, vm_page_t mdst)
 {
 	vm_offset_t vdst;
 	vm_offset_t vsrc;
 	vm_paddr_t pdst;
 	vm_paddr_t psrc;
 	struct tte *tp;
 
 	KASSERT((mdst->flags & PG_FICTITIOUS) == 0,
 	    ("pmap_copy_page: fake dst page"));
 	KASSERT((msrc->flags & PG_FICTITIOUS) == 0,
 	    ("pmap_copy_page: fake src page"));
 	PMAP_STATS_INC(pmap_ncopy_page);
 	pdst = VM_PAGE_TO_PHYS(mdst);
 	psrc = VM_PAGE_TO_PHYS(msrc);
 	if (dcache_color_ignore != 0 ||
 	    (msrc->md.color == DCACHE_COLOR(psrc) &&
 	    mdst->md.color == DCACHE_COLOR(pdst))) {
 		PMAP_STATS_INC(pmap_ncopy_page_c);
 		vdst = TLB_PHYS_TO_DIRECT(pdst);
 		vsrc = TLB_PHYS_TO_DIRECT(psrc);
 		cpu_block_copy((void *)vsrc, (void *)vdst, PAGE_SIZE);
 	} else if (msrc->md.color == -1 && mdst->md.color == -1) {
 		PMAP_STATS_INC(pmap_ncopy_page_nc);
 		ascopy(ASI_PHYS_USE_EC, psrc, pdst, PAGE_SIZE);
 	} else if (msrc->md.color == -1) {
 		if (mdst->md.color == DCACHE_COLOR(pdst)) {
 			PMAP_STATS_INC(pmap_ncopy_page_dc);
 			vdst = TLB_PHYS_TO_DIRECT(pdst);
 			ascopyfrom(ASI_PHYS_USE_EC, psrc, (void *)vdst,
 			    PAGE_SIZE);
 		} else {
 			PMAP_STATS_INC(pmap_ncopy_page_doc);
 			PMAP_LOCK(kernel_pmap);
 			vdst = pmap_temp_map_1 + (mdst->md.color * PAGE_SIZE);
 			tp = tsb_kvtotte(vdst);
 			tp->tte_data =
 			    TD_V | TD_8K | TD_PA(pdst) | TD_CP | TD_CV | TD_W;
 			tp->tte_vpn = TV_VPN(vdst, TS_8K);
 			ascopyfrom(ASI_PHYS_USE_EC, psrc, (void *)vdst,
 			    PAGE_SIZE);
 			tlb_page_demap(kernel_pmap, vdst);
 			PMAP_UNLOCK(kernel_pmap);
 		}
 	} else if (mdst->md.color == -1) {
 		if (msrc->md.color == DCACHE_COLOR(psrc)) {
 			PMAP_STATS_INC(pmap_ncopy_page_sc);
 			vsrc = TLB_PHYS_TO_DIRECT(psrc);
 			ascopyto((void *)vsrc, ASI_PHYS_USE_EC, pdst,
 			    PAGE_SIZE);
 		} else {
 			PMAP_STATS_INC(pmap_ncopy_page_soc);
 			PMAP_LOCK(kernel_pmap);
 			vsrc = pmap_temp_map_1 + (msrc->md.color * PAGE_SIZE);
 			tp = tsb_kvtotte(vsrc);
 			tp->tte_data =
 			    TD_V | TD_8K | TD_PA(psrc) | TD_CP | TD_CV | TD_W;
 			tp->tte_vpn = TV_VPN(vsrc, TS_8K);
 			ascopyto((void *)vsrc, ASI_PHYS_USE_EC, pdst,
 			    PAGE_SIZE);
 			tlb_page_demap(kernel_pmap, vsrc);
 			PMAP_UNLOCK(kernel_pmap);
 		}
 	} else {
 		PMAP_STATS_INC(pmap_ncopy_page_oc);
 		PMAP_LOCK(kernel_pmap);
 		vdst = pmap_temp_map_1 + (mdst->md.color * PAGE_SIZE);
 		tp = tsb_kvtotte(vdst);
 		tp->tte_data =
 		    TD_V | TD_8K | TD_PA(pdst) | TD_CP | TD_CV | TD_W;
 		tp->tte_vpn = TV_VPN(vdst, TS_8K);
 		vsrc = pmap_temp_map_2 + (msrc->md.color * PAGE_SIZE);
 		tp = tsb_kvtotte(vsrc);
 		tp->tte_data =
 		    TD_V | TD_8K | TD_PA(psrc) | TD_CP | TD_CV | TD_W;
 		tp->tte_vpn = TV_VPN(vsrc, TS_8K);
 		cpu_block_copy((void *)vsrc, (void *)vdst, PAGE_SIZE);
 		tlb_page_demap(kernel_pmap, vdst);
 		tlb_page_demap(kernel_pmap, vsrc);
 		PMAP_UNLOCK(kernel_pmap);
 	}
 }
 
 /*
  * Returns true if the pmap's pv is one of the first
  * 16 pvs linked to from this page.  This count may
  * be changed upwards or downwards in the future; it
  * is only necessary that true be returned for a small
  * subset of pmaps for proper page aging.
  */
 boolean_t
 pmap_page_exists_quick(pmap_t pm, vm_page_t m)
 {
 	struct tte *tp;
 	int loops;
 	boolean_t rv;
 
 	KASSERT((m->flags & (PG_FICTITIOUS | PG_UNMANAGED)) == 0,
 	    ("pmap_page_exists_quick: page %p is not managed", m));
 	loops = 0;
 	rv = FALSE;
 	vm_page_lock_queues();
 	TAILQ_FOREACH(tp, &m->md.tte_list, tte_link) {
 		if ((tp->tte_data & TD_PV) == 0)
 			continue;
 		if (TTE_GET_PMAP(tp) == pm) {
 			rv = TRUE;
 			break;
 		}
 		if (++loops >= 16)
 			break;
 	}
 	vm_page_unlock_queues();
 	return (rv);
 }
 
 /*
  * Return the number of managed mappings to the given physical page
  * that are wired.
  */
 int
 pmap_page_wired_mappings(vm_page_t m)
 {
 	struct tte *tp;
 	int count;
 
 	count = 0;
 	if ((m->flags & PG_FICTITIOUS) != 0)
 		return (count);
 	vm_page_lock_queues();
 	TAILQ_FOREACH(tp, &m->md.tte_list, tte_link)
 		if ((tp->tte_data & (TD_PV | TD_WIRED)) == (TD_PV | TD_WIRED))
 			count++;
 	vm_page_unlock_queues();
 	return (count);
 }
 
 /*
  * Remove all pages from specified address space, this aids process exit
  * speeds.  This is much faster than pmap_remove in the case of running down
  * an entire address space.  Only works for the current pmap.
  */
 void
 pmap_remove_pages(pmap_t pm)
 {
 
 }
 
 /*
  * Returns TRUE if the given page has a managed mapping.
  */
 boolean_t
 pmap_page_is_mapped(vm_page_t m)
 {
 	struct tte *tp;
 	boolean_t rv;
 
 	rv = FALSE;
 	if ((m->flags & (PG_FICTITIOUS | PG_UNMANAGED)) != 0)
 		return (rv);
 	vm_page_lock_queues();
 	TAILQ_FOREACH(tp, &m->md.tte_list, tte_link)
 		if ((tp->tte_data & TD_PV) != 0) {
 			rv = TRUE;
 			break;
 		}
 	vm_page_unlock_queues();
 	return (rv);
 }
 
 /*
  * Return a count of reference bits for a page, clearing those bits.
  * It is not necessary for every reference bit to be cleared, but it
  * is necessary that 0 only be returned when there are truly no
  * reference bits set.
  *
  * XXX: The exact number of bits to check and clear is a matter that
  * should be tested and standardized at some point in the future for
  * optimal aging of shared pages.
  */
 int
 pmap_ts_referenced(vm_page_t m)
 {
 	struct tte *tpf;
 	struct tte *tpn;
 	struct tte *tp;
 	u_long data;
 	int count;
 
 	KASSERT((m->flags & (PG_FICTITIOUS | PG_UNMANAGED)) == 0,
 	    ("pmap_ts_referenced: page %p is not managed", m));
 	count = 0;
 	vm_page_lock_queues();
 	if ((tp = TAILQ_FIRST(&m->md.tte_list)) != NULL) {
 		tpf = tp;
 		do {
 			tpn = TAILQ_NEXT(tp, tte_link);
 			TAILQ_REMOVE(&m->md.tte_list, tp, tte_link);
 			TAILQ_INSERT_TAIL(&m->md.tte_list, tp, tte_link);
 			if ((tp->tte_data & TD_PV) == 0)
 				continue;
 			data = atomic_clear_long(&tp->tte_data, TD_REF);
 			if ((data & TD_REF) != 0 && ++count > 4)
 				break;
 		} while ((tp = tpn) != NULL && tp != tpf);
 	}
 	vm_page_unlock_queues();
 	return (count);
 }
 
 boolean_t
 pmap_is_modified(vm_page_t m)
 {
 	struct tte *tp;
 	boolean_t rv;
 
 	KASSERT((m->flags & (PG_FICTITIOUS | PG_UNMANAGED)) == 0,
 	    ("pmap_is_modified: page %p is not managed", m));
 	rv = FALSE;
 
 	/*
 	 * If the page is not VPO_BUSY, then PG_WRITEABLE cannot be
 	 * concurrently set while the object is locked.  Thus, if PG_WRITEABLE
 	 * is clear, no TTEs can have TD_W set.
 	 */
 	VM_OBJECT_LOCK_ASSERT(m->object, MA_OWNED);
 	if ((m->oflags & VPO_BUSY) == 0 &&
 	    (m->flags & PG_WRITEABLE) == 0)
 		return (rv);
 	vm_page_lock_queues();
 	TAILQ_FOREACH(tp, &m->md.tte_list, tte_link) {
 		if ((tp->tte_data & TD_PV) == 0)
 			continue;
 		if ((tp->tte_data & TD_W) != 0) {
 			rv = TRUE;
 			break;
 		}
 	}
 	vm_page_unlock_queues();
 	return (rv);
 }
 
 /*
  *	pmap_is_prefaultable:
  *
  *	Return whether or not the specified virtual address is eligible
  *	for prefault.
  */
 boolean_t
 pmap_is_prefaultable(pmap_t pmap, vm_offset_t addr)
 {
 	boolean_t rv;
 
 	PMAP_LOCK(pmap);
 	rv = tsb_tte_lookup(pmap, addr) == NULL;
 	PMAP_UNLOCK(pmap);
 	return (rv);
 }
 
 /*
  * Return whether or not the specified physical page was referenced
  * in any physical maps.
  */
 boolean_t
 pmap_is_referenced(vm_page_t m)
 {
 	struct tte *tp;
 	boolean_t rv;
 
 	KASSERT((m->flags & (PG_FICTITIOUS | PG_UNMANAGED)) == 0,
 	    ("pmap_is_referenced: page %p is not managed", m));
 	rv = FALSE;
 	vm_page_lock_queues();
 	TAILQ_FOREACH(tp, &m->md.tte_list, tte_link) {
 		if ((tp->tte_data & TD_PV) == 0)
 			continue;
 		if ((tp->tte_data & TD_REF) != 0) {
 			rv = TRUE;
 			break;
 		}
 	}
 	vm_page_unlock_queues();
 	return (rv);
 }
 
 void
 pmap_clear_modify(vm_page_t m)
 {
 	struct tte *tp;
 	u_long data;
 
 	KASSERT((m->flags & (PG_FICTITIOUS | PG_UNMANAGED)) == 0,
 	    ("pmap_clear_modify: page %p is not managed", m));
 	VM_OBJECT_LOCK_ASSERT(m->object, MA_OWNED);
 	KASSERT((m->oflags & VPO_BUSY) == 0,
 	    ("pmap_clear_modify: page %p is busy", m));
 
 	/*
 	 * If the page is not PG_WRITEABLE, then no TTEs can have TD_W set.
 	 * If the object containing the page is locked and the page is not
 	 * VPO_BUSY, then PG_WRITEABLE cannot be concurrently set.
 	 */
 	if ((m->flags & PG_WRITEABLE) == 0)
 		return;
 	vm_page_lock_queues();
 	TAILQ_FOREACH(tp, &m->md.tte_list, tte_link) {
 		if ((tp->tte_data & TD_PV) == 0)
 			continue;
 		data = atomic_clear_long(&tp->tte_data, TD_W);
 		if ((data & TD_W) != 0)
 			tlb_page_demap(TTE_GET_PMAP(tp), TTE_GET_VA(tp));
 	}
 	vm_page_unlock_queues();
 }
 
 void
 pmap_clear_reference(vm_page_t m)
 {
 	struct tte *tp;
 	u_long data;
 
 	KASSERT((m->flags & (PG_FICTITIOUS | PG_UNMANAGED)) == 0,
 	    ("pmap_clear_reference: page %p is not managed", m));
 	vm_page_lock_queues();
 	TAILQ_FOREACH(tp, &m->md.tte_list, tte_link) {
 		if ((tp->tte_data & TD_PV) == 0)
 			continue;
 		data = atomic_clear_long(&tp->tte_data, TD_REF);
 		if ((data & TD_REF) != 0)
 			tlb_page_demap(TTE_GET_PMAP(tp), TTE_GET_VA(tp));
 	}
 	vm_page_unlock_queues();
 }
 
 void
 pmap_remove_write(vm_page_t m)
 {
 	struct tte *tp;
 	u_long data;
 
 	KASSERT((m->flags & (PG_FICTITIOUS | PG_UNMANAGED)) == 0,
 	    ("pmap_remove_write: page %p is not managed", m));
 
 	/*
 	 * If the page is not VPO_BUSY, then PG_WRITEABLE cannot be set by
 	 * another thread while the object is locked.  Thus, if PG_WRITEABLE
 	 * is clear, no page table entries need updating.
 	 */
 	VM_OBJECT_LOCK_ASSERT(m->object, MA_OWNED);
 	if ((m->oflags & VPO_BUSY) == 0 &&
 	    (m->flags & PG_WRITEABLE) == 0)
 		return;
 	vm_page_lock_queues();
 	TAILQ_FOREACH(tp, &m->md.tte_list, tte_link) {
 		if ((tp->tte_data & TD_PV) == 0)
 			continue;
 		data = atomic_clear_long(&tp->tte_data, TD_SW | TD_W);
 		if ((data & TD_W) != 0) {
 			vm_page_dirty(m);
 			tlb_page_demap(TTE_GET_PMAP(tp), TTE_GET_VA(tp));
 		}
 	}
 	vm_page_flag_clear(m, PG_WRITEABLE);
 	vm_page_unlock_queues();
 }
 
 int
 pmap_mincore(pmap_t pm, vm_offset_t addr, vm_paddr_t *locked_pa)
 {
 
 	/* TODO; */
 	return (0);
 }
 
 /*
  * Activate a user pmap.  The pmap must be activated before its address space
  * can be accessed in any way.
  */
 void
 pmap_activate(struct thread *td)
 {
 	struct vmspace *vm;
 	struct pmap *pm;
 	int context;
 
 	vm = td->td_proc->p_vmspace;
 	pm = vmspace_pmap(vm);
 
 	mtx_lock_spin(&sched_lock);
 
 	context = PCPU_GET(tlb_ctx);
 	if (context == PCPU_GET(tlb_ctx_max)) {
 		tlb_flush_user();
 		context = PCPU_GET(tlb_ctx_min);
 	}
 	PCPU_SET(tlb_ctx, context + 1);
 
 	pm->pm_context[curcpu] = context;
-	pm->pm_active |= PCPU_GET(cpumask);
+	CPU_OR(&pm->pm_active, PCPU_PTR(cpumask));
 	PCPU_SET(pmap, pm);
 
 	stxa(AA_DMMU_TSB, ASI_DMMU, pm->pm_tsb);
 	stxa(AA_IMMU_TSB, ASI_IMMU, pm->pm_tsb);
 	stxa(AA_DMMU_PCXR, ASI_DMMU, (ldxa(AA_DMMU_PCXR, ASI_DMMU) &
 	    TLB_CXR_PGSZ_MASK) | context);
 	flush(KERNBASE);
 
 	mtx_unlock_spin(&sched_lock);
 }
 
 void
 pmap_sync_icache(pmap_t pm, vm_offset_t va, vm_size_t sz)
 {
 
 }
 
 /*
  * Increase the starting virtual address of the given mapping if a
  * different alignment might result in more superpage mappings.
  */
 void
 pmap_align_superpage(vm_object_t object, vm_ooffset_t offset,
     vm_offset_t *addr, vm_size_t size)
 {
 
 }
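Note on the context-number handling in pmap_activate() above: the function hands out the next free per-CPU TLB context number and, once the maximum is reached, flushes user mappings and restarts from the minimum. The assembly in swtch.S follows the same policy. A minimal sketch in plain C, assuming a hypothetical `struct sketch_pcpu` in place of the real per-CPU data and a counter in place of the real tlb_flush_user() call:

```c
#include <assert.h>

/* Hypothetical stand-in for the per-CPU fields read via PCPU_GET(). */
struct sketch_pcpu {
	int	tlb_ctx;	/* next free context number */
	int	tlb_ctx_min;	/* first usable context number */
	int	tlb_ctx_max;	/* value at which the counter wraps */
	int	flushes;	/* counts simulated tlb_flush_user() calls */
};

/*
 * Return the context number to assign to the pmap being activated and
 * advance the per-CPU counter, wrapping (with a flush) at tlb_ctx_max.
 */
static int
sketch_ctx_alloc(struct sketch_pcpu *pc)
{
	int context;

	assert(pc->tlb_ctx_min < pc->tlb_ctx_max);
	context = pc->tlb_ctx;
	if (context == pc->tlb_ctx_max) {
		pc->flushes++;		/* stands in for tlb_flush_user() */
		context = pc->tlb_ctx_min;
	}
	pc->tlb_ctx = context + 1;
	return (context);
}
```

The sketch leaves out the locking and the MMU register writes; it only shows why a flush is needed before a context number can be reused.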
Index: head/sys/sparc64/sparc64/swtch.S
===================================================================
--- head/sys/sparc64/sparc64/swtch.S	(revision 222812)
+++ head/sys/sparc64/sparc64/swtch.S	(revision 222813)
@@ -1,288 +1,306 @@
 /*-
  * Copyright (c) 2001 Jake Burkholder.
  * All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
  * 1. Redistributions of source code must retain the above copyright
  *    notice, this list of conditions and the following disclaimer.
  * 2. Redistributions in binary form must reproduce the above copyright
  *    notice, this list of conditions and the following disclaimer in the
  *    documentation and/or other materials provided with the distribution.
  *
  * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
  * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
  * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
  * ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
  * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
  * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
  * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
  * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
  * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
  * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
  * SUCH DAMAGE.
  */
 
 #include <machine/asm.h>
 __FBSDID("$FreeBSD$");
 
 #include <machine/asmacros.h>
 #include <machine/asi.h>
 #include <machine/fsr.h>
 #include <machine/ktr.h>
 #include <machine/pcb.h>
 #include <machine/tstate.h>
 
 #include "assym.s"
 
 	.register	%g2, #ignore
 	.register	%g3, #ignore
 
 /*
  * void cpu_throw(struct thread *old, struct thread *new)
  */
 ENTRY(cpu_throw)
 	save	%sp, -CCFSZ, %sp
 	flushw
 	ba	%xcc, .Lsw1
 	 mov	%g0, %i2
 END(cpu_throw)
 
 /*
  * void cpu_switch(struct thread *old, struct thread *new, struct mtx *mtx)
  */
 ENTRY(cpu_switch)
 	save	%sp, -CCFSZ, %sp
 
 	/*
 	 * If the current thread was using floating point in the kernel, save
 	 * its context.  The userland floating point context has already been
 	 * saved in that case.
 	 */
 	rd	%fprs, %l2
 	andcc	%l2, FPRS_FEF, %g0
 	bz,a,pt	%xcc, 1f
 	 nop
 	call	savefpctx
 	 add	PCB_REG, PCB_KFP, %o0
 	ba,a	%xcc, 2f
 	 nop
 
 	/*
 	 * If the current thread was using floating point in userland, save
 	 * its context.
 	 */
 1:	sub	PCB_REG, TF_SIZEOF, %l2
 	ldx	[%l2 + TF_FPRS], %l3
 	andcc	%l3, FPRS_FEF, %g0
 	bz,a,pt	%xcc, 2f
 	 nop
 	call	savefpctx
 	 add	PCB_REG, PCB_UFP, %o0
 	andn	%l3, FPRS_FEF, %l3
 	stx	%l3, [%l2 + TF_FPRS]
 
 	ldx	[PCB_REG + PCB_FLAGS], %l3
 	or	%l3, PCB_FEF, %l3
 	stx	%l3, [PCB_REG + PCB_FLAGS]
 
 	/*
 	 * Flush the windows out to the stack and save the current frame
 	 * pointer and program counter.
 	 */
 2:	flushw
 	wrpr	%g0, 0, %cleanwin
 	stx	%fp, [PCB_REG + PCB_SP]
 	stx	%i7, [PCB_REG + PCB_PC]
 
 	/*
 	 * Load the new thread's frame pointer and program counter, and set
 	 * the current thread and pcb.
 	 */
 .Lsw1:
 #if KTR_COMPILE & KTR_PROC
 	CATR(KTR_PROC, "cpu_switch: new td=%p pc=%#lx fp=%#lx"
 	    , %g1, %g2, %g3, 8, 9, 10)
 	stx	%i1, [%g1 + KTR_PARM1]
 	ldx	[%i1 + TD_PCB], %g2
 	ldx	[%g2 + PCB_PC], %g3
 	stx	%g3, [%g1 + KTR_PARM2]
 	ldx	[%g2 + PCB_SP], %g3
 	stx	%g3, [%g1 + KTR_PARM3]
 10:
 #endif
 	ldx	[%i1 + TD_PCB], %l0
 
 	stx	%i1, [PCPU(CURTHREAD)]
 	stx	%l0, [PCPU(CURPCB)]
 
 	wrpr	%g0, PSTATE_NORMAL, %pstate
 	mov	%l0, PCB_REG
 	wrpr	%g0, PSTATE_ALT, %pstate
 	mov	%l0, PCB_REG
 	wrpr	%g0, PSTATE_KERNEL, %pstate
 
 	ldx	[PCB_REG + PCB_SP], %fp
 	ldx	[PCB_REG + PCB_PC], %i7
 	sub	%fp, CCFSZ, %sp
 
 	/*
 	 * Point to the pmaps of the new process, and of the last non-kernel
 	 * process to run.
 	 */
 	ldx	[%i1 + TD_PROC], %l1
 	ldx	[PCPU(PMAP)], %l2
 	ldx	[%l1 + P_VMSPACE], %i5
 	add	%i5, VM_PMAP, %l1
 
 #if KTR_COMPILE & KTR_PROC
 	CATR(KTR_PROC, "cpu_switch: new pmap=%p old pmap=%p"
 	    , %g1, %g2, %g3, 8, 9, 10)
 	stx	%l1, [%g1 + KTR_PARM1]
 	stx	%l2, [%g1 + KTR_PARM2]
 10:
 #endif
 
 	/*
 	 * If they are the same we are done.
 	 */
 	cmp	%l2, %l1
 	be,a,pn	%xcc, 7f
 	 nop
 
 	/*
 	 * If the new process is a kernel thread we can just leave the old
 	 * context active and avoid recycling its context number.
 	 */
 	SET(vmspace0, %i4, %i3)
 	cmp	%i5, %i3
 	be,a,pn	%xcc, 7f
 	 nop
 
 	/*
 	 * If there was no non-kernel pmap, don't try to deactivate it.
 	 */
 	brz,pn	%l2, 3f
-	 lduw	[PCPU(CPUMASK)], %l4
+	 lduw	[PCPU(CPUID)], %l3
 
 	/*
 	 * Mark the pmap of the last non-kernel vmspace to run as no longer
 	 * active on this CPU.
 	 */
-	lduw	[%l2 + PM_ACTIVE], %l3
-	andn	%l3, %l4, %l3
-	stw	%l3, [%l2 + PM_ACTIVE]
+	mov	_NCPUBITS, %l5
+	mov	%g0, %y
+	udiv	%l3, %l5, %l6
+	srl	%l6, 0, %l4
+	sllx	%l4, PTR_SHIFT, %l4
+	add	%l4, PM_ACTIVE, %l4
+	smul	%l6, %l5, %l5
+	sub	%l3, %l5, %l5
+	mov	1, %l6
+	sllx	%l6, %l5, %l5
+	ldx	[%l2 + %l4], %l6
+	andn	%l6, %l5, %l6
+	stx	%l6, [%l2 + %l4]
 
 	/*
 	 * Take away its context number.
 	 */
-	lduw	[PCPU(CPUID)], %l3
 	sllx	%l3, INT_SHIFT, %l3
 	add	%l2, PM_CONTEXT, %l4
 	mov	-1, %l5
 	stw	%l5, [%l3 + %l4]
 
 3:	cmp	%i2, %g0
 	be,pn	%xcc, 4f
 	 lduw	[PCPU(TLB_CTX_MAX)], %i4
 	stx	%i2, [%i0 + TD_LOCK]
 
 	/*
 	 * Find a new TLB context.  If we've run out we have to flush all
 	 * user mappings from the TLB and reset the context numbers.
 	 */
 4:	lduw	[PCPU(TLB_CTX)], %i3
 	cmp	%i3, %i4
 	bne,a,pt %xcc, 5f
 	 nop
 	SET(tlb_flush_user, %i5, %i4)
 	ldx	[%i4], %i5
 	call	%i5
 	 lduw	[PCPU(TLB_CTX_MIN)], %i3
 
 	/*
 	 * Advance next free context.
 	 */
 5:	add	%i3, 1, %i4
 	stw	%i4, [PCPU(TLB_CTX)]
 
 	/*
 	 * Set the new context number in the pmap.
 	 */
-	lduw	[PCPU(CPUID)], %i4
-	sllx	%i4, INT_SHIFT, %i4
+	lduw	[PCPU(CPUID)], %l3
+	sllx	%l3, INT_SHIFT, %i4
 	add	%l1, PM_CONTEXT, %i5
 	stw	%i3, [%i4 + %i5]
 
 	/*
 	 * Mark the pmap as active on this CPU.
 	 */
-	lduw	[%l1 + PM_ACTIVE], %i4
-	lduw	[PCPU(CPUMASK)], %i5
-	or	%i4, %i5, %i4
-	stw	%i4, [%l1 + PM_ACTIVE]
+	mov	_NCPUBITS, %l5
+	mov	%g0, %y
+	udiv	%l3, %l5, %l6
+	srl	%l6, 0, %l4
+	sllx	%l4, PTR_SHIFT, %l4
+	add	%l4, PM_ACTIVE, %l4
+	smul	%l6, %l5, %l5
+	sub	%l3, %l5, %l5
+	mov	1, %l6
+	sllx	%l6, %l5, %l5
+	ldx	[%l1 + %l4], %l6
+	or	%l6, %l5, %l6
+	stx	%l6, [%l1 + %l4]
 
 	/*
 	 * Make note of the change in pmap.
 	 */
 	stx	%l1, [PCPU(PMAP)]
 
 	/*
 	 * Fiddle the hardware bits.  Set the TSB registers and install the
 	 * new context number in the CPU.
 	 */
 	ldx	[%l1 + PM_TSB], %i4
 	mov	AA_DMMU_TSB, %i5
 	stxa	%i4, [%i5] ASI_DMMU
 	mov	AA_IMMU_TSB, %i5
 	stxa	%i4, [%i5] ASI_IMMU
 	setx	TLB_CXR_PGSZ_MASK, %i5, %i4
 	mov	AA_DMMU_PCXR, %i5
 	ldxa	[%i5] ASI_DMMU, %l1
 	and	%l1, %i4, %l1
 	or	%i3, %l1, %i3
 	sethi	%hi(KERNBASE), %i4
 	stxa	%i3, [%i5] ASI_DMMU
 	flush	%i4
 
 	/*
 	 * Done, return and load the new process's window from the stack.
 	 */
 
 6:	ret
 	 restore
 
 7:	cmp	%i2, %g0
 	be,a,pn	%xcc, 6b
 	 nop
 	stx	%i2, [%i0 + TD_LOCK]
 	ret
 	 restore
 END(cpu_switch)
 
 ENTRY(savectx)
 	save	%sp, -CCFSZ, %sp
 	flushw
 	call	savefpctx
 	 add	%i0, PCB_UFP, %o0
 	stx	%fp, [%i0 + PCB_SP]
 	stx	%i7, [%i0 + PCB_PC]
 	ret
 	 restore %g0, 0, %o0
 END(savectx)
 
 /*
  * void savefpctx(uint32_t *);
  */
 ENTRY(savefpctx)
 	wr	%g0, FPRS_FEF, %fprs
 	wr	%g0, ASI_BLK_S, %asi
 	stda	%f0, [%o0 + (0 * 64)] %asi
 	stda	%f16, [%o0 + (1 * 64)] %asi
 	stda	%f32, [%o0 + (2 * 64)] %asi
 	stda	%f48, [%o0 + (3 * 64)] %asi
 	membar	#Sync
 	retl
 	 wr	%g0, 0, %fprs
 END(savefpctx)
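Note on the pm_active updates in cpu_switch above: the old code loaded a single 32-bit mask with lduw and applied andn/or to it. The new code first has to find the word of the cpuset_t that holds this CPU's bit (udiv by _NCPUBITS), then the bit within that word (smul and sub to get the remainder, sllx to build the mask). A rough C equivalent, assuming hypothetical `sketch_` names and `unsigned long` words (the kernel type uses `long`):

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical stand-ins for CPU_SETSIZE, _NCPUBITS and _NCPUWORDS. */
#define	SKETCH_SETSIZE		128
#define	SKETCH_NCPUBITS		(sizeof(long) * CHAR_BIT)
#define	SKETCH_NCPUWORDS	\
	((SKETCH_SETSIZE + SKETCH_NCPUBITS - 1) / SKETCH_NCPUBITS)

struct sketch_cpuset {
	unsigned long	bits[SKETCH_NCPUWORDS];
};

/* Locate the word index and the single-bit mask for a CPU id. */
static void
sketch_locate(unsigned int cpuid, size_t *word, unsigned long *mask)
{

	assert(cpuid < SKETCH_SETSIZE);
	*word = cpuid / SKETCH_NCPUBITS;			/* udiv */
	*mask = 1UL << (cpuid - *word * SKETCH_NCPUBITS);	/* smul, sub, sllx */
}

/* Mark the set as no longer active on this CPU: ldx, andn, stx. */
static void
sketch_deactivate(struct sketch_cpuset *active, unsigned int cpuid)
{
	size_t word;
	unsigned long mask;

	sketch_locate(cpuid, &word, &mask);
	active->bits[word] &= ~mask;
}

/* Mark the set as active on this CPU: ldx, or, stx. */
static void
sketch_activate(struct sketch_cpuset *active, unsigned int cpuid)
{
	size_t word;
	unsigned long mask;

	sketch_locate(cpuid, &word, &mask);
	active->bits[word] |= mask;
}
```

In the assembly the word index is additionally scaled by PTR_SHIFT and offset by PM_ACTIVE to form a byte offset into the pmap; the array indexing above hides that step.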
Index: head/sys/sparc64/sparc64/tlb.c
===================================================================
--- head/sys/sparc64/sparc64/tlb.c	(revision 222812)
+++ head/sys/sparc64/sparc64/tlb.c	(revision 222813)
@@ -1,147 +1,147 @@
 /*-
  * Copyright (c) 2001 Jake Burkholder.
  * All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
  * 1. Redistributions of source code must retain the above copyright
  *    notice, this list of conditions and the following disclaimer.
  * 2. Redistributions in binary form must reproduce the above copyright
  *    notice, this list of conditions and the following disclaimer in the
  *    documentation and/or other materials provided with the distribution.
  *
  * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
  * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
  * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
  * ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
  * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
  * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
  * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
  * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
  * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
  * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
  * SUCH DAMAGE.
  */
 
 #include <sys/cdefs.h>
 __FBSDID("$FreeBSD$");
 
 #include "opt_pmap.h"
 
 #include <sys/param.h>
 #include <sys/systm.h>
 #include <sys/ktr.h>
 #include <sys/pcpu.h>
 #include <sys/lock.h>
 #include <sys/mutex.h>
 #include <sys/smp.h>
 #include <sys/sysctl.h>
 
 #include <vm/vm.h>
 #include <vm/pmap.h>
 
 #include <machine/pmap.h>
 #include <machine/smp.h>
 #include <machine/tlb.h>
 #include <machine/vmparam.h>
 
 PMAP_STATS_VAR(tlb_ncontext_demap);
 PMAP_STATS_VAR(tlb_npage_demap);
 PMAP_STATS_VAR(tlb_nrange_demap);
 
 tlb_flush_nonlocked_t *tlb_flush_nonlocked;
 tlb_flush_user_t *tlb_flush_user;
 
 /*
  * Some tlb operations must be atomic, so no interrupt or trap can be allowed
  * while they are in progress. Traps should not happen, but interrupts need to
  * be explicitly disabled. critical_enter() cannot be used here, since it only
  * disables soft interrupts.
  */
 
 void
 tlb_context_demap(struct pmap *pm)
 {
 	void *cookie;
 	register_t s;
 
 	/*
 	 * It is important that we are not interrupted or preempted while
 	 * doing the IPIs. The interrupted CPU may hold locks, and since
 	 * it will wait for the CPU that sent the IPI, this can lead
 	 * to a deadlock when an interrupt comes in on that CPU and its
 	 * handler tries to grab one of those locks. This will only happen for
 	 * spin locks, but these IPI types are delivered even if normal
 	 * interrupts are disabled, so the lock critical section will not
 	 * protect the target processor from entering the IPI handler with
 	 * the lock held.
 	 */
 	PMAP_STATS_INC(tlb_ncontext_demap);
 	cookie = ipi_tlb_context_demap(pm);
 	s = intr_disable();
-	if (pm->pm_active & PCPU_GET(cpumask)) {
+	if (CPU_OVERLAP(&pm->pm_active, PCPU_PTR(cpumask))) {
 		KASSERT(pm->pm_context[curcpu] != -1,
 		    ("tlb_context_demap: inactive pmap?"));
 		stxa(TLB_DEMAP_PRIMARY | TLB_DEMAP_CONTEXT, ASI_DMMU_DEMAP, 0);
 		stxa(TLB_DEMAP_PRIMARY | TLB_DEMAP_CONTEXT, ASI_IMMU_DEMAP, 0);
 		flush(KERNBASE);
 	}
 	intr_restore(s);
 	ipi_wait(cookie);
 }
 
 void
 tlb_page_demap(struct pmap *pm, vm_offset_t va)
 {
 	u_long flags;
 	void *cookie;
 	register_t s;
 
 	PMAP_STATS_INC(tlb_npage_demap);
 	cookie = ipi_tlb_page_demap(pm, va);
 	s = intr_disable();
-	if (pm->pm_active & PCPU_GET(cpumask)) {
+	if (CPU_OVERLAP(&pm->pm_active, PCPU_PTR(cpumask))) {
 		KASSERT(pm->pm_context[curcpu] != -1,
 		    ("tlb_page_demap: inactive pmap?"));
 		if (pm == kernel_pmap)
 			flags = TLB_DEMAP_NUCLEUS | TLB_DEMAP_PAGE;
 		else
 			flags = TLB_DEMAP_PRIMARY | TLB_DEMAP_PAGE;
 
 		stxa(TLB_DEMAP_VA(va) | flags, ASI_DMMU_DEMAP, 0);
 		stxa(TLB_DEMAP_VA(va) | flags, ASI_IMMU_DEMAP, 0);
 		flush(KERNBASE);
 	}
 	intr_restore(s);
 	ipi_wait(cookie);
 }
 
 void
 tlb_range_demap(struct pmap *pm, vm_offset_t start, vm_offset_t end)
 {
 	vm_offset_t va;
 	void *cookie;
 	u_long flags;
 	register_t s;
 
 	PMAP_STATS_INC(tlb_nrange_demap);
 	cookie = ipi_tlb_range_demap(pm, start, end);
 	s = intr_disable();
-	if (pm->pm_active & PCPU_GET(cpumask)) {
+	if (CPU_OVERLAP(&pm->pm_active, PCPU_PTR(cpumask))) {
 		KASSERT(pm->pm_context[curcpu] != -1,
 		    ("tlb_range_demap: inactive pmap?"));
 		if (pm == kernel_pmap)
 			flags = TLB_DEMAP_NUCLEUS | TLB_DEMAP_PAGE;
 		else
 			flags = TLB_DEMAP_PRIMARY | TLB_DEMAP_PAGE;
 
 		for (va = start; va < end; va += PAGE_SIZE) {
 			stxa(TLB_DEMAP_VA(va) | flags, ASI_DMMU_DEMAP, 0);
 			stxa(TLB_DEMAP_VA(va) | flags, ASI_IMMU_DEMAP, 0);
 			flush(KERNBASE);
 		}
 	}
 	intr_restore(s);
 	ipi_wait(cookie);
 }
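Note on the three demap functions above: each replaces the scalar test `pm->pm_active & PCPU_GET(cpumask)` with CPU_OVERLAP(), which has to compare the two sets word by word. The following is a sketch of that test, assuming hypothetical `sketch_` names and `unsigned long` words rather than the kernel's own definitions:

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for CPU_SETSIZE, _NCPUBITS and _NCPUWORDS. */
#define	SKETCH_SETSIZE		128
#define	SKETCH_NCPUBITS		(sizeof(long) * CHAR_BIT)
#define	SKETCH_NCPUWORDS	\
	((SKETCH_SETSIZE + SKETCH_NCPUBITS - 1) / SKETCH_NCPUBITS)

struct sketch_cpuset {
	unsigned long	bits[SKETCH_NCPUWORDS];
};

/* Add one CPU to the set (the role CPU_SET() plays in the kernel). */
static void
sketch_set(struct sketch_cpuset *p, unsigned int cpu)
{

	assert(cpu < SKETCH_SETSIZE);
	p->bits[cpu / SKETCH_NCPUBITS] |= 1UL << (cpu % SKETCH_NCPUBITS);
}

/* True if the two sets share at least one CPU, as CPU_OVERLAP() reports. */
static bool
sketch_overlap(const struct sketch_cpuset *p, const struct sketch_cpuset *c)
{
	size_t i;

	for (i = 0; i < SKETCH_NCPUWORDS; i++)
		if ((p->bits[i] & c->bits[i]) != 0)
			return (true);
	return (false);
}
```

Two CPUs whose ids differ by a multiple of the word size occupy the same bit position in different words, so comparing only one word would miss or misreport them; the loop over all words avoids that.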
Index: head/sys/sys/_cpuset.h
===================================================================
--- head/sys/sys/_cpuset.h	(nonexistent)
+++ head/sys/sys/_cpuset.h	(revision 222813)
@@ -0,0 +1,52 @@
+/*-
+ * Copyright (c) 2008,	Jeffrey Roberson <jeff@freebsd.org>
+ * All rights reserved.
+ *
+ * Copyright (c) 2008 Nokia Corporation
+ * All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ *    notice unmodified, this list of conditions, and the following
+ *    disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in the
+ *    documentation and/or other materials provided with the distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR
+ * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
+ * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
+ * IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT,
+ * INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT
+ * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF
+ * THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ *
+ * $FreeBSD$
+ */
+
+#ifndef _SYS__CPUSET_H_
+#define	_SYS__CPUSET_H_
+
+#ifdef _KERNEL
+#define	CPU_SETSIZE	MAXCPU
+#endif
+
+#define	CPU_MAXSIZE	128
+
+#ifndef	CPU_SETSIZE
+#define	CPU_SETSIZE	CPU_MAXSIZE
+#endif
+
+#define	_NCPUBITS	(sizeof(long) * NBBY)	/* bits per mask */
+#define	_NCPUWORDS	howmany(CPU_SETSIZE, _NCPUBITS)
+
+typedef	struct _cpuset {
+	long	__bits[howmany(CPU_SETSIZE, _NCPUBITS)];
+} cpuset_t;
+
+#endif /* !_SYS__CPUSET_H_ */

Property changes on: head/sys/sys/_cpuset.h
___________________________________________________________________
Added: svn:eol-style
## -0,0 +1 ##
+native
\ No newline at end of property
Added: svn:keywords
## -0,0 +1 ##
+FreeBSD=%H
\ No newline at end of property
Added: svn:mime-type
## -0,0 +1 ##
+text/plain
\ No newline at end of property
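Note on the sizing in the new _cpuset.h above: the `__bits` array holds howmany(CPU_SETSIZE, _NCPUBITS) words, that is, CPU_SETSIZE divided by the number of bits in a `long`, rounded up. With CPU_MAXSIZE at 128 and a 64-bit `long` this gives two words. A small sketch of the arithmetic, assuming the usual round-up definition of howmany() and a hypothetical helper name:

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Round-up division, assumed to match howmany() from <sys/param.h>. */
#define	SKETCH_HOWMANY(x, y)	(((x) + ((y) - 1)) / (y))

/* Number of long-sized words needed to hold setsize CPU bits. */
static size_t
sketch_ncpuwords(size_t setsize)
{

	assert(setsize > 0);
	return (SKETCH_HOWMANY(setsize, sizeof(long) * CHAR_BIT));
}
```

Because the result depends on both MAXCPU (in the kernel) and the width of `long`, code that walks a cpuset_t should loop up to _NCPUWORDS rather than assume a fixed word count.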
Index: head/sys/sys/_rmlock.h
===================================================================
--- head/sys/sys/_rmlock.h	(revision 222812)
+++ head/sys/sys/_rmlock.h	(revision 222813)
@@ -1,66 +1,66 @@
 /*-
  * Copyright (c) 2007 Stephan Uphoff <ups@FreeBSD.org>
  * All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
  * 1. Redistributions of source code must retain the above copyright
  *    notice, this list of conditions and the following disclaimer.
  * 2. Redistributions in binary form must reproduce the above copyright
  *    notice, this list of conditions and the following disclaimer in the
  *    documentation and/or other materials provided with the distribution.
  * 3. Neither the name of the author nor the names of any co-contributors
  *    may be used to endorse or promote products derived from this software
  *    without specific prior written permission.
  *
  * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
  * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
  * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
  * ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
  * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
  * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
  * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
  * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
  * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
  * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
  * SUCH DAMAGE.
  *
  * $FreeBSD$
  */
 
 #ifndef _SYS__RMLOCK_H_
 #define	_SYS__RMLOCK_H_
 
 /* 
  * XXXUPS remove as soon as we have per cpu variable
  * linker sets and  can define rm_queue in _rm_lock.h
 */
 #include <sys/pcpu.h>
 /*
  * Mostly reader/occasional writer lock.
  */
 
 LIST_HEAD(rmpriolist,rm_priotracker);
 
 struct rmlock {
 	struct lock_object lock_object; 
-	volatile cpumask_t rm_writecpus;
+	volatile cpuset_t rm_writecpus;
 	LIST_HEAD(,rm_priotracker) rm_activeReaders;
 	union {
 		struct mtx _rm_lock_mtx;
 		struct sx _rm_lock_sx;
 	} _rm_lock;
 };
 #define	rm_lock_mtx	_rm_lock._rm_lock_mtx
 #define	rm_lock_sx	_rm_lock._rm_lock_sx
 
 struct rm_priotracker {
 	struct rm_queue rmp_cpuQueue; /* Must be first */
 	struct rmlock *rmp_rmlock;
 	struct thread *rmp_thread;
 	int rmp_flags;
 	LIST_ENTRY(rm_priotracker) rmp_qentry;
 };
 
 #endif /* !_SYS__RMLOCK_H_ */
Index: head/sys/sys/_stdint.h
===================================================================
--- head/sys/sys/_stdint.h	(revision 222812)
+++ head/sys/sys/_stdint.h	(revision 222813)

Property changes on: head/sys/sys/_stdint.h
___________________________________________________________________
Added: svn:eol-style
## -0,0 +1 ##
+native
\ No newline at end of property
Added: svn:mime-type
## -0,0 +1 ##
+text/plain
\ No newline at end of property
Index: head/sys/sys/cpuset.h
===================================================================
--- head/sys/sys/cpuset.h	(revision 222812)
+++ head/sys/sys/cpuset.h	(revision 222813)
@@ -1,197 +1,228 @@
 /*-
  * Copyright (c) 2008,	Jeffrey Roberson <jeff@freebsd.org>
  * All rights reserved.
  *
  * Copyright (c) 2008 Nokia Corporation
  * All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
  * 1. Redistributions of source code must retain the above copyright
  *    notice unmodified, this list of conditions, and the following
  *    disclaimer.
  * 2. Redistributions in binary form must reproduce the above copyright
  *    notice, this list of conditions and the following disclaimer in the
  *    documentation and/or other materials provided with the distribution.
  *
  * THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR
  * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
  * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
  * IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT,
  * INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT
  * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
  * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
  * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
  * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF
  * THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
  *
  * $FreeBSD$
  */
 
 #ifndef _SYS_CPUSET_H_
 #define	_SYS_CPUSET_H_
 
-#ifdef _KERNEL
-#define	CPU_SETSIZE	MAXCPU
-#endif
+#include <sys/_cpuset.h>
 
-#define	CPU_MAXSIZE	128
+#define	CPUSETBUFSIZ	((2 + sizeof(long) * 2) * _NCPUWORDS)
 
-#ifndef	CPU_SETSIZE
-#define	CPU_SETSIZE	CPU_MAXSIZE
-#endif
-
-#define	_NCPUBITS	(sizeof(long) * NBBY)	/* bits per mask */
-#define	_NCPUWORDS	howmany(CPU_SETSIZE, _NCPUBITS)
-
-typedef	struct _cpuset {
-	long	__bits[howmany(CPU_SETSIZE, _NCPUBITS)];
-} cpuset_t;
-
 #define	__cpuset_mask(n)	((long)1 << ((n) % _NCPUBITS))
 #define	CPU_CLR(n, p)	((p)->__bits[(n)/_NCPUBITS] &= ~__cpuset_mask(n))
 #define	CPU_COPY(f, t)	(void)(*(t) = *(f))
 #define	CPU_ISSET(n, p)	(((p)->__bits[(n)/_NCPUBITS] & __cpuset_mask(n)) != 0)
 #define	CPU_SET(n, p)	((p)->__bits[(n)/_NCPUBITS] |= __cpuset_mask(n))
 #define	CPU_ZERO(p) do {				\
 	__size_t __i;					\
 	for (__i = 0; __i < _NCPUWORDS; __i++)		\
 		(p)->__bits[__i] = 0;			\
 } while (0)
 
 #define	CPU_FILL(p) do {				\
 	__size_t __i;					\
 	for (__i = 0; __i < _NCPUWORDS; __i++)		\
 		(p)->__bits[__i] = -1;			\
 } while (0)
 
+#define	CPU_SETOF(n, p) do {					\
+	CPU_ZERO(p);						\
+	((p)->__bits[(n)/_NCPUBITS] = __cpuset_mask(n));	\
+} while (0)
+
 /* Is p empty. */
 #define	CPU_EMPTY(p) __extension__ ({			\
 	__size_t __i;					\
 	for (__i = 0; __i < _NCPUWORDS; __i++)		\
 		if ((p)->__bits[__i])			\
 			break;				\
 	__i == _NCPUWORDS;				\
 })
 
+/* Is p full set. */
+#define	CPU_ISFULLSET(p) __extension__ ({		\
+	__size_t __i;					\
+	for (__i = 0; __i < _NCPUWORDS; __i++)		\
+		if ((p)->__bits[__i] != (long)-1)	\
+			break;				\
+	__i == _NCPUWORDS;				\
+})
+
 /* Is c a subset of p. */
 #define	CPU_SUBSET(p, c) __extension__ ({		\
 	__size_t __i;					\
 	for (__i = 0; __i < _NCPUWORDS; __i++)		\
 		if (((c)->__bits[__i] &			\
 		    (p)->__bits[__i]) !=		\
 		    (c)->__bits[__i])			\
 			break;				\
 	__i == _NCPUWORDS;				\
 })
 
 /* Are there any common bits between p & c? */
 #define	CPU_OVERLAP(p, c) __extension__ ({		\
 	__size_t __i;					\
 	for (__i = 0; __i < _NCPUWORDS; __i++)		\
 		if (((c)->__bits[__i] &			\
 		    (p)->__bits[__i]) != 0)		\
 			break;				\
 	__i != _NCPUWORDS;				\
 })
 
 /* Compare two sets, returns 0 if equal 1 otherwise. */
 #define	CPU_CMP(p, c) __extension__ ({			\
 	__size_t __i;					\
 	for (__i = 0; __i < _NCPUWORDS; __i++)		\
 		if (((c)->__bits[__i] !=		\
 		    (p)->__bits[__i]))			\
 			break;				\
 	__i != _NCPUWORDS;				\
 })
 
 #define	CPU_OR(d, s) do {				\
 	__size_t __i;					\
 	for (__i = 0; __i < _NCPUWORDS; __i++)		\
 		(d)->__bits[__i] |= (s)->__bits[__i];	\
 } while (0)
 
 #define	CPU_AND(d, s) do {				\
 	__size_t __i;					\
 	for (__i = 0; __i < _NCPUWORDS; __i++)		\
 		(d)->__bits[__i] &= (s)->__bits[__i];	\
 } while (0)
 
 #define	CPU_NAND(d, s) do {				\
 	__size_t __i;					\
 	for (__i = 0; __i < _NCPUWORDS; __i++)		\
 		(d)->__bits[__i] &= ~(s)->__bits[__i];	\
 } while (0)
 
+#define	CPU_CLR_ATOMIC(n, p)						\
+	atomic_clear_long(&(p)->__bits[(n)/_NCPUBITS], __cpuset_mask(n))
+
+#define	CPU_SET_ATOMIC(n, p)						\
+	atomic_set_long(&(p)->__bits[(n)/_NCPUBITS], __cpuset_mask(n))
+
+#define	CPU_OR_ATOMIC(d, s) do {			\
+	__size_t __i;					\
+	for (__i = 0; __i < _NCPUWORDS; __i++)		\
+		atomic_set_long(&(d)->__bits[__i],	\
+		    (s)->__bits[__i]);			\
+} while (0)
+
+#define	CPU_NAND_ATOMIC(d, s) do {			\
+	__size_t __i;					\
+	for (__i = 0; __i < _NCPUWORDS; __i++)		\
+		atomic_clear_long(&(d)->__bits[__i],	\
+		    (s)->__bits[__i]);			\
+} while (0)
+
+#define	CPU_COPY_STORE_REL(f, t) do {				\
+	__size_t __i;						\
+	for (__i = 0; __i < _NCPUWORDS; __i++)			\
+		atomic_store_rel_long(&(t)->__bits[__i],	\
+		    (f)->__bits[__i]);				\
+} while (0)
+
 /*
  * Valid cpulevel_t values.
  */
 #define	CPU_LEVEL_ROOT		1	/* All system cpus. */
 #define	CPU_LEVEL_CPUSET	2	/* Available cpus for which. */
 #define	CPU_LEVEL_WHICH		3	/* Actual mask/id for which. */
 
 /*
  * Valid cpuwhich_t values.
  */
 #define	CPU_WHICH_TID		1	/* Specifies a thread id. */
 #define	CPU_WHICH_PID		2	/* Specifies a process id. */
 #define	CPU_WHICH_CPUSET	3	/* Specifies a set id. */
 #define	CPU_WHICH_IRQ		4	/* Specifies an irq #. */
 #define	CPU_WHICH_JAIL		5	/* Specifies a jail id. */
 
 /*
  * Reserved cpuset identifiers.
  */
 #define	CPUSET_INVALID	-1
 #define	CPUSET_DEFAULT	0
 
 #ifdef _KERNEL
 LIST_HEAD(setlist, cpuset);
 
 /*
  * cpusets encapsulate cpu binding information for one or more threads.
  *
  * 	a - Accessed with atomics.
  *	s - Set at creation, never modified.  Only a ref required to read.
  *	c - Locked internally by a cpuset lock.
  *
  * The bitmask is only modified while holding the cpuset lock.  It may be
  * read while only a reference is held but the consumer must be prepared
  * to deal with inconsistent results.
  */
 struct cpuset {
 	cpuset_t		cs_mask;	/* bitmask of valid cpus. */
 	volatile u_int		cs_ref;		/* (a) Reference count. */
 	int			cs_flags;	/* (s) Flags from below. */
 	cpusetid_t		cs_id;		/* (s) Id or INVALID. */
 	struct cpuset		*cs_parent;	/* (s) Pointer to our parent. */
 	LIST_ENTRY(cpuset)	cs_link;	/* (c) All identified sets. */
 	LIST_ENTRY(cpuset)	cs_siblings;	/* (c) Sibling set link. */
 	struct setlist		cs_children;	/* (c) List of children. */
 };
 
 #define CPU_SET_ROOT    0x0001  /* Set is a root set. */
 #define CPU_SET_RDONLY  0x0002  /* No modification allowed. */
 
 extern cpuset_t *cpuset_root;
 struct prison;
 struct proc;
 
 struct cpuset *cpuset_thread0(void);
 struct cpuset *cpuset_ref(struct cpuset *);
 void	cpuset_rel(struct cpuset *);
 int	cpuset_setthread(lwpid_t id, cpuset_t *);
 int	cpuset_create_root(struct prison *, struct cpuset **);
 int	cpuset_setproc_update_set(struct proc *, struct cpuset *);
+int	cpusetobj_ffs(const cpuset_t *);
+char	*cpusetobj_strprint(char *, const cpuset_t *);
+int	cpusetobj_strscan(cpuset_t *, const char *);
 
 #else
 __BEGIN_DECLS
 int	cpuset(cpusetid_t *);
 int	cpuset_setid(cpuwhich_t, id_t, cpusetid_t);
 int	cpuset_getid(cpulevel_t, cpuwhich_t, id_t, cpusetid_t *);
 int	cpuset_getaffinity(cpulevel_t, cpuwhich_t, id_t, size_t, cpuset_t *);
 int	cpuset_setaffinity(cpulevel_t, cpuwhich_t, id_t, size_t, const cpuset_t *);
 __END_DECLS
 #endif
 #endif /* !_SYS_CPUSET_H_ */
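Note on two of the macros added to cpuset.h above: CPU_SETOF() zeroes the set and then sets exactly one CPU, and CPU_ISFULLSET() reports whether every bit in every word is set. The sketch below expresses them as functions, assuming hypothetical `sketch_` names and `unsigned long` words in place of the kernel definitions:

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for CPU_SETSIZE, _NCPUBITS and _NCPUWORDS. */
#define	SKETCH_SETSIZE		128
#define	SKETCH_NCPUBITS		(sizeof(long) * CHAR_BIT)
#define	SKETCH_NCPUWORDS	\
	((SKETCH_SETSIZE + SKETCH_NCPUBITS - 1) / SKETCH_NCPUBITS)

struct sketch_cpuset {
	unsigned long	bits[SKETCH_NCPUWORDS];
};

/* Set every bit in every word, as CPU_FILL() does. */
static void
sketch_fill(struct sketch_cpuset *p)
{
	size_t i;

	for (i = 0; i < SKETCH_NCPUWORDS; i++)
		p->bits[i] = ~0UL;
}

/* Make the set contain exactly one CPU, as CPU_SETOF() does. */
static void
sketch_setof(struct sketch_cpuset *p, unsigned int cpu)
{

	assert(cpu < SKETCH_SETSIZE);
	memset(p, 0, sizeof(*p));			/* CPU_ZERO() */
	p->bits[cpu / SKETCH_NCPUBITS] = 1UL << (cpu % SKETCH_NCPUBITS);
}

/* True only if all bits of all words are set, as CPU_ISFULLSET() tests. */
static bool
sketch_isfullset(const struct sketch_cpuset *p)
{
	size_t i;

	for (i = 0; i < SKETCH_NCPUWORDS; i++)
		if (p->bits[i] != ~0UL)
			return (false);
	return (true);
}
```

Note that the full-set test covers all CPU_SETSIZE bits, not just the CPUs present in the machine, so it is only true for a set produced by something like CPU_FILL().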
Index: head/sys/sys/ktr.h
===================================================================
--- head/sys/sys/ktr.h	(revision 222812)
+++ head/sys/sys/ktr.h	(revision 222813)
@@ -1,269 +1,272 @@
 /*-
  * Copyright (c) 1996 Berkeley Software Design, Inc. All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
  * 1. Redistributions of source code must retain the above copyright
  *    notice, this list of conditions and the following disclaimer.
  * 2. Redistributions in binary form must reproduce the above copyright
  *    notice, this list of conditions and the following disclaimer in the
  *    documentation and/or other materials provided with the distribution.
  * 3. Berkeley Software Design Inc's name may not be used to endorse or
  *    promote products derived from this software without specific prior
  *    written permission.
  *
  * THIS SOFTWARE IS PROVIDED BY BERKELEY SOFTWARE DESIGN INC ``AS IS'' AND
  * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
  * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
  * ARE DISCLAIMED.  IN NO EVENT SHALL BERKELEY SOFTWARE DESIGN INC BE LIABLE
  * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
  * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
  * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
  * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
  * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
  * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
  * SUCH DAMAGE.
  *
  *	from BSDI $Id: ktr.h,v 1.10.2.7 2000/03/16 21:44:42 cp Exp $
  * $FreeBSD$
  */
 
 /*
  *	Wraparound kernel trace buffer support.
  */
 
 #ifndef _SYS_KTR_H_
 #define _SYS_KTR_H_
 
 /*
  * Trace classes
  *
  * Two of the trace classes (KTR_DEV and KTR_SUBSYS) are special in that
  * they are really placeholders so that indvidual drivers and subsystems
  * can map their internal tracing to the general class when they wish to
  * have tracing enabled and map it to 0 when they don't.
  */
 #define	KTR_GEN		0x00000001		/* General (TR) */
 #define	KTR_NET		0x00000002		/* Network */
 #define	KTR_DEV		0x00000004		/* Device driver */
 #define	KTR_LOCK	0x00000008		/* MP locking */
 #define	KTR_SMP		0x00000010		/* MP general */
 #define	KTR_SUBSYS	0x00000020		/* Subsystem. */
 #define	KTR_PMAP	0x00000040		/* Pmap tracing */
 #define	KTR_MALLOC	0x00000080		/* Malloc tracing */
 #define	KTR_TRAP	0x00000100		/* Trap processing */
 #define	KTR_INTR	0x00000200		/* Interrupt tracing */
 #define	KTR_SIG		0x00000400		/* Signal processing */
 #define	KTR_SPARE2	0x00000800		/* XXX Used by cxgb */
 #define	KTR_PROC	0x00001000		/* Process scheduling */
 #define	KTR_SYSC	0x00002000		/* System call */
 #define	KTR_INIT	0x00004000		/* System initialization */
 #define	KTR_SPARE3	0x00008000		/* XXX Used by cxgb */
 #define	KTR_SPARE4	0x00010000		/* XXX Used by cxgb */
 #define	KTR_EVH		0x00020000		/* Eventhandler */
 #define	KTR_VFS		0x00040000		/* VFS events */
 #define	KTR_VOP		0x00080000		/* Auto-generated vop events */
 #define	KTR_VM		0x00100000		/* The virtual memory system */
 #define	KTR_INET	0x00200000		/* IPv4 stack */
 #define	KTR_RUNQ	0x00400000		/* Run queue */
 #define	KTR_CONTENTION	0x00800000		/* Lock contention */
 #define	KTR_UMA		0x01000000		/* UMA slab allocator */
 #define	KTR_CALLOUT	0x02000000		/* Callouts and timeouts */
 #define	KTR_GEOM	0x04000000		/* GEOM I/O events */
 #define	KTR_BUSDMA	0x08000000		/* busdma(9) events */
 #define	KTR_INET6	0x10000000		/* IPv6 stack */
 #define	KTR_SCHED	0x20000000		/* Machine parsed sched info. */
 #define	KTR_BUF		0x40000000		/* Buffer cache */
 #define	KTR_ALL		0x7fffffff
 
 /* Trace classes to compile in */
 #ifdef KTR
 #ifndef KTR_COMPILE
 #define	KTR_COMPILE	(KTR_ALL)
 #endif
 #else	/* !KTR */
 #undef KTR_COMPILE
 #define KTR_COMPILE 0
 #endif	/* KTR */
 
 /*
  * Version number for ktr_entry struct.  Increment this when you break binary
  * compatibility.
  */
 #define	KTR_VERSION	2
 
 #define	KTR_PARMS	6
 
 #ifndef LOCORE
 
+#include <sys/param.h>
+#include <sys/_cpuset.h>
+
 struct ktr_entry {
 	u_int64_t ktr_timestamp;
 	int	ktr_cpu;
 	int	ktr_line;
 	const	char *ktr_file;
 	const	char *ktr_desc;
 	struct	thread *ktr_thread;
 	u_long	ktr_parms[KTR_PARMS];
 };
 
-extern int ktr_cpumask;
+extern cpuset_t ktr_cpumask;
 extern int ktr_mask;
 extern int ktr_entries;
 extern int ktr_verbose;
 
 extern volatile int ktr_idx;
 extern struct ktr_entry ktr_buf[];
 
 #ifdef KTR
 
 void	ktr_tracepoint(u_int mask, const char *file, int line,
 	    const char *format, u_long arg1, u_long arg2, u_long arg3,
 	    u_long arg4, u_long arg5, u_long arg6);
 
 #define CTR6(m, format, p1, p2, p3, p4, p5, p6) do {			\
 	if (KTR_COMPILE & (m))						\
 		ktr_tracepoint((m), __FILE__, __LINE__, format,		\
 		    (u_long)(p1), (u_long)(p2), (u_long)(p3),		\
 		    (u_long)(p4), (u_long)(p5), (u_long)(p6));		\
 	} while(0)
 #define CTR0(m, format)			CTR6(m, format, 0, 0, 0, 0, 0, 0)
 #define CTR1(m, format, p1)		CTR6(m, format, p1, 0, 0, 0, 0, 0)
 #define	CTR2(m, format, p1, p2)		CTR6(m, format, p1, p2, 0, 0, 0, 0)
 #define	CTR3(m, format, p1, p2, p3)	CTR6(m, format, p1, p2, p3, 0, 0, 0)
 #define	CTR4(m, format, p1, p2, p3, p4)	CTR6(m, format, p1, p2, p3, p4, 0, 0)
 #define	CTR5(m, format, p1, p2, p3, p4, p5)	CTR6(m, format, p1, p2, p3, p4, p5, 0)
 #else	/* KTR */
 #define	CTR0(m, d)			(void)0
 #define	CTR1(m, d, p1)			(void)0
 #define	CTR2(m, d, p1, p2)		(void)0
 #define	CTR3(m, d, p1, p2, p3)		(void)0
 #define	CTR4(m, d, p1, p2, p3, p4)	(void)0
 #define	CTR5(m, d, p1, p2, p3, p4, p5)	(void)0
 #define	CTR6(m, d, p1, p2, p3, p4, p5, p6)	(void)0
 #endif	/* KTR */
 
 #define	TR0(d)				CTR0(KTR_GEN, d)
 #define	TR1(d, p1)			CTR1(KTR_GEN, d, p1)
 #define	TR2(d, p1, p2)			CTR2(KTR_GEN, d, p1, p2)
 #define	TR3(d, p1, p2, p3)		CTR3(KTR_GEN, d, p1, p2, p3)
 #define	TR4(d, p1, p2, p3, p4)		CTR4(KTR_GEN, d, p1, p2, p3, p4)
 #define	TR5(d, p1, p2, p3, p4, p5)	CTR5(KTR_GEN, d, p1, p2, p3, p4, p5)
 #define	TR6(d, p1, p2, p3, p4, p5, p6)	CTR6(KTR_GEN, d, p1, p2, p3, p4, p5, p6)
 
 /*
  * The event macros implement KTR graphic plotting facilities provided
  * by src/tools/sched/schedgraph.py.  Three generic types of events are
  * supported: states, counters, and points.
  *
  * m is the ktr class for ktr_mask.
  * ident is the string identifier that owns the event (ie: "thread 10001")
  * etype is the type of event to plot (state, counter, point)
  * edat is the event specific data (state name, counter value, point name)
  * up to four attributes may be supplied as a name, value pair of arguments.
  *
  * etype and attribute names must be string constants.  This minimizes the
  * number of ktr slots required by construction the final format strings
  * at compile time.  Both must also include a colon and format specifier
  * (ie. "prio:%d", prio).  It is recommended that string arguments be
  * contained within escaped quotes if they may contain ',' or ':' characters.
  *
  * The special attribute (KTR_ATTR_LINKED, ident) creates a reference to another
  * id on the graph for easy traversal of related graph elements.
  */
 
 #define	KTR_ATTR_LINKED	"linkedto:\"%s\""
 #define	KTR_EFMT(egroup, ident, etype)					\
 	    "KTRGRAPH group:\"" egroup "\", id:\"%s\", " etype ", attributes: "
 
 #define	KTR_EVENT0(m, egroup, ident, etype, edat)			\
 	CTR2(m,	KTR_EFMT(egroup, ident, etype) "none", ident, edat)
 #define	KTR_EVENT1(m, egroup, ident, etype, edat, a0, v0)		\
 	CTR3(m, KTR_EFMT(egroup, ident, etype) a0, ident, edat, (v0))
 #define	KTR_EVENT2(m, egroup, ident, etype, edat, a0, v0, a1, v1)	\
 	CTR4(m, KTR_EFMT(egroup, ident, etype) a0 ", " a1,		\
 	    ident, edat, (v0), (v1))
 #define	KTR_EVENT3(m, egroup, ident, etype, edat, a0, v0, a1, v1, a2, v2)\
 	CTR5(m,KTR_EFMT(egroup, ident, etype) a0 ", " a1 ", " a2,	\
 	    ident, edat, (v0), (v1), (v2))
 #define	KTR_EVENT4(m, egroup, ident, etype, edat,			\
 	    a0, v0, a1, v1, a2, v2, a3, v3)				\
 	CTR6(m,KTR_EFMT(egroup, ident, etype) a0 ", " a1 ", " a2 ", " a3,\
 	     ident, edat, (v0), (v1), (v2), (v3))
 
 /*
  * State functions graph state changes on an ident.
  */
 #define KTR_STATE0(m, egroup, ident, state)				\
 	KTR_EVENT0(m, egroup, ident, "state:\"%s\"", state)
 #define KTR_STATE1(m, egroup, ident, state, a0, v0)			\
 	KTR_EVENT1(m, egroup, ident, "state:\"%s\"", state, a0, (v0))
 #define KTR_STATE2(m, egroup, ident, state, a0, v0, a1, v1)		\
 	KTR_EVENT2(m, egroup, ident, "state:\"%s\"", state, a0, (v0), a1, (v1))
 #define KTR_STATE3(m, egroup, ident, state, a0, v0, a1, v1, a2, v2)	\
 	KTR_EVENT3(m, egroup, ident, "state:\"%s\"",			\
 	    state, a0, (v0), a1, (v1), a2, (v2))
 #define KTR_STATE4(m, egroup, ident, state, a0, v0, a1, v1, a2, v2, a3, v3)\
 	KTR_EVENT4(m, egroup, ident, "state:\"%s\"",			\
 	    state, a0, (v0), a1, (v1), a2, (v2), a3, (v3))
 
 /*
  * Counter functions graph counter values.  The counter id
  * must not be intermixed with a state id. 
  */
 #define	KTR_COUNTER0(m, egroup, ident, counter)				\
 	KTR_EVENT0(m, egroup, ident, "counter:%d", counter)
 #define	KTR_COUNTER1(m, egroup, ident, edat, a0, v0)			\
 	KTR_EVENT1(m, egroup, ident, "counter:%d", counter, a0, (v0))
 #define	KTR_COUNTER2(m, egroup, ident, counter, a0, v0, a1, v1)		\
 	KTR_EVENT2(m, egroup, ident, "counter:%d", counter, a0, (v0), a1, (v1))
 #define	KTR_COUNTER3(m, egroup, ident, counter, a0, v0, a1, v1, a2, v2)	\
 	KTR_EVENT3(m, egroup, ident, "counter:%d",			\
 	    counter, a0, (v0), a1, (v1), a2, (v2))
 #define	KTR_COUNTER4(m, egroup, ident, counter, a0, v0, a1, v1, a2, v2, a3, v3)\
 	KTR_EVENT4(m, egroup, ident, "counter:%d",			\
 	    counter, a0, (v0), a1, (v1), a2, (v2), a3, (v3))
 
 /*
  * Point functions plot points of interest on counter or state graphs.
  */
 #define	KTR_POINT0(m, egroup, ident, point)				\
 	KTR_EVENT0(m, egroup, ident, "point:\"%s\"", point)
 #define	KTR_POINT1(m, egroup, ident, point, a0, v0)			\
 	KTR_EVENT1(m, egroup, ident, "point:\"%s\"", point, a0, (v0))
 #define	KTR_POINT2(m, egroup, ident, point, a0, v0, a1, v1)		\
 	KTR_EVENT2(m, egroup, ident, "point:\"%s\"", point, a0, (v0), a1, (v1))
 #define	KTR_POINT3(m, egroup, ident, point, a0, v0, a1, v1, a2, v2)	\
 	KTR_EVENT3(m, egroup, ident, "point:\"%s\"", point,		\
 	    a0, (v0), a1, (v1), a2, (v2))
 #define	KTR_POINT4(m, egroup, ident, point, a0, v0, a1, v1, a2, v2, a3, v3)\
 	KTR_EVENT4(m, egroup, ident, "point:\"%s\"",			\
 	    point, a0, (v0), a1, (v1), a2, (v2), a3, (v3))
 
 /*
  * Trace initialization events, similar to CTR with KTR_INIT, but
  * completely ifdef'ed out if KTR_INIT isn't in KTR_COMPILE (to
  * save string space, the compiler doesn't optimize out strings
  * for the conditional ones above).
  */
 #if (KTR_COMPILE & KTR_INIT) != 0
 #define	ITR0(d)				CTR0(KTR_INIT, d)
 #define	ITR1(d, p1)			CTR1(KTR_INIT, d, p1)
 #define	ITR2(d, p1, p2)			CTR2(KTR_INIT, d, p1, p2)
 #define	ITR3(d, p1, p2, p3)		CTR3(KTR_INIT, d, p1, p2, p3)
 #define	ITR4(d, p1, p2, p3, p4)		CTR4(KTR_INIT, d, p1, p2, p3, p4)
 #define	ITR5(d, p1, p2, p3, p4, p5)	CTR5(KTR_INIT, d, p1, p2, p3, p4, p5)
 #define	ITR6(d, p1, p2, p3, p4, p5, p6)	CTR6(KTR_INIT, d, p1, p2, p3, p4, p5, p6)
 #else
 #define	ITR0(d)
 #define	ITR1(d, p1)
 #define	ITR2(d, p1, p2)
 #define	ITR3(d, p1, p2, p3)
 #define	ITR4(d, p1, p2, p3, p4)
 #define	ITR5(d, p1, p2, p3, p4, p5)
 #define	ITR6(d, p1, p2, p3, p4, p5, p6)
 #endif
 
 #endif /* !LOCORE */
 
 #endif /* !_SYS_KTR_H_ */
Index: head/sys/sys/pcpu.h
===================================================================
--- head/sys/sys/pcpu.h	(revision 222812)
+++ head/sys/sys/pcpu.h	(revision 222813)
@@ -1,235 +1,246 @@
 /*-
  * Copyright (c) 2001 Wind River Systems, Inc.
  * All rights reserved.
  * Written by: John Baldwin <jhb@FreeBSD.org>
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
  * 1. Redistributions of source code must retain the above copyright
  *    notice, this list of conditions and the following disclaimer.
  * 2. Redistributions in binary form must reproduce the above copyright
  *    notice, this list of conditions and the following disclaimer in the
  *    documentation and/or other materials provided with the distribution.
  * 4. Neither the name of the author nor the names of any co-contributors
  *    may be used to endorse or promote products derived from this software
  *    without specific prior written permission.
  *
  * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
  * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
  * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
  * ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
  * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
  * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
  * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
  * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
  * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
  * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
  * SUCH DAMAGE.
  *
  * $FreeBSD$
  */
 
 #ifndef _SYS_PCPU_H_
 #define	_SYS_PCPU_H_
 
 #ifdef LOCORE
 #error "no assembler-serviceable parts inside"
 #endif
 
+#include <sys/_cpuset.h>
 #include <sys/queue.h>
 #include <sys/vmmeter.h>
 #include <sys/resource.h>
 #include <machine/pcpu.h>
 
 #define	DPCPU_SETNAME		"set_pcpu"
 #define	DPCPU_SYMPREFIX		"pcpu_entry_"
 
 #ifdef _KERNEL
 
 /*
  * Define a set for pcpu data.
  */
 extern uintptr_t *__start_set_pcpu;
 __GLOBL(__start_set_pcpu);
 extern uintptr_t *__stop_set_pcpu;
 __GLOBL(__stop_set_pcpu);
 
 /*
  * Array of dynamic pcpu base offsets.  Indexed by id.
  */
 extern uintptr_t dpcpu_off[];
 
 /*
  * Convenience defines.
  */
 #define	DPCPU_START		((uintptr_t)&__start_set_pcpu)
 #define	DPCPU_STOP		((uintptr_t)&__stop_set_pcpu)
 #define	DPCPU_BYTES		(DPCPU_STOP - DPCPU_START)
 #define	DPCPU_MODMIN		2048
 #define	DPCPU_SIZE		roundup2(DPCPU_BYTES, PAGE_SIZE)
 #define	DPCPU_MODSIZE		(DPCPU_SIZE - (DPCPU_BYTES - DPCPU_MODMIN))
 
 /*
  * Declaration and definition.
  */
 #define	DPCPU_NAME(n)		pcpu_entry_##n
 #define	DPCPU_DECLARE(t, n)	extern t DPCPU_NAME(n)
 #define	DPCPU_DEFINE(t, n)	t DPCPU_NAME(n) __section(DPCPU_SETNAME) __used
 
 /*
  * Accessors with a given base.
  */
 #define	_DPCPU_PTR(b, n)						\
     (__typeof(DPCPU_NAME(n))*)((b) + (uintptr_t)&DPCPU_NAME(n))
 #define	_DPCPU_GET(b, n)	(*_DPCPU_PTR(b, n))
 #define	_DPCPU_SET(b, n, v)	(*_DPCPU_PTR(b, n) = v)
 
 /*
  * Accessors for the current cpu.
  */
 #define	DPCPU_PTR(n)		_DPCPU_PTR(PCPU_GET(dynamic), n)
 #define	DPCPU_GET(n)		(*DPCPU_PTR(n))
 #define	DPCPU_SET(n, v)		(*DPCPU_PTR(n) = v)
 
 /*
  * Accessors for remote cpus.
  */
 #define	DPCPU_ID_PTR(i, n)	_DPCPU_PTR(dpcpu_off[(i)], n)
 #define	DPCPU_ID_GET(i, n)	(*DPCPU_ID_PTR(i, n))
 #define	DPCPU_ID_SET(i, n, v)	(*DPCPU_ID_PTR(i, n) = v)
 
 /*
  * Utility macros.
  */
 #define	DPCPU_SUM(n) __extension__					\
 ({									\
 	u_int _i;							\
 	__typeof(*DPCPU_PTR(n)) sum;					\
 									\
 	sum = 0;							\
 	CPU_FOREACH(_i) {						\
 		sum += *DPCPU_ID_PTR(_i, n);				\
 	}								\
 	sum;								\
 })
 
 #define	DPCPU_VARSUM(n, var) __extension__				\
 ({									\
 	u_int _i;							\
 	__typeof((DPCPU_PTR(n))->var) sum;				\
 									\
 	sum = 0;							\
 	CPU_FOREACH(_i) {						\
 		sum += (DPCPU_ID_PTR(_i, n))->var;			\
 	}								\
 	sum;								\
 })
 
 #define	DPCPU_ZERO(n) do {						\
 	u_int _i;							\
 									\
 	CPU_FOREACH(_i) {						\
 		bzero(DPCPU_ID_PTR(_i, n), sizeof(*DPCPU_PTR(n)));	\
 	}								\
 } while(0)
 
 #endif /* _KERNEL */
 
 /* 
  * XXXUPS remove as soon as we have per cpu variable
  * linker sets and can define rm_queue in _rm_lock.h
  */
 struct rm_queue {
 	struct rm_queue* volatile rmq_next;
 	struct rm_queue* volatile rmq_prev;
 };
 
 #define	PCPU_NAME_LEN (sizeof("CPU ") + sizeof(__XSTRING(MAXCPU) + 1))
 
 /*
  * This structure maps out the global data that needs to be kept on a
  * per-cpu basis.  The members are accessed via the PCPU_GET/SET/PTR
  * macros defined in <machine/pcpu.h>.  Machine dependent fields are
  * defined in the PCPU_MD_FIELDS macro defined in <machine/pcpu.h>.
  */
 struct pcpu {
 	struct thread	*pc_curthread;		/* Current thread */
 	struct thread	*pc_idlethread;		/* Idle thread */
 	struct thread	*pc_fpcurthread;	/* Fp state owner */
 	struct thread	*pc_deadthread;		/* Zombie thread or NULL */
 	struct pcb	*pc_curpcb;		/* Current pcb */
 	uint64_t	pc_switchtime;		/* cpu_ticks() at last csw */
 	int		pc_switchticks;		/* `ticks' at last csw */
 	u_int		pc_cpuid;		/* This cpu number */
-	cpumask_t	pc_cpumask;		/* This cpu mask */
-	cpumask_t	pc_other_cpus;		/* Mask of all other cpus */
 	STAILQ_ENTRY(pcpu) pc_allcpu;
 	struct lock_list_entry *pc_spinlocks;
 #ifdef KTR
 	char		pc_name[PCPU_NAME_LEN];	/* String name for KTR */
 #endif
 	struct vmmeter	pc_cnt;			/* VM stats counters */
 	long		pc_cp_time[CPUSTATES];	/* statclock ticks */
 	struct device	*pc_device;
 	void		*pc_netisr;		/* netisr SWI cookie */
 	int		pc_dnweight;		/* vm_page_dontneed() */
 	int		pc_domain;		/* Memory domain. */
 
 	/*
 	 * Stuff for read mostly lock
 	 *
 	 * XXXUPS remove as soon as we have per cpu variable
 	 * linker sets.
 	 */
 	struct rm_queue	pc_rm_queue;
 
 	uintptr_t	pc_dynamic;		/* Dynamic per-cpu data area */
 
 	/*
 	 * Keep MD fields last, so that CPU-specific variations on a
 	 * single architecture don't result in offset variations of
 	 * the machine-independent fields of the pcpu.  Even though
 	 * the pcpu structure is private to the kernel, some ports
 	 * (e.g., lsof, part of gtop) define _KERNEL and include this
 	 * header.  While strictly speaking this is wrong, there's no
 	 * reason not to keep the offsets of the MI fields constant
 	 * if only to make kernel debugging easier.
 	 */
 	PCPU_MD_FIELDS;
+
+	/*
+	 * XXX
+	 * For the time being, keep the cpuset_t objects as the very last
+	 * members of the structure.
+	 * They are actually tagged to be removed soon, but as long as this
+	 * does not happen, it is necessary to find a way to easily
+	 * implement interfaces to userland, and leaving them last makes that
+	 * possible.
+	 */
+	cpuset_t	pc_cpumask;		/* This cpu mask */
+	cpuset_t	pc_other_cpus;		/* Mask of all other cpus */
 } __aligned(CACHE_LINE_SIZE);
 
 #ifdef _KERNEL
 
 STAILQ_HEAD(cpuhead, pcpu);
 
 extern struct cpuhead cpuhead;
 extern struct pcpu *cpuid_to_pcpu[MAXCPU];
 
 #define	curcpu		PCPU_GET(cpuid)
 #define	curproc		(curthread->td_proc)
 #ifndef curthread
 #define	curthread	PCPU_GET(curthread)
 #endif
 #define	curvidata	PCPU_GET(vidata)
 
 /*
  * Machine dependent callouts.  cpu_pcpu_init() is responsible for
  * initializing machine dependent fields of struct pcpu, and
  * db_show_mdpcpu() is responsible for handling machine dependent
  * fields for the DDB 'show pcpu' command.
  */
 void	cpu_pcpu_init(struct pcpu *pcpu, int cpuid, size_t size);
 void	db_show_mdpcpu(struct pcpu *pcpu);
 
 void	*dpcpu_alloc(int size);
 void	dpcpu_copy(void *s, int size);
 void	dpcpu_free(void *s, int size);
 void	dpcpu_init(void *dpcpu, int cpuid);
 void	pcpu_destroy(struct pcpu *pcpu);
 struct	pcpu *pcpu_find(u_int cpuid);
 void	pcpu_init(struct pcpu *pcpu, int cpuid, size_t size);
 
 #endif /* _KERNEL */
 
 #endif /* !_SYS_PCPU_H_ */
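Reviewer note on the pcpu.h hunk above: the new comment says the cpuset_t members are kept last so that userland interfaces stay easy to implement. One plausible reading (an assumption; the comment does not spell it out) is that a member whose size depends on the configured CPU count shifts every field placed after it, so putting it last keeps the earlier offsets fixed. The sketch below illustrates only that layout property. All toy_* names are hypothetical and are not taken from the FreeBSD tree.

```c
#include <stddef.h>

/* Hypothetical CPU sets of two different sizes (e.g. two MAXCPU values). */
struct toy_set1 { unsigned long bits[1]; };
struct toy_set4 { unsigned long bits[4]; };

/* Variable-size member placed last: earlier offsets do not depend on it. */
struct toy_last1 { unsigned int cpuid; void *dynamic; struct toy_set1 cpumask; };
struct toy_last4 { unsigned int cpuid; void *dynamic; struct toy_set4 cpumask; };

/* Variable-size member in the middle: it pushes later fields around. */
struct toy_mid1 { unsigned int cpuid; struct toy_set1 cpumask; void *dynamic; };
struct toy_mid4 { unsigned int cpuid; struct toy_set4 cpumask; void *dynamic; };

/* Returns nonzero when 'dynamic' sits at the same offset in both builds. */
static int
toy_last_is_stable(void)
{
	return (offsetof(struct toy_last1, dynamic) ==
	    offsetof(struct toy_last4, dynamic));
}

/* Same comparison for the layout with the set in the middle. */
static int
toy_mid_is_stable(void)
{
	return (offsetof(struct toy_mid1, dynamic) ==
	    offsetof(struct toy_mid4, dynamic));
}
```

With the set last, a consumer built against a different set size would still find cpuid and dynamic where it expects them; with the set in the middle it would not.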
Index: head/sys/sys/pmckern.h
===================================================================
--- head/sys/sys/pmckern.h	(revision 222812)
+++ head/sys/sys/pmckern.h	(revision 222813)
@@ -1,140 +1,140 @@
 /*-
  * Copyright (c) 2003-2007, Joseph Koshy
  * Copyright (c) 2007 The FreeBSD Foundation
  * All rights reserved.
  *
  * Portions of this software were developed by A. Joseph Koshy under
  * sponsorship from the FreeBSD Foundation and Google, Inc.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
  * 1. Redistributions of source code must retain the above copyright
  *    notice, this list of conditions and the following disclaimer.
  * 2. Redistributions in binary form must reproduce the above copyright
  *    notice, this list of conditions and the following disclaimer in the
  *    documentation and/or other materials provided with the distribution.
  *
  * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
  * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
  * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
  * ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
  * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
  * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
  * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
  * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
  * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
  * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
  * SUCH DAMAGE.
  *
  * $FreeBSD$
  */
 
 /*
  * PMC interface used by the base kernel.
  */
 
 #ifndef _SYS_PMCKERN_H_
 #define _SYS_PMCKERN_H_
 
 #include <sys/param.h>
 #include <sys/kernel.h>
 #include <sys/lock.h>
 #include <sys/proc.h>
 #include <sys/sx.h>
 
 #define	PMC_FN_PROCESS_EXEC		1
 #define	PMC_FN_CSW_IN			2
 #define	PMC_FN_CSW_OUT			3
 #define	PMC_FN_DO_SAMPLES		4
 #define	PMC_FN_KLD_LOAD			5
 #define	PMC_FN_KLD_UNLOAD		6
 #define	PMC_FN_MMAP			7
 #define	PMC_FN_MUNMAP			8
 #define	PMC_FN_USER_CALLCHAIN		9
 
 struct pmckern_procexec {
 	int		pm_credentialschanged;
 	uintfptr_t	pm_entryaddr;
 };
 
 struct pmckern_map_in {
 	void		*pm_file;	/* filename or vnode pointer */
 	uintfptr_t	pm_address;	/* address object is loaded at */
 };
 
 struct pmckern_map_out {
 	uintfptr_t	pm_address;	/* start address of region */
 	size_t		pm_size;	/* size of unmapped region */
 };
 
 /* hook */
 extern int (*pmc_hook)(struct thread *_td, int _function, void *_arg);
 extern int (*pmc_intr)(int _cpu, struct trapframe *_frame);
 
 /* SX lock protecting the hook */
 extern struct sx pmc_sx;
 
 /* Per-cpu flags indicating availability of sampling data */
-extern volatile cpumask_t pmc_cpumask;
+extern volatile cpuset_t pmc_cpumask;
 
 /* Count of system-wide sampling PMCs in existence */
 extern volatile int pmc_ss_count;
 
 /* kernel version number */
 extern const int pmc_kernel_version;
 
 /* Hook invocation; for use within the kernel */
 #define	PMC_CALL_HOOK(t, cmd, arg)		\
 do {						\
 	sx_slock(&pmc_sx);			\
 	if (pmc_hook != NULL)			\
 		(pmc_hook)((t), (cmd), (arg));	\
 	sx_sunlock(&pmc_sx);			\
 } while (0)
 
 /* Hook invocation that needs an exclusive lock */
 #define	PMC_CALL_HOOK_X(t, cmd, arg)		\
 do {						\
 	sx_xlock(&pmc_sx);			\
 	if (pmc_hook != NULL)			\
 		(pmc_hook)((t), (cmd), (arg));	\
 	sx_xunlock(&pmc_sx);			\
 } while (0)
 
 /*
  * Some hook invocations (e.g., from context switch and clock handling
  * code) need to be lock-free.
  */
 #define	PMC_CALL_HOOK_UNLOCKED(t, cmd, arg)	\
 do {						\
 	if (pmc_hook != NULL)			\
 		(pmc_hook)((t), (cmd), (arg));	\
 } while (0)
 
 #define	PMC_SWITCH_CONTEXT(t,cmd)	PMC_CALL_HOOK_UNLOCKED(t,cmd,NULL)
 
 /* Check if a process is using HWPMCs.*/
 #define PMC_PROC_IS_USING_PMCS(p)				\
 	(__predict_false(atomic_load_acq_int(&(p)->p_flag) &	\
 	    P_HWPMC))
 
 #define	PMC_SYSTEM_SAMPLING_ACTIVE()		(pmc_ss_count > 0)
 
 /* Check if a CPU has recorded samples. */
-#define	PMC_CPU_HAS_SAMPLES(C)	(__predict_false(pmc_cpumask & (1 << (C))))
+#define	PMC_CPU_HAS_SAMPLES(C)	(__predict_false(CPU_ISSET(C, &pmc_cpumask)))
 
 /*
  * Helper functions.
  */
 int		pmc_cpu_is_disabled(int _cpu);  /* deprecated */
 int		pmc_cpu_is_active(int _cpu);
 int		pmc_cpu_is_present(int _cpu);
 int		pmc_cpu_is_primary(int _cpu);
 unsigned int	pmc_cpu_max(void);
 
 #ifdef	INVARIANTS
 int		pmc_cpu_max_active(void);
 #endif
 
 #endif /* _SYS_PMCKERN_H_ */
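Reviewer note on the PMC_CPU_HAS_SAMPLES change above: the old form tested `pmc_cpumask & (1 << (C))`, which can only describe CPU ids smaller than the width of the integer being shifted. The new form delegates to CPU_ISSET on a cpuset_t. The following is a minimal sketch of how a multi-word set lifts that limit. The toy_* names and the 128-CPU limit are hypothetical stand-ins; this is not the definition from sys/_cpuset.h.

```c
#include <limits.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical limits; the real values come from the kernel configuration. */
#define	TOY_MAXCPU	128
#define	TOY_WORD_BITS	(sizeof(unsigned long) * CHAR_BIT)
#define	TOY_NWORDS	((TOY_MAXCPU + TOY_WORD_BITS - 1) / TOY_WORD_BITS)

/* A CPU set stored as an array of words instead of a single integer. */
typedef struct {
	unsigned long bits[TOY_NWORDS];
} toy_cpuset_t;

/* Clear every CPU in the set. */
static inline void
toy_cpu_zero(toy_cpuset_t *set)
{
	memset(set, 0, sizeof(*set));
}

/* Mark one CPU as a member: pick the word, then the bit inside it. */
static inline void
toy_cpu_set(unsigned int cpu, toy_cpuset_t *set)
{
	set->bits[cpu / TOY_WORD_BITS] |= 1UL << (cpu % TOY_WORD_BITS);
}

/* Remove one CPU from the set. */
static inline void
toy_cpu_clr(unsigned int cpu, toy_cpuset_t *set)
{
	set->bits[cpu / TOY_WORD_BITS] &= ~(1UL << (cpu % TOY_WORD_BITS));
}

/* Membership test, the analogue of the old (mask & (1 << cpu)) check. */
static inline bool
toy_cpu_isset(unsigned int cpu, const toy_cpuset_t *set)
{
	return (((set->bits[cpu / TOY_WORD_BITS] >>
	    (cpu % TOY_WORD_BITS)) & 1UL) != 0);
}
```

The caller passes a CPU id below TOY_MAXCPU; ids such as 100, which a single 32-bit or 64-bit mask cannot hold, work the same way as small ones.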
Index: head/sys/sys/smp.h
===================================================================
--- head/sys/sys/smp.h	(revision 222812)
+++ head/sys/sys/smp.h	(revision 222813)
@@ -1,183 +1,185 @@
 /*-
  * ----------------------------------------------------------------------------
  * "THE BEER-WARE LICENSE" (Revision 42):
  * <phk@FreeBSD.org> wrote this file.  As long as you retain this notice you
  * can do whatever you want with this stuff. If we meet some day, and you think
  * this stuff is worth it, you can buy me a beer in return.   Poul-Henning Kamp
  * ----------------------------------------------------------------------------
  *
  * $FreeBSD$
  */
 
 #ifndef _SYS_SMP_H_
 #define _SYS_SMP_H_
 
 #ifdef _KERNEL
 
 #ifndef LOCORE
 
+#include <sys/cpuset.h>
+
 /*
  * Topology of a NUMA or HTT system.
  *
  * The top level topology is an array of pointers to groups.  Each group
  * contains a bitmask of cpus in its group or subgroups.  It may also
  * contain a pointer to an array of child groups.
  *
  * The bitmasks at non leaf groups may be used by consumers who support
  * a smaller depth than the hardware provides.
  *
  * The topology may be omitted by systems where all CPUs are equal.
  */
 
 struct cpu_group {
 	struct cpu_group *cg_parent;	/* Our parent group. */
 	struct cpu_group *cg_child;	/* Optional children groups. */
-	cpumask_t	cg_mask;	/* Mask of cpus in this group. */
+	cpuset_t	cg_mask;	/* Mask of cpus in this group. */
 	int32_t		cg_count;	/* Count of cpus in this group. */
 	int16_t		cg_children;	/* Number of children groups. */
 	int8_t		cg_level;	/* Shared cache level. */
 	int8_t		cg_flags;	/* Traversal modifiers. */
 };
 
 typedef struct cpu_group *cpu_group_t;
 
 /*
  * Defines common resources for CPUs in the group.  The highest level
  * resource should be used when multiple are shared.
  */
 #define	CG_SHARE_NONE	0
 #define	CG_SHARE_L1	1
 #define	CG_SHARE_L2	2
 #define	CG_SHARE_L3	3
 
 /*
  * Behavior modifiers for load balancing and affinity.
  */
 #define	CG_FLAG_HTT	0x01		/* Schedule the alternate core last. */
 #define	CG_FLAG_SMT	0x02		/* New age htt, less crippled. */
 #define	CG_FLAG_THREAD	(CG_FLAG_HTT | CG_FLAG_SMT)	/* Any threading. */
 
 /*
  * Convenience routines for building topologies.
  */
 #ifdef SMP
 struct cpu_group *smp_topo(void);
 struct cpu_group *smp_topo_none(void);
 struct cpu_group *smp_topo_1level(int l1share, int l1count, int l1flags);
 struct cpu_group *smp_topo_2level(int l2share, int l2count, int l1share,
     int l1count, int l1flags);
 struct cpu_group *smp_topo_find(struct cpu_group *top, int cpu);
 
 extern void (*cpustop_restartfunc)(void);
 extern int smp_active;
 extern int smp_cpus;
-extern volatile cpumask_t started_cpus;
-extern volatile cpumask_t stopped_cpus;
-extern cpumask_t hlt_cpus_mask;
-extern cpumask_t logical_cpus_mask;
+extern volatile cpuset_t started_cpus;
+extern volatile cpuset_t stopped_cpus;
+extern cpuset_t hlt_cpus_mask;
+extern cpuset_t logical_cpus_mask;
 #endif /* SMP */
 
 extern u_int mp_maxid;
 extern int mp_maxcpus;
 extern int mp_ncpus;
 extern volatile int smp_started;
 
-extern cpumask_t all_cpus;
+extern cpuset_t all_cpus;
 
 /*
  * Macro allowing us to determine whether a CPU is absent at any given
  * time, thus permitting us to configure sparse maps of cpuid-dependent
  * (per-CPU) structures.
  */
-#define	CPU_ABSENT(x_cpu)	((all_cpus & (1 << (x_cpu))) == 0)
+#define	CPU_ABSENT(x_cpu)	(!CPU_ISSET(x_cpu, &all_cpus))
 
 /*
  * Macros to iterate over non-absent CPUs.  CPU_FOREACH() takes an
  * integer iterator and iterates over the available set of CPUs.
  * CPU_FIRST() returns the id of the first non-absent CPU.  CPU_NEXT()
  * returns the id of the next non-absent CPU.  It will wrap back to
  * CPU_FIRST() once the end of the list is reached.  The iterators are
  * currently implemented via inline functions.
  */
 #define	CPU_FOREACH(i)							\
 	for ((i) = 0; (i) <= mp_maxid; (i)++)				\
 		if (!CPU_ABSENT((i)))
 
 static __inline int
 cpu_first(void)
 {
 	int i;
 
 	for (i = 0;; i++)
 		if (!CPU_ABSENT(i))
 			return (i);
 }
 
 static __inline int
 cpu_next(int i)
 {
 
 	for (;;) {
 		i++;
 		if (i > mp_maxid)
 			i = 0;
 		if (!CPU_ABSENT(i))
 			return (i);
 	}
 }
 
 #define	CPU_FIRST()	cpu_first()
 #define	CPU_NEXT(i)	cpu_next((i))
 
 #ifdef SMP
 /*
  * Machine dependent functions used to initialize MP support.
  *
  * The cpu_mp_probe() should check to see if MP support is present and return
  * zero if it is not or non-zero if it is.  If MP support is present, then
  * cpu_mp_start() will be called so that MP can be enabled.  This function
  * should do things such as startup secondary processors.  It should also
  * setup mp_ncpus, all_cpus, and smp_cpus.  It should also ensure that
  * smp_active and smp_started are initialized at the appropriate time.
  * Once cpu_mp_start() returns, machine independent MP startup code will be
  * executed and a simple message will be output to the console.  Finally,
  * cpu_mp_announce() will be called so that machine dependent messages about
  * the MP support may be output to the console if desired.
  *
  * The cpu_setmaxid() function is called very early during the boot process
  * so that the MD code may set mp_maxid to provide an upper bound on CPU IDs
  * that other subsystems may use.  If a platform is not able to determine
  * the exact maximum ID that early, then it may set mp_maxid to MAXCPU - 1.
  */
 struct thread;
 
 struct cpu_group *cpu_topo(void);
 void	cpu_mp_announce(void);
 int	cpu_mp_probe(void);
 void	cpu_mp_setmaxid(void);
 void	cpu_mp_start(void);
 
 void	forward_signal(struct thread *);
-int	restart_cpus(cpumask_t);
-int	stop_cpus(cpumask_t);
-int	stop_cpus_hard(cpumask_t);
+int	restart_cpus(cpuset_t);
+int	stop_cpus(cpuset_t);
+int	stop_cpus_hard(cpuset_t);
 #if defined(__amd64__)
-int	suspend_cpus(cpumask_t);
+int	suspend_cpus(cpuset_t);
 #endif
 void	smp_rendezvous_action(void);
 extern	struct mtx smp_ipi_mtx;
 
 #endif /* SMP */
 void	smp_no_rendevous_barrier(void *);
 void	smp_rendezvous(void (*)(void *), 
 		       void (*)(void *),
 		       void (*)(void *),
 		       void *arg);
-void	smp_rendezvous_cpus(cpumask_t,
+void	smp_rendezvous_cpus(cpuset_t,
 		       void (*)(void *), 
 		       void (*)(void *),
 		       void (*)(void *),
 		       void *arg);
 #endif /* !LOCORE */
 #endif /* _KERNEL */
 #endif /* _SYS_SMP_H_ */
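Reviewer note on the smp.h hunk above: CPU_ABSENT now goes through CPU_ISSET, while CPU_FOREACH and cpu_next() are unchanged and keep skipping absent ids, with cpu_next() wrapping back to the start. The sketch below mirrors that iteration pattern over a small sparse map. The toy_* names, the fixed TOY_MAXID, and the presence table are hypothetical; the kernel consults all_cpus and mp_maxid instead.

```c
#include <stdbool.h>

#define	TOY_MAXID	7	/* hypothetical stand-in for mp_maxid */

/* Hypothetical sparse map: only CPUs 0, 2 and 5 are present. */
static const bool toy_present[TOY_MAXID + 1] = {
	true, false, true, false, false, true, false, false
};

#define	TOY_CPU_ABSENT(i)	(!toy_present[(i)])

/* Visit every id up to the maximum, running the body only if present. */
#define	TOY_CPU_FOREACH(i)						\
	for ((i) = 0; (i) <= TOY_MAXID; (i)++)				\
		if (!TOY_CPU_ABSENT((i)))

/* Next present CPU after i, wrapping to 0 past the maximum id. */
static int
toy_cpu_next(int i)
{

	for (;;) {
		i++;
		if (i > TOY_MAXID)
			i = 0;
		if (!TOY_CPU_ABSENT(i))
			return (i);
	}
}

/* Count present CPUs using the iterator macro. */
static int
toy_cpu_count(void)
{
	int i, n;

	n = 0;
	TOY_CPU_FOREACH(i)
		n++;
	return (n);
}
```

As in the header, toy_cpu_next() assumes at least one CPU is present; with an empty map it would not terminate.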
Index: head/sys/sys/types.h
===================================================================
--- head/sys/sys/types.h	(revision 222812)
+++ head/sys/sys/types.h	(revision 222813)
@@ -1,312 +1,311 @@
 /*-
  * Copyright (c) 1982, 1986, 1991, 1993, 1994
  *	The Regents of the University of California.  All rights reserved.
  * (c) UNIX System Laboratories, Inc.
  * All or some portions of this file are derived from material licensed
  * to the University of California by American Telephone and Telegraph
  * Co. or Unix System Laboratories, Inc. and are reproduced herein with
  * the permission of UNIX System Laboratories, Inc.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
  * 1. Redistributions of source code must retain the above copyright
  *    notice, this list of conditions and the following disclaimer.
  * 2. Redistributions in binary form must reproduce the above copyright
  *    notice, this list of conditions and the following disclaimer in the
  *    documentation and/or other materials provided with the distribution.
  * 4. Neither the name of the University nor the names of its contributors
  *    may be used to endorse or promote products derived from this software
  *    without specific prior written permission.
  *
  * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
  * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
  * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
  * ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
  * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
  * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
  * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
  * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
  * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
  * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
  * SUCH DAMAGE.
  *
  *	@(#)types.h	8.6 (Berkeley) 2/19/95
  * $FreeBSD$
  */
 
 #ifndef _SYS_TYPES_H_
 #define	_SYS_TYPES_H_
 
 #include <sys/cdefs.h>
 
 /* Machine type dependent parameters. */
 #include <machine/endian.h>
 #include <sys/_types.h>
 
 #include <sys/_pthreadtypes.h>
 
 #if __BSD_VISIBLE
 typedef	unsigned char	u_char;
 typedef	unsigned short	u_short;
 typedef	unsigned int	u_int;
 typedef	unsigned long	u_long;
 #ifndef _KERNEL
 typedef	unsigned short	ushort;		/* Sys V compatibility */
 typedef	unsigned int	uint;		/* Sys V compatibility */
 #endif
 #endif
 
 /*
  * XXX POSIX sized integrals that should appear only in <sys/stdint.h>.
  */
 #include <sys/_stdint.h>
 
 typedef __uint8_t	u_int8_t;	/* unsigned integrals (deprecated) */
 typedef __uint16_t	u_int16_t;
 typedef __uint32_t	u_int32_t;
 typedef __uint64_t	u_int64_t;
 
 typedef	__uint64_t	u_quad_t;	/* quads (deprecated) */
 typedef	__int64_t	quad_t;
 typedef	quad_t *	qaddr_t;
 
 typedef	char *		caddr_t;	/* core address */
 typedef	__const char *	c_caddr_t;	/* core address, pointer to const */
 typedef	__volatile char *v_caddr_t;	/* core address, pointer to volatile */
 
 #ifndef _BLKSIZE_T_DECLARED
 typedef	__blksize_t	blksize_t;
 #define	_BLKSIZE_T_DECLARED
 #endif
 
 typedef	__cpuwhich_t	cpuwhich_t;
 typedef	__cpulevel_t	cpulevel_t;
 typedef	__cpusetid_t	cpusetid_t;
 
 #ifndef _BLKCNT_T_DECLARED
 typedef	__blkcnt_t	blkcnt_t;
 #define	_BLKCNT_T_DECLARED
 #endif
 
 #ifndef _CLOCK_T_DECLARED
 typedef	__clock_t	clock_t;
 #define	_CLOCK_T_DECLARED
 #endif
 
 #ifndef _CLOCKID_T_DECLARED
 typedef	__clockid_t	clockid_t;
 #define	_CLOCKID_T_DECLARED
 #endif
 
-typedef	__cpumask_t	cpumask_t;
 typedef	__critical_t	critical_t;	/* Critical section value */
 typedef	__int64_t	daddr_t;	/* disk address */
 
 #ifndef _DEV_T_DECLARED
 typedef	__dev_t		dev_t;		/* device number or struct cdev */
 #define	_DEV_T_DECLARED
 #endif
 
 #ifndef _FFLAGS_T_DECLARED
 typedef	__fflags_t	fflags_t;	/* file flags */
 #define	_FFLAGS_T_DECLARED
 #endif
 
 typedef	__fixpt_t	fixpt_t;	/* fixed point number */
 
 #ifndef _FSBLKCNT_T_DECLARED		/* for statvfs() */
 typedef	__fsblkcnt_t	fsblkcnt_t;
 typedef	__fsfilcnt_t	fsfilcnt_t;
 #define	_FSBLKCNT_T_DECLARED
 #endif
 
 #ifndef _GID_T_DECLARED
 typedef	__gid_t		gid_t;		/* group id */
 #define	_GID_T_DECLARED
 #endif
 
 #ifndef _IN_ADDR_T_DECLARED
 typedef	__uint32_t	in_addr_t;	/* base type for internet address */
 #define	_IN_ADDR_T_DECLARED
 #endif
 
 #ifndef _IN_PORT_T_DECLARED
 typedef	__uint16_t	in_port_t;
 #define	_IN_PORT_T_DECLARED
 #endif
 
 #ifndef _ID_T_DECLARED
 typedef	__id_t		id_t;		/* can hold a uid_t or pid_t */
 #define	_ID_T_DECLARED
 #endif
 
 #ifndef _INO_T_DECLARED
 typedef	__ino_t		ino_t;		/* inode number */
 #define	_INO_T_DECLARED
 #endif
 
 #ifndef _KEY_T_DECLARED
 typedef	__key_t		key_t;		/* IPC key (for Sys V IPC) */
 #define	_KEY_T_DECLARED
 #endif
 
 #ifndef _LWPID_T_DECLARED
 typedef	__lwpid_t	lwpid_t;	/* Thread ID (a.k.a. LWP) */
 #define	_LWPID_T_DECLARED
 #endif
 
 #ifndef _MODE_T_DECLARED
 typedef	__mode_t	mode_t;		/* permissions */
 #define	_MODE_T_DECLARED
 #endif
 
 #ifndef _ACCMODE_T_DECLARED
 typedef	__accmode_t	accmode_t;	/* access permissions */
 #define	_ACCMODE_T_DECLARED
 #endif
 
 #ifndef _NLINK_T_DECLARED
 typedef	__nlink_t	nlink_t;	/* link count */
 #define	_NLINK_T_DECLARED
 #endif
 
 #ifndef _OFF_T_DECLARED
 typedef	__off_t		off_t;		/* file offset */
 #define	_OFF_T_DECLARED
 #endif
 
 #ifndef _PID_T_DECLARED
 typedef	__pid_t		pid_t;		/* process id */
 #define	_PID_T_DECLARED
 #endif
 
 typedef	__register_t	register_t;
 
 #ifndef _RLIM_T_DECLARED
 typedef	__rlim_t	rlim_t;		/* resource limit */
 #define	_RLIM_T_DECLARED
 #endif
 
 typedef	__segsz_t	segsz_t;	/* segment size (in pages) */
 
 #ifndef _SIZE_T_DECLARED
 typedef	__size_t	size_t;
 #define	_SIZE_T_DECLARED
 #endif
 
 #ifndef _SSIZE_T_DECLARED
 typedef	__ssize_t	ssize_t;
 #define	_SSIZE_T_DECLARED
 #endif
 
 #ifndef _SUSECONDS_T_DECLARED
 typedef	__suseconds_t	suseconds_t;	/* microseconds (signed) */
 #define	_SUSECONDS_T_DECLARED
 #endif
 
 #ifndef _TIME_T_DECLARED
 typedef	__time_t	time_t;
 #define	_TIME_T_DECLARED
 #endif
 
 #ifndef _TIMER_T_DECLARED
 typedef	__timer_t	timer_t;
 #define	_TIMER_T_DECLARED
 #endif
 
 #ifndef _MQD_T_DECLARED
 typedef	__mqd_t	mqd_t;
 #define	_MQD_T_DECLARED
 #endif
 
 typedef	__u_register_t	u_register_t;
 
 #ifndef _UID_T_DECLARED
 typedef	__uid_t		uid_t;		/* user id */
 #define	_UID_T_DECLARED
 #endif
 
 #ifndef _USECONDS_T_DECLARED
 typedef	__useconds_t	useconds_t;	/* microseconds (unsigned) */
 #define	_USECONDS_T_DECLARED
 #endif
 
 typedef	__vm_offset_t	vm_offset_t;
 typedef	__vm_ooffset_t	vm_ooffset_t;
 typedef	__vm_paddr_t	vm_paddr_t;
 typedef	__vm_pindex_t	vm_pindex_t;
 typedef	__vm_size_t	vm_size_t;
 
 #ifdef _KERNEL
 typedef	int		boolean_t;
 typedef	struct device	*device_t;
 typedef	__intfptr_t	intfptr_t;
 
 /*
  * XXX this is fixed width for historical reasons.  It should have had type
  * __int_fast32_t.  Fixed-width types should not be used unless binary
  * compatibility is essential.  Least-width types should be used even less
  * since they provide smaller benefits.
  *
  * XXX should be MD.
  *
  * XXX this is bogus in -current, but still used for spl*().
  */
 typedef	__uint32_t	intrmask_t;	/* Interrupt mask (spl, xxx_imask...) */
 
 typedef	__uintfptr_t	uintfptr_t;
 typedef	__uint64_t	uoff_t;
 typedef	char		vm_memattr_t;	/* memory attribute codes */
 typedef	struct vm_page	*vm_page_t;
 
 #define offsetof(type, field) __offsetof(type, field)
 
 #endif /* !_KERNEL */
 
 /*
  * The following are all things that really shouldn't exist in this header,
  * since its purpose is to provide typedefs, not miscellaneous doodads.
  */
 #if __BSD_VISIBLE
 
 #include <sys/select.h>
 
 /*
  * minor() gives a cookie instead of an index since we don't want to
  * change the meanings of bits 0-15 or waste time and space shifting
  * bits 16-31 for devices that don't use them.
  */
 #define	major(x)	((int)(((u_int)(x) >> 8)&0xff))	/* major number */
 #define	minor(x)	((int)((x)&0xffff00ff))		/* minor number */
 #define	makedev(x,y)	((dev_t)(((x) << 8) | (y)))	/* create dev_t */
 
 /*
  * These declarations belong elsewhere, but are repeated here and in
  * <stdio.h> to give broken programs a better chance of working with
  * 64-bit off_t's.
  */
 #ifndef _KERNEL
 __BEGIN_DECLS
 #ifndef _FTRUNCATE_DECLARED
 #define	_FTRUNCATE_DECLARED
 int	 ftruncate(int, off_t);
 #endif
 #ifndef _LSEEK_DECLARED
 #define	_LSEEK_DECLARED
 off_t	 lseek(int, off_t, int);
 #endif
 #ifndef _MMAP_DECLARED
 #define	_MMAP_DECLARED
 void *	 mmap(void *, size_t, int, int, int, off_t);
 #endif
 #ifndef _TRUNCATE_DECLARED
 #define	_TRUNCATE_DECLARED
 int	 truncate(const char *, off_t);
 #endif
 __END_DECLS
 #endif /* !_KERNEL */
 
 #endif /* __BSD_VISIBLE */
 
 #endif /* !_SYS_TYPES_H_ */
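
Note on the cpumask_t removal above: a scalar mask tested with
`mask & (1 << cpu)` can only describe as many CPUs as the integer has bits,
whereas a set type backed by an array of words does not have that limit.
The following is a minimal userland sketch of that idea, not the kernel
implementation. The `demo_` names, `DEMO_MAXCPU`, and the struct layout are
all hypothetical and chosen only for illustration; the real definitions are
the ones in the tree.

```c
#include <assert.h>
#include <limits.h>
#include <string.h>

/* Hypothetical stand-in for a CPU set: a fixed-size array of bit words. */
#define	DEMO_MAXCPU	128
#define	DEMO_NBITS	(sizeof(unsigned long) * CHAR_BIT)
#define	DEMO_NWORDS	((DEMO_MAXCPU + DEMO_NBITS - 1) / DEMO_NBITS)

typedef struct {
	unsigned long bits[DEMO_NWORDS];
} demo_cpuset_t;

/* Clear every CPU in the set. */
static void
demo_cpu_zero(demo_cpuset_t *set)
{

	memset(set, 0, sizeof(*set));
}

/* Add one CPU: pick the word by division, the bit by remainder. */
static void
demo_cpu_set(unsigned int cpu, demo_cpuset_t *set)
{

	assert(cpu < DEMO_MAXCPU);
	set->bits[cpu / DEMO_NBITS] |= 1UL << (cpu % DEMO_NBITS);
}

/* Return non-zero if the CPU is a member of the set. */
static int
demo_cpu_isset(unsigned int cpu, const demo_cpuset_t *set)
{

	assert(cpu < DEMO_MAXCPU);
	return ((int)((set->bits[cpu / DEMO_NBITS] >> (cpu % DEMO_NBITS)) & 1UL));
}
```

With a layout like this, a membership test for a CPU number beyond the width
of a single word (for example 70 on a 64-bit machine) still works, which a
single-word shift-and-mask test cannot do.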
Index: head/sys/x86/x86/local_apic.c
===================================================================
--- head/sys/x86/x86/local_apic.c	(revision 222812)
+++ head/sys/x86/x86/local_apic.c	(revision 222813)
@@ -1,1508 +1,1508 @@
 /*-
  * Copyright (c) 2003 John Baldwin <jhb@FreeBSD.org>
  * Copyright (c) 1996, by Steve Passe
  * All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
  * 1. Redistributions of source code must retain the above copyright
  *    notice, this list of conditions and the following disclaimer.
  * 2. The name of the developer may NOT be used to endorse or promote products
  *    derived from this software without specific prior written permission.
  * 3. Neither the name of the author nor the names of any co-contributors
  *    may be used to endorse or promote products derived from this software
  *    without specific prior written permission.
  *
  * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
  * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
  * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
  * ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
  * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
  * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
  * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
  * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
  * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
  * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
  * SUCH DAMAGE.
  */
 
 /*
  * Local APIC support on Pentium and later processors.
  */
 
 #include <sys/cdefs.h>
 __FBSDID("$FreeBSD$");
 
 #include "opt_hwpmc_hooks.h"
 #include "opt_kdtrace.h"
 
 #include "opt_ddb.h"
 
 #include <sys/param.h>
 #include <sys/systm.h>
 #include <sys/bus.h>
 #include <sys/kernel.h>
 #include <sys/lock.h>
 #include <sys/mutex.h>
 #include <sys/pcpu.h>
 #include <sys/proc.h>
 #include <sys/sched.h>
 #include <sys/smp.h>
 #include <sys/timeet.h>
 
 #include <vm/vm.h>
 #include <vm/pmap.h>
 
 #include <x86/apicreg.h>
 #include <machine/cpu.h>
 #include <machine/cputypes.h>
 #include <machine/frame.h>
 #include <machine/intr_machdep.h>
 #include <machine/apicvar.h>
 #include <x86/mca.h>
 #include <machine/md_var.h>
 #include <machine/smp.h>
 #include <machine/specialreg.h>
 
 #ifdef DDB
 #include <sys/interrupt.h>
 #include <ddb/ddb.h>
 #endif
 
 #ifdef __amd64__
 #define	SDT_APIC	SDT_SYSIGT
 #define	SDT_APICT	SDT_SYSIGT
 #define	GSEL_APIC	0
 #else
 #define	SDT_APIC	SDT_SYS386IGT
 #define	SDT_APICT	SDT_SYS386TGT
 #define	GSEL_APIC	GSEL(GCODE_SEL, SEL_KPL)
 #endif
 
 /* Sanity checks on IDT vectors. */
 CTASSERT(APIC_IO_INTS + APIC_NUM_IOINTS == APIC_TIMER_INT);
 CTASSERT(APIC_TIMER_INT < APIC_LOCAL_INTS);
 CTASSERT(APIC_LOCAL_INTS == 240);
 CTASSERT(IPI_STOP < APIC_SPURIOUS_INT);
 
 /* Magic IRQ values for the timer and syscalls. */
 #define	IRQ_TIMER	(NUM_IO_INTS + 1)
 #define	IRQ_SYSCALL	(NUM_IO_INTS + 2)
 #define	IRQ_DTRACE_RET	(NUM_IO_INTS + 3)
 
 /*
  * Support for local APICs.  Local APICs manage interrupts on each
  * individual processor as opposed to I/O APICs which receive interrupts
  * from I/O devices and then forward them on to the local APICs.
  *
  * Local APICs can also send interrupts to each other thus providing the
  * mechanism for IPIs.
  */
 
 struct lvt {
 	u_int lvt_edgetrigger:1;
 	u_int lvt_activehi:1;
 	u_int lvt_masked:1;
 	u_int lvt_active:1;
 	u_int lvt_mode:16;
 	u_int lvt_vector:8;
 };
 
 struct lapic {
 	struct lvt la_lvts[LVT_MAX + 1];
 	u_int la_id:8;
 	u_int la_cluster:4;
 	u_int la_cluster_id:2;
 	u_int la_present:1;
 	u_long *la_timer_count;
 	u_long la_timer_period;
 	u_int la_timer_mode;
 	/* Include IDT_SYSCALL to make indexing easier. */
 	int la_ioint_irqs[APIC_NUM_IOINTS + 1];
 } static lapics[MAX_APIC_ID + 1];
 
 /* Global defaults for local APIC LVT entries. */
 static struct lvt lvts[LVT_MAX + 1] = {
 	{ 1, 1, 1, 1, APIC_LVT_DM_EXTINT, 0 },	/* LINT0: masked ExtINT */
 	{ 1, 1, 0, 1, APIC_LVT_DM_NMI, 0 },	/* LINT1: NMI */
 	{ 1, 1, 1, 1, APIC_LVT_DM_FIXED, APIC_TIMER_INT },	/* Timer */
 	{ 1, 1, 0, 1, APIC_LVT_DM_FIXED, APIC_ERROR_INT },	/* Error */
 	{ 1, 1, 1, 1, APIC_LVT_DM_NMI, 0 },	/* PMC */
 	{ 1, 1, 1, 1, APIC_LVT_DM_FIXED, APIC_THERMAL_INT },	/* Thermal */
 	{ 1, 1, 1, 1, APIC_LVT_DM_FIXED, APIC_CMC_INT },	/* CMCI */
 };
 
 static inthand_t *ioint_handlers[] = {
 	NULL,			/* 0 - 31 */
 	IDTVEC(apic_isr1),	/* 32 - 63 */
 	IDTVEC(apic_isr2),	/* 64 - 95 */
 	IDTVEC(apic_isr3),	/* 96 - 127 */
 	IDTVEC(apic_isr4),	/* 128 - 159 */
 	IDTVEC(apic_isr5),	/* 160 - 191 */
 	IDTVEC(apic_isr6),	/* 192 - 223 */
 	IDTVEC(apic_isr7),	/* 224 - 255 */
 };
 
 
 static u_int32_t lapic_timer_divisors[] = {
 	APIC_TDCR_1, APIC_TDCR_2, APIC_TDCR_4, APIC_TDCR_8, APIC_TDCR_16,
 	APIC_TDCR_32, APIC_TDCR_64, APIC_TDCR_128
 };
 
 extern inthand_t IDTVEC(rsvd);
 
 volatile lapic_t *lapic;
 vm_paddr_t lapic_paddr;
 static u_long lapic_timer_divisor;
 static struct eventtimer lapic_et;
 
 static void	lapic_enable(void);
 static void	lapic_resume(struct pic *pic);
 static void	lapic_timer_oneshot(u_int count, int enable_int);
 static void	lapic_timer_periodic(u_int count, int enable_int);
 static void	lapic_timer_stop(void);
 static void	lapic_timer_set_divisor(u_int divisor);
 static uint32_t	lvt_mode(struct lapic *la, u_int pin, uint32_t value);
 static int	lapic_et_start(struct eventtimer *et,
     struct bintime *first, struct bintime *period);
 static int	lapic_et_stop(struct eventtimer *et);
 
 struct pic lapic_pic = { .pic_resume = lapic_resume };
 
 static uint32_t
 lvt_mode(struct lapic *la, u_int pin, uint32_t value)
 {
 	struct lvt *lvt;
 
 	KASSERT(pin <= LVT_MAX, ("%s: pin %u out of range", __func__, pin));
 	if (la->la_lvts[pin].lvt_active)
 		lvt = &la->la_lvts[pin];
 	else
 		lvt = &lvts[pin];
 
 	value &= ~(APIC_LVT_M | APIC_LVT_TM | APIC_LVT_IIPP | APIC_LVT_DM |
 	    APIC_LVT_VECTOR);
 	if (lvt->lvt_edgetrigger == 0)
 		value |= APIC_LVT_TM;
 	if (lvt->lvt_activehi == 0)
 		value |= APIC_LVT_IIPP_INTALO;
 	if (lvt->lvt_masked)
 		value |= APIC_LVT_M;
 	value |= lvt->lvt_mode;
 	switch (lvt->lvt_mode) {
 	case APIC_LVT_DM_NMI:
 	case APIC_LVT_DM_SMI:
 	case APIC_LVT_DM_INIT:
 	case APIC_LVT_DM_EXTINT:
 		if (!lvt->lvt_edgetrigger) {
 			printf("lapic%u: Forcing LINT%u to edge trigger\n",
 			    la->la_id, pin);
 			value |= APIC_LVT_TM;
 		}
 		/* Use a vector of 0. */
 		break;
 	case APIC_LVT_DM_FIXED:
 		value |= lvt->lvt_vector;
 		break;
 	default:
 		panic("bad APIC LVT delivery mode: %#x\n", value);
 	}
 	return (value);
 }
 
 /*
  * Map the local APIC and setup necessary interrupt vectors.
  */
 void
 lapic_init(vm_paddr_t addr)
 {
 	u_int regs[4];
 	int i, arat;
 
 	/* Map the local APIC and setup the spurious interrupt handler. */
 	KASSERT(trunc_page(addr) == addr,
 	    ("local APIC not aligned on a page boundary"));
 	lapic = pmap_mapdev(addr, sizeof(lapic_t));
 	lapic_paddr = addr;
 	setidt(APIC_SPURIOUS_INT, IDTVEC(spuriousint), SDT_APIC, SEL_KPL,
 	    GSEL_APIC);
 
 	/* Perform basic initialization of the BSP's local APIC. */
 	lapic_enable();
 
 	/* Set BSP's per-CPU local APIC ID. */
 	PCPU_SET(apic_id, lapic_id());
 
 	/* Local APIC timer interrupt. */
 	setidt(APIC_TIMER_INT, IDTVEC(timerint), SDT_APIC, SEL_KPL, GSEL_APIC);
 
 	/* Local APIC error interrupt. */
 	setidt(APIC_ERROR_INT, IDTVEC(errorint), SDT_APIC, SEL_KPL, GSEL_APIC);
 
 	/* XXX: Thermal interrupt */
 
 	/* Local APIC CMCI. */
 	setidt(APIC_CMC_INT, IDTVEC(cmcint), SDT_APICT, SEL_KPL, GSEL_APIC);
 
 	if ((resource_int_value("apic", 0, "clock", &i) != 0 || i != 0)) {
 		arat = 0;
 		/* Intel CPUID 0x06 EAX[2] set if APIC timer runs in C3. */
 		if (cpu_vendor_id == CPU_VENDOR_INTEL && cpu_high >= 6) {
 			do_cpuid(0x06, regs);
 			if ((regs[0] & CPUTPM1_ARAT) != 0)
 				arat = 1;
 		}
 		bzero(&lapic_et, sizeof(lapic_et));
 		lapic_et.et_name = "LAPIC";
 		lapic_et.et_flags = ET_FLAGS_PERIODIC | ET_FLAGS_ONESHOT |
 		    ET_FLAGS_PERCPU;
 		lapic_et.et_quality = 600;
 		if (!arat) {
 			lapic_et.et_flags |= ET_FLAGS_C3STOP;
 			lapic_et.et_quality -= 200;
 		}
 		lapic_et.et_frequency = 0;
 		/* We don't know frequency yet, so trying to guess. */
 		lapic_et.et_min_period.sec = 0;
 		lapic_et.et_min_period.frac = 0x00001000LL << 32;
 		lapic_et.et_max_period.sec = 1;
 		lapic_et.et_max_period.frac = 0;
 		lapic_et.et_start = lapic_et_start;
 		lapic_et.et_stop = lapic_et_stop;
 		lapic_et.et_priv = NULL;
 		et_register(&lapic_et);
 	}
 }
 
 /*
  * Create a local APIC instance.
  */
 void
 lapic_create(u_int apic_id, int boot_cpu)
 {
 	int i;
 
 	if (apic_id > MAX_APIC_ID) {
 		printf("APIC: Ignoring local APIC with ID %d\n", apic_id);
 		if (boot_cpu)
 			panic("Can't ignore BSP");
 		return;
 	}
 	KASSERT(!lapics[apic_id].la_present, ("duplicate local APIC %u",
 	    apic_id));
 
 	/*
 	 * Assume no local LVT overrides and a cluster of 0 and
 	 * intra-cluster ID of 0.
 	 */
 	lapics[apic_id].la_present = 1;
 	lapics[apic_id].la_id = apic_id;
 	for (i = 0; i <= LVT_MAX; i++) {
 		lapics[apic_id].la_lvts[i] = lvts[i];
 		lapics[apic_id].la_lvts[i].lvt_active = 0;
 	}
 	for (i = 0; i <= APIC_NUM_IOINTS; i++)
 	    lapics[apic_id].la_ioint_irqs[i] = -1;
 	lapics[apic_id].la_ioint_irqs[IDT_SYSCALL - APIC_IO_INTS] = IRQ_SYSCALL;
 	lapics[apic_id].la_ioint_irqs[APIC_TIMER_INT - APIC_IO_INTS] =
 	    IRQ_TIMER;
 #ifdef KDTRACE_HOOKS
 	lapics[apic_id].la_ioint_irqs[IDT_DTRACE_RET - APIC_IO_INTS] = IRQ_DTRACE_RET;
 #endif
 
 
 #ifdef SMP
 	cpu_add(apic_id, boot_cpu);
 #endif
 }
 
 /*
  * Dump contents of local APIC registers
  */
 void
 lapic_dump(const char* str)
 {
 	uint32_t maxlvt;
 
 	maxlvt = (lapic->version & APIC_VER_MAXLVT) >> MAXLVTSHIFT;
 	printf("cpu%d %s:\n", PCPU_GET(cpuid), str);
 	printf("     ID: 0x%08x   VER: 0x%08x LDR: 0x%08x DFR: 0x%08x\n",
 	    lapic->id, lapic->version, lapic->ldr, lapic->dfr);
 	printf("  lint0: 0x%08x lint1: 0x%08x TPR: 0x%08x SVR: 0x%08x\n",
 	    lapic->lvt_lint0, lapic->lvt_lint1, lapic->tpr, lapic->svr);
 	printf("  timer: 0x%08x therm: 0x%08x err: 0x%08x",
 	    lapic->lvt_timer, lapic->lvt_thermal, lapic->lvt_error);
 	if (maxlvt >= LVT_PMC)
 		printf(" pmc: 0x%08x", lapic->lvt_pcint);
 	printf("\n");
 	if (maxlvt >= LVT_CMCI)
 		printf("   cmci: 0x%08x\n", lapic->lvt_cmci);
 }
 
 void
 lapic_setup(int boot)
 {
 	struct lapic *la;
 	u_int32_t maxlvt;
 	register_t saveintr;
 	char buf[MAXCOMLEN + 1];
 
 	la = &lapics[lapic_id()];
 	KASSERT(la->la_present, ("missing APIC structure"));
 	saveintr = intr_disable();
 	maxlvt = (lapic->version & APIC_VER_MAXLVT) >> MAXLVTSHIFT;
 
 	/* Initialize the TPR to allow all interrupts. */
 	lapic_set_tpr(0);
 
 	/* Setup spurious vector and enable the local APIC. */
 	lapic_enable();
 
 	/* Program LINT[01] LVT entries. */
 	lapic->lvt_lint0 = lvt_mode(la, LVT_LINT0, lapic->lvt_lint0);
 	lapic->lvt_lint1 = lvt_mode(la, LVT_LINT1, lapic->lvt_lint1);
 
 	/* Program the PMC LVT entry if present. */
 	if (maxlvt >= LVT_PMC)
 		lapic->lvt_pcint = lvt_mode(la, LVT_PMC, lapic->lvt_pcint);
 
 	/* Program timer LVT and setup handler. */
 	lapic->lvt_timer = lvt_mode(la, LVT_TIMER, lapic->lvt_timer);
 	if (boot) {
 		snprintf(buf, sizeof(buf), "cpu%d:timer", PCPU_GET(cpuid));
 		intrcnt_add(buf, &la->la_timer_count);
 	}
 
 	/* Setup the timer if configured. */
 	if (la->la_timer_mode != 0) {
 		KASSERT(la->la_timer_period != 0, ("lapic%u: zero divisor",
 		    lapic_id()));
 		lapic_timer_set_divisor(lapic_timer_divisor);
 		if (la->la_timer_mode == 1)
 			lapic_timer_periodic(la->la_timer_period, 1);
 		else
 			lapic_timer_oneshot(la->la_timer_period, 1);
 	}
 
 	/* Program error LVT and clear any existing errors. */
 	lapic->lvt_error = lvt_mode(la, LVT_ERROR, lapic->lvt_error);
 	lapic->esr = 0;
 
 	/* XXX: Thermal LVT */
 
 	/* Program the CMCI LVT entry if present. */
 	if (maxlvt >= LVT_CMCI)
 		lapic->lvt_cmci = lvt_mode(la, LVT_CMCI, lapic->lvt_cmci);
 	    
 	intr_restore(saveintr);
 }
 
 void
 lapic_reenable_pmc(void)
 {
 #ifdef HWPMC_HOOKS
 	uint32_t value;
 
 	value =  lapic->lvt_pcint;
 	value &= ~APIC_LVT_M;
 	lapic->lvt_pcint = value;
 #endif
 }
 
 #ifdef HWPMC_HOOKS
 static void
 lapic_update_pmc(void *dummy)
 {
 	struct lapic *la;
 
 	la = &lapics[lapic_id()];
 	lapic->lvt_pcint = lvt_mode(la, LVT_PMC, lapic->lvt_pcint);
 }
 #endif
 
 int
 lapic_enable_pmc(void)
 {
 #ifdef HWPMC_HOOKS
 	u_int32_t maxlvt;
 
 	/* Fail if the local APIC is not present. */
 	if (lapic == NULL)
 		return (0);
 
 	/* Fail if the PMC LVT is not present. */
 	maxlvt = (lapic->version & APIC_VER_MAXLVT) >> MAXLVTSHIFT;
 	if (maxlvt < LVT_PMC)
 		return (0);
 
 	lvts[LVT_PMC].lvt_masked = 0;
 
 #ifdef SMP
 	/*
 	 * If hwpmc was loaded at boot time then the APs may not be
 	 * started yet.  In that case, don't forward the request to
 	 * them as they will program the lvt when they start.
 	 */
 	if (smp_started)
 		smp_rendezvous(NULL, lapic_update_pmc, NULL, NULL);
 	else
 #endif
 		lapic_update_pmc(NULL);
 	return (1);
 #else
 	return (0);
 #endif
 }
 
 void
 lapic_disable_pmc(void)
 {
 #ifdef HWPMC_HOOKS
 	u_int32_t maxlvt;
 
 	/* Fail if the local APIC is not present. */
 	if (lapic == NULL)
 		return;
 
 	/* Fail if the PMC LVT is not present. */
 	maxlvt = (lapic->version & APIC_VER_MAXLVT) >> MAXLVTSHIFT;
 	if (maxlvt < LVT_PMC)
 		return;
 
 	lvts[LVT_PMC].lvt_masked = 1;
 
 #ifdef SMP
 	/* The APs should always be started when hwpmc is unloaded. */
 	KASSERT(mp_ncpus == 1 || smp_started, ("hwpmc unloaded too early"));
 #endif
 	smp_rendezvous(NULL, lapic_update_pmc, NULL, NULL);
 #endif
 }
 
 static int
 lapic_et_start(struct eventtimer *et,
     struct bintime *first, struct bintime *period)
 {
 	struct lapic *la;
 	u_long value;
 
 	if (et->et_frequency == 0) {
 		/* Start off with a divisor of 2 (power on reset default). */
 		lapic_timer_divisor = 2;
 		/* Try to calibrate the local APIC timer. */
 		do {
 			lapic_timer_set_divisor(lapic_timer_divisor);
 			lapic_timer_oneshot(APIC_TIMER_MAX_COUNT, 0);
 			DELAY(1000000);
 			value = APIC_TIMER_MAX_COUNT - lapic->ccr_timer;
 			if (value != APIC_TIMER_MAX_COUNT)
 				break;
 			lapic_timer_divisor <<= 1;
 		} while (lapic_timer_divisor <= 128);
 		if (lapic_timer_divisor > 128)
 			panic("lapic: Divisor too big");
 		if (bootverbose)
 			printf("lapic: Divisor %lu, Frequency %lu Hz\n",
 			    lapic_timer_divisor, value);
 		et->et_frequency = value;
 		et->et_min_period.sec = 0;
 		et->et_min_period.frac =
 		    ((0x00000002LLU << 32) / et->et_frequency) << 32;
 		et->et_max_period.sec = 0xfffffffeLLU / et->et_frequency;
 		et->et_max_period.frac =
 		    ((0xfffffffeLLU << 32) / et->et_frequency) << 32;
 	}
 	lapic_timer_set_divisor(lapic_timer_divisor);
 	la = &lapics[lapic_id()];
 	if (period != NULL) {
 		la->la_timer_mode = 1;
 		la->la_timer_period =
 		    (et->et_frequency * (period->frac >> 32)) >> 32;
 		if (period->sec != 0)
 			la->la_timer_period += et->et_frequency * period->sec;
 		lapic_timer_periodic(la->la_timer_period, 1);
 	} else {
 		la->la_timer_mode = 2;
 		la->la_timer_period =
 		    (et->et_frequency * (first->frac >> 32)) >> 32;
 		if (first->sec != 0)
 			la->la_timer_period += et->et_frequency * first->sec;
 		lapic_timer_oneshot(la->la_timer_period, 1);
 	}
 	return (0);
 }
 
 static int
 lapic_et_stop(struct eventtimer *et)
 {
 	struct lapic *la = &lapics[lapic_id()];
 
 	la->la_timer_mode = 0;
 	lapic_timer_stop();
 	return (0);
 }
 
 void
 lapic_disable(void)
 {
 	uint32_t value;
 
 	/* Software disable the local APIC. */
 	value = lapic->svr;
 	value &= ~APIC_SVR_SWEN;
 	lapic->svr = value;
 }
 
 static void
 lapic_enable(void)
 {
 	u_int32_t value;
 
 	/* Program the spurious vector to enable the local APIC. */
 	value = lapic->svr;
 	value &= ~(APIC_SVR_VECTOR | APIC_SVR_FOCUS);
 	value |= (APIC_SVR_FEN | APIC_SVR_SWEN | APIC_SPURIOUS_INT);
 	lapic->svr = value;
 }
 
 /* Reset the local APIC on the BSP during resume. */
 static void
 lapic_resume(struct pic *pic)
 {
 
 	lapic_setup(0);
 }
 
 int
 lapic_id(void)
 {
 
 	KASSERT(lapic != NULL, ("local APIC is not mapped"));
 	return (lapic->id >> APIC_ID_SHIFT);
 }
 
 int
 lapic_intr_pending(u_int vector)
 {
 	volatile u_int32_t *irr;
 
 	/*
 	 * The IRR registers are an array of 128-bit registers each of
 	 * which only describes 32 interrupts in the low 32 bits..  Thus,
 	 * we divide the vector by 32 to get the 128-bit index.  We then
 	 * multiply that index by 4 to get the equivalent index from
 	 * treating the IRR as an array of 32-bit registers.  Finally, we
 	 * modulus the vector by 32 to determine the individual bit to
 	 * test.
 	 */
 	irr = &lapic->irr0;
 	return (irr[(vector / 32) * 4] & 1 << (vector % 32));
 }
 
 void
 lapic_set_logical_id(u_int apic_id, u_int cluster, u_int cluster_id)
 {
 	struct lapic *la;
 
 	KASSERT(lapics[apic_id].la_present, ("%s: APIC %u doesn't exist",
 	    __func__, apic_id));
 	KASSERT(cluster <= APIC_MAX_CLUSTER, ("%s: cluster %u too big",
 	    __func__, cluster));
 	KASSERT(cluster_id <= APIC_MAX_INTRACLUSTER_ID,
 	    ("%s: intra cluster id %u too big", __func__, cluster_id));
 	la = &lapics[apic_id];
 	la->la_cluster = cluster;
 	la->la_cluster_id = cluster_id;
 }
 
 int
 lapic_set_lvt_mask(u_int apic_id, u_int pin, u_char masked)
 {
 
 	if (pin > LVT_MAX)
 		return (EINVAL);
 	if (apic_id == APIC_ID_ALL) {
 		lvts[pin].lvt_masked = masked;
 		if (bootverbose)
 			printf("lapic:");
 	} else {
 		KASSERT(lapics[apic_id].la_present,
 		    ("%s: missing APIC %u", __func__, apic_id));
 		lapics[apic_id].la_lvts[pin].lvt_masked = masked;
 		lapics[apic_id].la_lvts[pin].lvt_active = 1;
 		if (bootverbose)
 			printf("lapic%u:", apic_id);
 	}
 	if (bootverbose)
 		printf(" LINT%u %s\n", pin, masked ? "masked" : "unmasked");
 	return (0);
 }
 
 int
 lapic_set_lvt_mode(u_int apic_id, u_int pin, u_int32_t mode)
 {
 	struct lvt *lvt;
 
 	if (pin > LVT_MAX)
 		return (EINVAL);
 	if (apic_id == APIC_ID_ALL) {
 		lvt = &lvts[pin];
 		if (bootverbose)
 			printf("lapic:");
 	} else {
 		KASSERT(lapics[apic_id].la_present,
 		    ("%s: missing APIC %u", __func__, apic_id));
 		lvt = &lapics[apic_id].la_lvts[pin];
 		lvt->lvt_active = 1;
 		if (bootverbose)
 			printf("lapic%u:", apic_id);
 	}
 	lvt->lvt_mode = mode;
 	switch (mode) {
 	case APIC_LVT_DM_NMI:
 	case APIC_LVT_DM_SMI:
 	case APIC_LVT_DM_INIT:
 	case APIC_LVT_DM_EXTINT:
 		lvt->lvt_edgetrigger = 1;
 		lvt->lvt_activehi = 1;
 		if (mode == APIC_LVT_DM_EXTINT)
 			lvt->lvt_masked = 1;
 		else
 			lvt->lvt_masked = 0;
 		break;
 	default:
 		panic("Unsupported delivery mode: 0x%x\n", mode);
 	}
 	if (bootverbose) {
 		printf(" Routing ");
 		switch (mode) {
 		case APIC_LVT_DM_NMI:
 			printf("NMI");
 			break;
 		case APIC_LVT_DM_SMI:
 			printf("SMI");
 			break;
 		case APIC_LVT_DM_INIT:
 			printf("INIT");
 			break;
 		case APIC_LVT_DM_EXTINT:
 			printf("ExtINT");
 			break;
 		}
 		printf(" -> LINT%u\n", pin);
 	}
 	return (0);
 }
 
 int
 lapic_set_lvt_polarity(u_int apic_id, u_int pin, enum intr_polarity pol)
 {
 
 	if (pin > LVT_MAX || pol == INTR_POLARITY_CONFORM)
 		return (EINVAL);
 	if (apic_id == APIC_ID_ALL) {
 		lvts[pin].lvt_activehi = (pol == INTR_POLARITY_HIGH);
 		if (bootverbose)
 			printf("lapic:");
 	} else {
 		KASSERT(lapics[apic_id].la_present,
 		    ("%s: missing APIC %u", __func__, apic_id));
 		lapics[apic_id].la_lvts[pin].lvt_active = 1;
 		lapics[apic_id].la_lvts[pin].lvt_activehi =
 		    (pol == INTR_POLARITY_HIGH);
 		if (bootverbose)
 			printf("lapic%u:", apic_id);
 	}
 	if (bootverbose)
 		printf(" LINT%u polarity: %s\n", pin,
 		    pol == INTR_POLARITY_HIGH ? "high" : "low");
 	return (0);
 }
 
 int
 lapic_set_lvt_triggermode(u_int apic_id, u_int pin, enum intr_trigger trigger)
 {
 
 	if (pin > LVT_MAX || trigger == INTR_TRIGGER_CONFORM)
 		return (EINVAL);
 	if (apic_id == APIC_ID_ALL) {
 		lvts[pin].lvt_edgetrigger = (trigger == INTR_TRIGGER_EDGE);
 		if (bootverbose)
 			printf("lapic:");
 	} else {
 		KASSERT(lapics[apic_id].la_present,
 		    ("%s: missing APIC %u", __func__, apic_id));
 		lapics[apic_id].la_lvts[pin].lvt_edgetrigger =
 		    (trigger == INTR_TRIGGER_EDGE);
 		lapics[apic_id].la_lvts[pin].lvt_active = 1;
 		if (bootverbose)
 			printf("lapic%u:", apic_id);
 	}
 	if (bootverbose)
 		printf(" LINT%u trigger: %s\n", pin,
 		    trigger == INTR_TRIGGER_EDGE ? "edge" : "level");
 	return (0);
 }
 
 /*
  * Adjust the TPR of the current CPU so that it blocks all interrupts below
  * the passed in vector.
  */
 void
 lapic_set_tpr(u_int vector)
 {
 #ifdef CHEAP_TPR
 	lapic->tpr = vector;
 #else
 	u_int32_t tpr;
 
 	tpr = lapic->tpr & ~APIC_TPR_PRIO;
 	tpr |= vector;
 	lapic->tpr = tpr;
 #endif
 }
 
 void
 lapic_eoi(void)
 {
 
 	lapic->eoi = 0;
 }
 
 void
 lapic_handle_intr(int vector, struct trapframe *frame)
 {
 	struct intsrc *isrc;
 
 	isrc = intr_lookup_source(apic_idt_to_irq(PCPU_GET(apic_id),
 	    vector));
 	intr_execute_handlers(isrc, frame);
 }
 
 void
 lapic_handle_timer(struct trapframe *frame)
 {
 	struct lapic *la;
 	struct trapframe *oldframe;
 	struct thread *td;
 
 	/* Send EOI first thing. */
 	lapic_eoi();
 
 #if defined(SMP) && !defined(SCHED_ULE)
 	/*
 	 * Don't do any accounting for the disabled HTT cores, since it
 	 * will provide misleading numbers for the userland.
 	 *
 	 * No locking is necessary here, since even if we loose the race
 	 * when hlt_cpus_mask changes it is not a big deal, really.
 	 *
 	 * Don't do that for ULE, since ULE doesn't consider hlt_cpus_mask
 	 * and unlike other schedulers it actually schedules threads to
 	 * those CPUs.
 	 */
-	if ((hlt_cpus_mask & (1 << PCPU_GET(cpuid))) != 0)
+	if (CPU_ISSET(PCPU_GET(cpuid), &hlt_cpus_mask))
 		return;
 #endif
 
 	/* Look up our local APIC structure for the tick counters. */
 	la = &lapics[PCPU_GET(apic_id)];
 	(*la->la_timer_count)++;
 	critical_enter();
 	if (lapic_et.et_active) {
 		td = curthread;
 		td->td_intr_nesting_level++;
 		oldframe = td->td_intr_frame;
 		td->td_intr_frame = frame;
 		lapic_et.et_event_cb(&lapic_et, lapic_et.et_arg);
 		td->td_intr_frame = oldframe;
 		td->td_intr_nesting_level--;
 	}
 	critical_exit();
 }
 
 static void
 lapic_timer_set_divisor(u_int divisor)
 {
 
 	KASSERT(powerof2(divisor), ("lapic: invalid divisor %u", divisor));
 	KASSERT(ffs(divisor) <= sizeof(lapic_timer_divisors) /
 	    sizeof(u_int32_t), ("lapic: invalid divisor %u", divisor));
 	lapic->dcr_timer = lapic_timer_divisors[ffs(divisor) - 1];
 }
 
 static void
 lapic_timer_oneshot(u_int count, int enable_int)
 {
 	u_int32_t value;
 
 	value = lapic->lvt_timer;
 	value &= ~APIC_LVTT_TM;
 	value |= APIC_LVTT_TM_ONE_SHOT;
 	if (enable_int)
 		value &= ~APIC_LVT_M;
 	lapic->lvt_timer = value;
 	lapic->icr_timer = count;
 }
 
 static void
 lapic_timer_periodic(u_int count, int enable_int)
 {
 	u_int32_t value;
 
 	value = lapic->lvt_timer;
 	value &= ~APIC_LVTT_TM;
 	value |= APIC_LVTT_TM_PERIODIC;
 	if (enable_int)
 		value &= ~APIC_LVT_M;
 	lapic->lvt_timer = value;
 	lapic->icr_timer = count;
 }
 
 static void
 lapic_timer_stop(void)
 {
 	u_int32_t value;
 
 	value = lapic->lvt_timer;
 	value &= ~APIC_LVTT_TM;
 	value |= APIC_LVT_M;
 	lapic->lvt_timer = value;
 }
 
 void
 lapic_handle_cmc(void)
 {
 
 	lapic_eoi();
 	cmc_intr();
 }
 
 /*
  * Called from the mca_init() to activate the CMC interrupt if this CPU is
  * responsible for monitoring any MC banks for CMC events.  Since mca_init()
  * is called prior to lapic_setup() during boot, this just needs to unmask
  * this CPU's LVT_CMCI entry.
  */
 void
 lapic_enable_cmc(void)
 {
 	u_int apic_id;
 
 	apic_id = PCPU_GET(apic_id);
 	KASSERT(lapics[apic_id].la_present,
 	    ("%s: missing APIC %u", __func__, apic_id));
 	lapics[apic_id].la_lvts[LVT_CMCI].lvt_masked = 0;
 	lapics[apic_id].la_lvts[LVT_CMCI].lvt_active = 1;
 	if (bootverbose)
 		printf("lapic%u: CMCI unmasked\n", apic_id);
 }
 
 void
 lapic_handle_error(void)
 {
 	u_int32_t esr;
 
 	/*
 	 * Read the contents of the error status register.  Write to
 	 * the register first before reading from it to force the APIC
 	 * to update its value to indicate any errors that have
 	 * occurred since the previous write to the register.
 	 */
 	lapic->esr = 0;
 	esr = lapic->esr;
 
 	printf("CPU%d: local APIC error 0x%x\n", PCPU_GET(cpuid), esr);
 	lapic_eoi();
 }
 
 u_int
 apic_cpuid(u_int apic_id)
 {
 #ifdef SMP
 	return apic_cpuids[apic_id];
 #else
 	return 0;
 #endif
 }
 
 /* Request a free IDT vector to be used by the specified IRQ. */
 u_int
 apic_alloc_vector(u_int apic_id, u_int irq)
 {
 	u_int vector;
 
 	KASSERT(irq < NUM_IO_INTS, ("Invalid IRQ %u", irq));
 
 	/*
 	 * Search for a free vector.  Currently we just use a very simple
 	 * algorithm to find the first free vector.
 	 */
 	mtx_lock_spin(&icu_lock);
 	for (vector = 0; vector < APIC_NUM_IOINTS; vector++) {
 		if (lapics[apic_id].la_ioint_irqs[vector] != -1)
 			continue;
 		lapics[apic_id].la_ioint_irqs[vector] = irq;
 		mtx_unlock_spin(&icu_lock);
 		return (vector + APIC_IO_INTS);
 	}
 	mtx_unlock_spin(&icu_lock);
 	return (0);
 }
 
 /*
  * Request 'count' free contiguous IDT vectors to be used by 'count'
  * IRQs.  'count' must be a power of two and the vectors will be
  * aligned on a boundary of 'align'.  If the request cannot be
  * satisfied, 0 is returned.
  */
 u_int
 apic_alloc_vectors(u_int apic_id, u_int *irqs, u_int count, u_int align)
 {
 	u_int first, run, vector;
 
 	KASSERT(powerof2(count), ("bad count"));
 	KASSERT(powerof2(align), ("bad align"));
 	KASSERT(align >= count, ("align < count"));
 #ifdef INVARIANTS
 	for (run = 0; run < count; run++)
 		KASSERT(irqs[run] < NUM_IO_INTS, ("Invalid IRQ %u at index %u",
 		    irqs[run], run));
 #endif
 
 	/*
 	 * Search for 'count' free vectors.  As with apic_alloc_vector(),
 	 * this just uses a simple first fit algorithm.
 	 */
 	run = 0;
 	first = 0;
 	mtx_lock_spin(&icu_lock);
 	for (vector = 0; vector < APIC_NUM_IOINTS; vector++) {
 
 		/* Vector is in use, end run. */
 		if (lapics[apic_id].la_ioint_irqs[vector] != -1) {
 			run = 0;
 			first = 0;
 			continue;
 		}
 
 		/* Start a new run if run == 0 and vector is aligned. */
 		if (run == 0) {
 			if ((vector & (align - 1)) != 0)
 				continue;
 			first = vector;
 		}
 		run++;
 
 		/* Keep looping if the run isn't long enough yet. */
 		if (run < count)
 			continue;
 
 		/* Found a run, assign IRQs and return the first vector. */
 		for (vector = 0; vector < count; vector++)
 			lapics[apic_id].la_ioint_irqs[first + vector] =
 			    irqs[vector];
 		mtx_unlock_spin(&icu_lock);
 		return (first + APIC_IO_INTS);
 	}
 	mtx_unlock_spin(&icu_lock);
 	printf("APIC: Couldn't find APIC vectors for %u IRQs\n", count);
 	return (0);
 }
 
 /*
  * Enable a vector for a particular apic_id.  Since all lapics share idt
  * entries and ioint_handlers this enables the vector on all lapics.  lapics
  * which do not have the vector configured would report spurious interrupts
  * should it fire.
  */
 void
 apic_enable_vector(u_int apic_id, u_int vector)
 {
 
 	KASSERT(vector != IDT_SYSCALL, ("Attempt to overwrite syscall entry"));
 	KASSERT(ioint_handlers[vector / 32] != NULL,
 	    ("No ISR handler for vector %u", vector));
 #ifdef KDTRACE_HOOKS
 	KASSERT(vector != IDT_DTRACE_RET,
 	    ("Attempt to overwrite DTrace entry"));
 #endif
 	setidt(vector, ioint_handlers[vector / 32], SDT_APIC, SEL_KPL,
 	    GSEL_APIC);
 }
 
 void
 apic_disable_vector(u_int apic_id, u_int vector)
 {
 
 	KASSERT(vector != IDT_SYSCALL, ("Attempt to overwrite syscall entry"));
 #ifdef KDTRACE_HOOKS
 	KASSERT(vector != IDT_DTRACE_RET,
 	    ("Attempt to overwrite DTrace entry"));
 #endif
 	KASSERT(ioint_handlers[vector / 32] != NULL,
 	    ("No ISR handler for vector %u", vector));
 #ifdef notyet
 	/*
 	 * We can not currently clear the idt entry because other cpus
 	 * may have a valid vector at this offset.
 	 */
 	setidt(vector, &IDTVEC(rsvd), SDT_APICT, SEL_KPL, GSEL_APIC);
 #endif
 }
 
 /* Release an APIC vector when it's no longer in use. */
 void
 apic_free_vector(u_int apic_id, u_int vector, u_int irq)
 {
 	struct thread *td;
 
 	KASSERT(vector >= APIC_IO_INTS && vector != IDT_SYSCALL &&
 	    vector <= APIC_IO_INTS + APIC_NUM_IOINTS,
 	    ("Vector %u does not map to an IRQ line", vector));
 	KASSERT(irq < NUM_IO_INTS, ("Invalid IRQ %u", irq));
 	KASSERT(lapics[apic_id].la_ioint_irqs[vector - APIC_IO_INTS] ==
 	    irq, ("IRQ mismatch"));
 #ifdef KDTRACE_HOOKS
 	KASSERT(vector != IDT_DTRACE_RET,
 	    ("Attempt to overwrite DTrace entry"));
 #endif
 
 	/*
 	 * Bind us to the cpu that owned the vector before freeing it so
 	 * we don't lose an interrupt delivery race.
 	 */
 	td = curthread;
 	if (!rebooting) {
 		thread_lock(td);
 		if (sched_is_bound(td))
 			panic("apic_free_vector: Thread already bound.\n");
 		sched_bind(td, apic_cpuid(apic_id));
 		thread_unlock(td);
 	}
 	mtx_lock_spin(&icu_lock);
 	lapics[apic_id].la_ioint_irqs[vector - APIC_IO_INTS] = -1;
 	mtx_unlock_spin(&icu_lock);
 	if (!rebooting) {
 		thread_lock(td);
 		sched_unbind(td);
 		thread_unlock(td);
 	}
 }
 
 /* Map an IDT vector (APIC) to an IRQ (interrupt source). */
 u_int
 apic_idt_to_irq(u_int apic_id, u_int vector)
 {
 	int irq;
 
 	KASSERT(vector >= APIC_IO_INTS && vector != IDT_SYSCALL &&
 	    vector <= APIC_IO_INTS + APIC_NUM_IOINTS,
 	    ("Vector %u does not map to an IRQ line", vector));
 #ifdef KDTRACE_HOOKS
 	KASSERT(vector != IDT_DTRACE_RET,
 	    ("Attempt to overwrite DTrace entry"));
 #endif
 	irq = lapics[apic_id].la_ioint_irqs[vector - APIC_IO_INTS];
 	if (irq < 0)
 		irq = 0;
 	return (irq);
 }
 
 #ifdef DDB
 /*
  * Dump data about APIC IDT vector mappings.
  */
 DB_SHOW_COMMAND(apic, db_show_apic)
 {
 	struct intsrc *isrc;
 	int i, verbose;
 	u_int apic_id;
 	u_int irq;
 
 	if (strcmp(modif, "vv") == 0)
 		verbose = 2;
 	else if (strcmp(modif, "v") == 0)
 		verbose = 1;
 	else
 		verbose = 0;
 	for (apic_id = 0; apic_id <= MAX_APIC_ID; apic_id++) {
 		if (lapics[apic_id].la_present == 0)
 			continue;
 		db_printf("Interrupts bound to lapic %u\n", apic_id);
 		for (i = 0; i < APIC_NUM_IOINTS + 1 && !db_pager_quit; i++) {
 			irq = lapics[apic_id].la_ioint_irqs[i];
 			if (irq == -1 || irq == IRQ_SYSCALL)
 				continue;
 #ifdef KDTRACE_HOOKS
 			if (irq == IRQ_DTRACE_RET)
 				continue;
 #endif
 			db_printf("vec 0x%2x -> ", i + APIC_IO_INTS);
 			if (irq == IRQ_TIMER)
 				db_printf("lapic timer\n");
 			else if (irq < NUM_IO_INTS) {
 				isrc = intr_lookup_source(irq);
 				if (isrc == NULL || verbose == 0)
 					db_printf("IRQ %u\n", irq);
 				else
 					db_dump_intr_event(isrc->is_event,
 					    verbose == 2);
 			} else
 				db_printf("IRQ %u ???\n", irq);
 		}
 	}
 }
 
 static void
 dump_mask(const char *prefix, uint32_t v, int base)
 {
 	int i, first;
 
 	first = 1;
 	for (i = 0; i < 32; i++)
 		if (v & (1 << i)) {
 			if (first) {
 				db_printf("%s:", prefix);
 				first = 0;
 			}
 			db_printf(" %02x", base + i);
 		}
 	if (!first)
 		db_printf("\n");
 }
 
 /* Show info from the lapic regs for this CPU. */
 DB_SHOW_COMMAND(lapic, db_show_lapic)
 {
 	uint32_t v;
 
 	db_printf("lapic ID = %d\n", lapic_id());
 	v = lapic->version;
 	db_printf("version  = %d.%d\n", (v & APIC_VER_VERSION) >> 4,
 	    v & 0xf);
 	db_printf("max LVT  = %d\n", (v & APIC_VER_MAXLVT) >> MAXLVTSHIFT);
 	v = lapic->svr;
 	db_printf("SVR      = %02x (%s)\n", v & APIC_SVR_VECTOR,
 	    v & APIC_SVR_ENABLE ? "enabled" : "disabled");
 	db_printf("TPR      = %02x\n", lapic->tpr);
 
 #define dump_field(prefix, index)					\
 	dump_mask(__XSTRING(prefix ## index), lapic->prefix ## index,	\
 	    index * 32)
 
 	db_printf("In-service Interrupts:\n");
 	dump_field(isr, 0);
 	dump_field(isr, 1);
 	dump_field(isr, 2);
 	dump_field(isr, 3);
 	dump_field(isr, 4);
 	dump_field(isr, 5);
 	dump_field(isr, 6);
 	dump_field(isr, 7);
 
 	db_printf("TMR Interrupts:\n");
 	dump_field(tmr, 0);
 	dump_field(tmr, 1);
 	dump_field(tmr, 2);
 	dump_field(tmr, 3);
 	dump_field(tmr, 4);
 	dump_field(tmr, 5);
 	dump_field(tmr, 6);
 	dump_field(tmr, 7);
 
 	db_printf("IRR Interrupts:\n");
 	dump_field(irr, 0);
 	dump_field(irr, 1);
 	dump_field(irr, 2);
 	dump_field(irr, 3);
 	dump_field(irr, 4);
 	dump_field(irr, 5);
 	dump_field(irr, 6);
 	dump_field(irr, 7);
 
 #undef dump_field
 }
 #endif
 
 /*
  * APIC probing support code.  This includes code to manage enumerators.
  */
 
 static SLIST_HEAD(, apic_enumerator) enumerators =
 	SLIST_HEAD_INITIALIZER(enumerators);
 static struct apic_enumerator *best_enum;
 
 void
 apic_register_enumerator(struct apic_enumerator *enumerator)
 {
 #ifdef INVARIANTS
 	struct apic_enumerator *apic_enum;
 
 	SLIST_FOREACH(apic_enum, &enumerators, apic_next) {
 		if (apic_enum == enumerator)
 			panic("%s: Duplicate register of %s", __func__,
 			    enumerator->apic_name);
 	}
 #endif
 	SLIST_INSERT_HEAD(&enumerators, enumerator, apic_next);
 }
 
 /*
  * We have to look for CPU's very, very early because certain subsystems
  * want to know how many CPU's we have extremely early on in the boot
  * process.
  */
 static void
 apic_init(void *dummy __unused)
 {
 	struct apic_enumerator *enumerator;
 #ifndef __amd64__
 	uint64_t apic_base;
 #endif
 	int retval, best;
 
 	/* We only support built in local APICs. */
 	if (!(cpu_feature & CPUID_APIC))
 		return;
 
 	/* Don't probe if APIC mode is disabled. */
 	if (resource_disabled("apic", 0))
 		return;
 
 	/* Probe all the enumerators to find the best match. */
 	best_enum = NULL;
 	best = 0;
 	SLIST_FOREACH(enumerator, &enumerators, apic_next) {
 		retval = enumerator->apic_probe();
 		if (retval > 0)
 			continue;
 		if (best_enum == NULL || best < retval) {
 			best_enum = enumerator;
 			best = retval;
 		}
 	}
 	if (best_enum == NULL) {
 		if (bootverbose)
 			printf("APIC: Could not find any APICs.\n");
 		return;
 	}
 
 	if (bootverbose)
 		printf("APIC: Using the %s enumerator.\n",
 		    best_enum->apic_name);
 
 #ifndef __amd64__
 	/*
 	 * To work around an errata, we disable the local APIC on some
 	 * CPUs during early startup.  We need to turn the local APIC back
 	 * on on such CPUs now.
 	 */
 	if (cpu == CPU_686 && cpu_vendor_id == CPU_VENDOR_INTEL &&
 	    (cpu_id & 0xff0) == 0x610) {
 		apic_base = rdmsr(MSR_APICBASE);
 		apic_base |= APICBASE_ENABLED;
 		wrmsr(MSR_APICBASE, apic_base);
 	}
 #endif
 
 	/* Probe the CPU's in the system. */
 	retval = best_enum->apic_probe_cpus();
 	if (retval != 0)
 		printf("%s: Failed to probe CPUs: returned %d\n",
 		    best_enum->apic_name, retval);
 
 }
 SYSINIT(apic_init, SI_SUB_TUNABLES - 1, SI_ORDER_SECOND, apic_init, NULL);
 
 /*
  * Setup the local APIC.  We have to do this prior to starting up the APs
  * in the SMP case.
  */
 static void
 apic_setup_local(void *dummy __unused)
 {
 	int retval;
  
 	if (best_enum == NULL)
 		return;
 
 	/* Initialize the local APIC. */
 	retval = best_enum->apic_setup_local();
 	if (retval != 0)
 		printf("%s: Failed to setup the local APIC: returned %d\n",
 		    best_enum->apic_name, retval);
 }
 SYSINIT(apic_setup_local, SI_SUB_CPU, SI_ORDER_SECOND, apic_setup_local, NULL);
 
 /*
  * Setup the I/O APICs.
  */
 static void
 apic_setup_io(void *dummy __unused)
 {
 	int retval;
 
 	if (best_enum == NULL)
 		return;
 	retval = best_enum->apic_setup_io();
 	if (retval != 0)
 		printf("%s: Failed to setup I/O APICs: returned %d\n",
 		    best_enum->apic_name, retval);
 
 #ifdef XEN
 	return;
 #endif
 	/*
 	 * Finish setting up the local APIC on the BSP once we know how to
 	 * properly program the LINT pins.
 	 */
 	lapic_setup(1);
 	intr_register_pic(&lapic_pic);
 	if (bootverbose)
 		lapic_dump("BSP");
 
 	/* Enable the MSI "pic". */
 	msi_init();
 }
 SYSINIT(apic_setup_io, SI_SUB_INTR, SI_ORDER_SECOND, apic_setup_io, NULL);
 
 #ifdef SMP
 /*
  * Inter Processor Interrupt functions.  The lapic_ipi_*() functions are
  * private to the MD code.  The public interface for the rest of the
  * kernel is defined in mp_machdep.c.
  */
 int
 lapic_ipi_wait(int delay)
 {
 	int x, incr;
 
 	/*
 	 * Wait delay loops for IPI to be sent.  This is highly bogus
 	 * since this is sensitive to CPU clock speed.  If delay is
 	 * -1, we wait forever.
 	 */
 	if (delay == -1) {
 		incr = 0;
 		delay = 1;
 	} else
 		incr = 1;
 	for (x = 0; x < delay; x += incr) {
 		if ((lapic->icr_lo & APIC_DELSTAT_MASK) == APIC_DELSTAT_IDLE)
 			return (1);
 		ia32_pause();
 	}
 	return (0);
 }
 
 void
 lapic_ipi_raw(register_t icrlo, u_int dest)
 {
 	register_t value, saveintr;
 
 	/* XXX: Need more sanity checking of icrlo? */
 	KASSERT(lapic != NULL, ("%s called too early", __func__));
 	KASSERT((dest & ~(APIC_ID_MASK >> APIC_ID_SHIFT)) == 0,
 	    ("%s: invalid dest field", __func__));
 	KASSERT((icrlo & APIC_ICRLO_RESV_MASK) == 0,
 	    ("%s: reserved bits set in ICR LO register", __func__));
 
 	/* Set destination in ICR HI register if it is being used. */
 	saveintr = intr_disable();
 	if ((icrlo & APIC_DEST_MASK) == APIC_DEST_DESTFLD) {
 		value = lapic->icr_hi;
 		value &= ~APIC_ID_MASK;
 		value |= dest << APIC_ID_SHIFT;
 		lapic->icr_hi = value;
 	}
 
 	/* Program the contents of the IPI and dispatch it. */
 	value = lapic->icr_lo;
 	value &= APIC_ICRLO_RESV_MASK;
 	value |= icrlo;
 	lapic->icr_lo = value;
 	intr_restore(saveintr);
 }
 
 #define	BEFORE_SPIN	1000000
 #ifdef DETECT_DEADLOCK
 #define	AFTER_SPIN	1000
 #endif
 
 void
 lapic_ipi_vectored(u_int vector, int dest)
 {
 	register_t icrlo, destfield;
 
 	KASSERT((vector & ~APIC_VECTOR_MASK) == 0,
 	    ("%s: invalid vector %d", __func__, vector));
 
 	icrlo = APIC_DESTMODE_PHY | APIC_TRIGMOD_EDGE;
 
 	/*
 	 * IPI_STOP_HARD is just a "fake" vector used to send a NMI.
 	 * Use special rules regarding NMI if passed, otherwise specify
 	 * the vector.
 	 */
 	if (vector == IPI_STOP_HARD)
 		icrlo |= APIC_DELMODE_NMI | APIC_LEVEL_ASSERT;
 	else
 		icrlo |= vector | APIC_DELMODE_FIXED | APIC_LEVEL_DEASSERT;
 	destfield = 0;
 	switch (dest) {
 	case APIC_IPI_DEST_SELF:
 		icrlo |= APIC_DEST_SELF;
 		break;
 	case APIC_IPI_DEST_ALL:
 		icrlo |= APIC_DEST_ALLISELF;
 		break;
 	case APIC_IPI_DEST_OTHERS:
 		icrlo |= APIC_DEST_ALLESELF;
 		break;
 	default:
 		KASSERT((dest & ~(APIC_ID_MASK >> APIC_ID_SHIFT)) == 0,
 		    ("%s: invalid destination 0x%x", __func__, dest));
 		destfield = dest;
 	}
 
 	/* Wait for an earlier IPI to finish. */
 	if (!lapic_ipi_wait(BEFORE_SPIN)) {
 		if (panicstr != NULL)
 			return;
 		else
 			panic("APIC: Previous IPI is stuck");
 	}
 
 	lapic_ipi_raw(icrlo, destfield);
 
 #ifdef DETECT_DEADLOCK
 	/* Wait for IPI to be delivered. */
 	if (!lapic_ipi_wait(AFTER_SPIN)) {
 #ifdef needsattention
 		/*
 		 * XXX FIXME:
 		 *
 		 * The above function waits for the message to actually be
 		 * delivered.  It breaks out after an arbitrary timeout
 		 * since the message should eventually be delivered (at
 		 * least in theory) and that if it wasn't we would catch
 		 * the failure with the check above when the next IPI is
 		 * sent.
 		 *
 		 * We could skip this wait entirely, EXCEPT it probably
 		 * protects us from other routines that assume that the
 		 * message was delivered and acted upon when this function
 		 * returns.
 		 */
 		printf("APIC: IPI might be stuck\n");
 #else /* !needsattention */
 		/* Wait until message is sent without a timeout. */
 		while (lapic->icr_lo & APIC_DELSTAT_PEND)
 			ia32_pause();
 #endif /* needsattention */
 	}
 #endif /* DETECT_DEADLOCK */
 }
 #endif /* SMP */
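Reviewer note on apic_alloc_vectors() in the file above: the aligned first-fit search it performs can be sketched in isolation as plain userland C. This is only an illustration under stated assumptions, not the kernel code. The names find_aligned_run and SLOT_FREE are hypothetical, there is no locking, and failure is reported as -1 instead of the 0 the kernel routine returns.

```c
#include <assert.h>
#include <stddef.h>

#define	SLOT_FREE	(-1)	/* hypothetical marker for an unused slot */

/*
 * Hypothetical helper (not part of the kernel source): find 'count'
 * contiguous free slots whose first index is a multiple of 'align',
 * store ids[0..count-1] into them, and return the first index.
 * Returns -1 if no such run exists.
 */
int
find_aligned_run(int *slots, size_t nslots, const int *ids, size_t count,
    size_t align)
{
	size_t first, i, run;

	/* Same preconditions the kernel checks with KASSERT(). */
	assert(count != 0 && (count & (count - 1)) == 0);
	assert(align != 0 && (align & (align - 1)) == 0);
	assert(align >= count);

	run = 0;
	first = 0;
	for (i = 0; i < nslots; i++) {
		/* Slot is in use, so any run in progress ends here. */
		if (slots[i] != SLOT_FREE) {
			run = 0;
			continue;
		}

		/* A new run may only start on an aligned index. */
		if (run == 0) {
			if ((i & (align - 1)) != 0)
				continue;
			first = i;
		}
		run++;

		/* Keep scanning until the run is long enough. */
		if (run < count)
			continue;

		/* Found a run: claim the slots and report where it starts. */
		for (i = 0; i < count; i++)
			slots[first + i] = ids[i];
		return ((int)first);
	}
	return (-1);
}
```

Because align is a power of two, the test (i & (align - 1)) != 0 rejects every index that is not a multiple of align, so a run can only begin on an aligned boundary. The kernel version additionally holds icu_lock around the scan and adds APIC_IO_INTS to the index it returns.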
Index: head/sys
===================================================================
--- head/sys	(revision 222812)
+++ head/sys	(revision 222813)

Property changes on: head/sys
___________________________________________________________________
Modified: svn:mergeinfo
## -0,0 +0,1 ##
   Merged /projects/largeSMP/sys:r221273-222812
Index: head/tools/build/options/WITHOUT_GPIO
===================================================================
--- head/tools/build/options/WITHOUT_GPIO	(revision 222812)
+++ head/tools/build/options/WITHOUT_GPIO	(revision 222813)

Property changes on: head/tools/build/options/WITHOUT_GPIO
___________________________________________________________________
Deleted: svn:keywords
## -1 +0,0 ##
-FreeBSD=%H
\ No newline at end of property
Index: head/tools/build/options/WITH_OFED
===================================================================
--- head/tools/build/options/WITH_OFED	(revision 222812)
+++ head/tools/build/options/WITH_OFED	(revision 222813)

Property changes on: head/tools/build/options/WITH_OFED
___________________________________________________________________
Deleted: svn:keywords
## -1 +0,0 ##
-FreeBSD=%H
\ No newline at end of property
Index: head/tools/regression/bin/sh/builtins/set1.0
===================================================================
--- head/tools/regression/bin/sh/builtins/set1.0	(revision 222812)
+++ head/tools/regression/bin/sh/builtins/set1.0	(revision 222813)

Property changes on: head/tools/regression/bin/sh/builtins/set1.0
___________________________________________________________________
Deleted: svn:keywords
## -1 +0,0 ##
-FreeBSD=%H
\ No newline at end of property
Index: head/tools/regression/bin/sh/parser/dollar-quote1.0
===================================================================
--- head/tools/regression/bin/sh/parser/dollar-quote1.0	(revision 222812)
+++ head/tools/regression/bin/sh/parser/dollar-quote1.0	(revision 222813)

Property changes on: head/tools/regression/bin/sh/parser/dollar-quote1.0
___________________________________________________________________
Deleted: svn:keywords
## -1 +0,0 ##
-FreeBSD=%H
\ No newline at end of property
Index: head/tools/regression/bin/sh/parser/dollar-quote2.0
===================================================================
--- head/tools/regression/bin/sh/parser/dollar-quote2.0	(revision 222812)
+++ head/tools/regression/bin/sh/parser/dollar-quote2.0	(revision 222813)

Property changes on: head/tools/regression/bin/sh/parser/dollar-quote2.0
___________________________________________________________________
Deleted: svn:keywords
## -1 +0,0 ##
-FreeBSD=%H
\ No newline at end of property
Index: head/tools/regression/bin/sh/parser/dollar-quote3.0
===================================================================
--- head/tools/regression/bin/sh/parser/dollar-quote3.0	(revision 222812)
+++ head/tools/regression/bin/sh/parser/dollar-quote3.0	(revision 222813)

Property changes on: head/tools/regression/bin/sh/parser/dollar-quote3.0
___________________________________________________________________
Deleted: svn:keywords
## -1 +0,0 ##
-FreeBSD=%H
\ No newline at end of property
Index: head/tools/regression/bin/sh/parser/dollar-quote4.0
===================================================================
--- head/tools/regression/bin/sh/parser/dollar-quote4.0	(revision 222812)
+++ head/tools/regression/bin/sh/parser/dollar-quote4.0	(revision 222813)

Property changes on: head/tools/regression/bin/sh/parser/dollar-quote4.0
___________________________________________________________________
Deleted: svn:keywords
## -1 +0,0 ##
-FreeBSD=%H
\ No newline at end of property
Index: head/tools/regression/bin/sh/parser/dollar-quote5.0
===================================================================
--- head/tools/regression/bin/sh/parser/dollar-quote5.0	(revision 222812)
+++ head/tools/regression/bin/sh/parser/dollar-quote5.0	(revision 222813)

Property changes on: head/tools/regression/bin/sh/parser/dollar-quote5.0
___________________________________________________________________
Deleted: svn:keywords
## -1 +0,0 ##
-FreeBSD=%H
\ No newline at end of property
Index: head/tools/regression/bin/sh/parser/dollar-quote6.0
===================================================================
--- head/tools/regression/bin/sh/parser/dollar-quote6.0	(revision 222812)
+++ head/tools/regression/bin/sh/parser/dollar-quote6.0	(revision 222813)

Property changes on: head/tools/regression/bin/sh/parser/dollar-quote6.0
___________________________________________________________________
Deleted: svn:keywords
## -1 +0,0 ##
-FreeBSD=%H
\ No newline at end of property
Index: head/tools/regression/bin/sh/parser/dollar-quote7.0
===================================================================
--- head/tools/regression/bin/sh/parser/dollar-quote7.0	(revision 222812)
+++ head/tools/regression/bin/sh/parser/dollar-quote7.0	(revision 222813)

Property changes on: head/tools/regression/bin/sh/parser/dollar-quote7.0
___________________________________________________________________
Deleted: svn:keywords
## -1 +0,0 ##
-FreeBSD=%H
\ No newline at end of property
Index: head/tools/regression/bin/sh/parser/dollar-quote8.0
===================================================================
--- head/tools/regression/bin/sh/parser/dollar-quote8.0	(revision 222812)
+++ head/tools/regression/bin/sh/parser/dollar-quote8.0	(revision 222813)

Property changes on: head/tools/regression/bin/sh/parser/dollar-quote8.0
___________________________________________________________________
Deleted: svn:keywords
## -1 +0,0 ##
-FreeBSD=%H
\ No newline at end of property
Index: head/tools/regression/bin/sh/parser/dollar-quote9.0
===================================================================
--- head/tools/regression/bin/sh/parser/dollar-quote9.0	(revision 222812)
+++ head/tools/regression/bin/sh/parser/dollar-quote9.0	(revision 222813)

Property changes on: head/tools/regression/bin/sh/parser/dollar-quote9.0
___________________________________________________________________
Deleted: svn:keywords
## -1 +0,0 ##
-FreeBSD=%H
\ No newline at end of property
Index: head/usr.bin/calendar
===================================================================
--- head/usr.bin/calendar	(revision 222812)
+++ head/usr.bin/calendar	(revision 222813)

Property changes on: head/usr.bin/calendar
___________________________________________________________________
Modified: svn:mergeinfo
## -0,0 +0,1 ##
   Merged /projects/largeSMP/usr.bin/calendar:r221273-222812
Index: head/usr.bin/csup
===================================================================
--- head/usr.bin/csup	(revision 222812)
+++ head/usr.bin/csup	(revision 222813)

Property changes on: head/usr.bin/csup
___________________________________________________________________
Modified: svn:mergeinfo
## -0,0 +0,1 ##
   Merged /projects/largeSMP/usr.bin/csup:r221273-222812
Index: head/usr.bin/procstat
===================================================================
--- head/usr.bin/procstat	(revision 222812)
+++ head/usr.bin/procstat	(revision 222813)

Property changes on: head/usr.bin/procstat
___________________________________________________________________
Modified: svn:mergeinfo
## -0,0 +0,1 ##
   Merged /projects/largeSMP/usr.bin/procstat:r221273-222812
Index: head/usr.sbin/ndiscvt
===================================================================
--- head/usr.sbin/ndiscvt	(revision 222812)
+++ head/usr.sbin/ndiscvt	(revision 222813)

Property changes on: head/usr.sbin/ndiscvt
___________________________________________________________________
Modified: svn:mergeinfo
## -0,0 +0,1 ##
   Merged /projects/largeSMP/usr.sbin/ndiscvt:r221273-222812
Index: head/usr.sbin/pmccontrol/pmccontrol.c
===================================================================
--- head/usr.sbin/pmccontrol/pmccontrol.c	(revision 222812)
+++ head/usr.sbin/pmccontrol/pmccontrol.c	(revision 222813)
@@ -1,501 +1,508 @@
 /*-
  * Copyright (c) 2003,2004 Joseph Koshy
  * All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
  * 1. Redistributions of source code must retain the above copyright
  *    notice, this list of conditions and the following disclaimer.
  * 2. Redistributions in binary form must reproduce the above copyright
  *    notice, this list of conditions and the following disclaimer in the
  *    documentation and/or other materials provided with the distribution.
  *
  * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
  * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
  * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
  * ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
  * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
  * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
  * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
  * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
  * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
  * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
  * SUCH DAMAGE.
  *
  */
 
 #include <sys/cdefs.h>
 __FBSDID("$FreeBSD$");
 
-#include <sys/types.h>
+#include <sys/param.h>
 #include <sys/queue.h>
+#include <sys/cpuset.h>
 #include <sys/sysctl.h>
 
 #include <assert.h>
 #include <err.h>
 #include <errno.h>
 #include <fcntl.h>
 #include <limits.h>
 #include <pmc.h>
 #include <stdarg.h>
 #include <stdio.h>
 #include <stdlib.h>
 #include <string.h>
 #include <sysexits.h>
 #include <unistd.h>
 
 /* Compile time defaults */
 
 #define	PMCC_PRINT_USAGE	0
 #define	PMCC_PRINT_EVENTS	1
 #define	PMCC_LIST_STATE 	2
 #define	PMCC_ENABLE_DISABLE	3
 #define	PMCC_SHOW_STATISTICS	4
 
 #define	PMCC_CPU_ALL		-1
 #define	PMCC_CPU_WILDCARD	'*'
 
 #define	PMCC_PMC_ALL		-1
 #define	PMCC_PMC_WILDCARD	'*'
 
 #define	PMCC_OP_IGNORE		0
 #define	PMCC_OP_DISABLE		1
 #define	PMCC_OP_ENABLE		2
 
 #define	PMCC_PROGRAM_NAME	"pmccontrol"
 
 STAILQ_HEAD(pmcc_op_list, pmcc_op) head = STAILQ_HEAD_INITIALIZER(head);
 
 struct pmcc_op {
 	char	op_cpu;
 	char	op_pmc;
 	char	op_op;
 	STAILQ_ENTRY(pmcc_op) op_next;
 };
 
 /* Function Prototypes */
 #if	DEBUG
 static void	pmcc_init_debug(void);
 #endif
 
 static int	pmcc_do_list_state(void);
 static int	pmcc_do_enable_disable(struct pmcc_op_list *);
 static int	pmcc_do_list_events(void);
 
 /* Globals */
 
 static char usage_message[] =
 	"Usage:\n"
 	"       " PMCC_PROGRAM_NAME " -L\n"
 	"       " PMCC_PROGRAM_NAME " -l\n"
 	"       " PMCC_PROGRAM_NAME " -s\n"
 	"       " PMCC_PROGRAM_NAME " [-e pmc | -d pmc | -c cpu] ...";
 
 #if DEBUG
 FILE *debug_stream = NULL;
 #endif
 
 #if DEBUG
 #define DEBUG_MSG(...)					                \
 	(void) fprintf(debug_stream, "[pmccontrol] " __VA_ARGS__);
 #else
 #define DEBUG_MSG(m)		/*  */
 #endif /* !DEBUG */
 
 int pmc_syscall = -1;
 
 #define PMC_CALL(cmd, params)						\
 if ((error = syscall(pmc_syscall, PMC_OP_##cmd, (params))) != 0)	\
 {									\
 	DEBUG_MSG("ERROR: syscall [" #cmd "]");				\
 	exit(EX_OSERR);							\
 }
 
 #if DEBUG
 /* log debug messages to a separate file */
 static void
 pmcc_init_debug(void)
 {
 	char *fn;
 
 	fn = getenv("PMCCONTROL_DEBUG");
 	if (fn != NULL)
 	{
 		debug_stream = fopen(fn, "w");
 		if (debug_stream == NULL)
 			debug_stream = stderr;
 	} else
 		debug_stream = stderr;
 }
 #endif
 
 static int
 pmcc_do_enable_disable(struct pmcc_op_list *op_list)
 {
+	long cpusetsize;
 	int c, error, i, j, ncpu, npmc, t;
-	cpumask_t haltedcpus, cpumask;
+	cpuset_t haltedcpus, cpumask;
 	struct pmcc_op *np;
 	unsigned char *map;
 	unsigned char op;
 	int cpu, pmc;
-	size_t dummy;
+	size_t setsize;
 
 	if ((ncpu = pmc_ncpu()) < 0)
 		err(EX_OSERR, "Unable to determine the number of cpus");
 
 	/* Determine the set of active CPUs. */
-	cpumask = (1 << ncpu) - 1;
-	dummy = sizeof(int);
-	haltedcpus = (cpumask_t) 0;
+	cpusetsize = sysconf(_SC_CPUSET_SIZE);
+	if (cpusetsize == -1 || (u_long)cpusetsize > sizeof(cpuset_t)) {
+		err(EX_OSERR, "ERROR: Cannot determine which CPUs are "
+		    "halted");
+	}
+	CPU_ZERO(&haltedcpus);
+	setsize = (size_t)cpusetsize;
 	if (ncpu > 1 && sysctlbyname("machdep.hlt_cpus", &haltedcpus,
-	    &dummy, NULL, 0) < 0)
+	    &setsize, NULL, 0) < 0)
 		err(EX_OSERR, "ERROR: Cannot determine which CPUs are "
 		    "halted");
-	cpumask &= ~haltedcpus;
+	CPU_FILL(&cpumask);
+	CPU_NAND(&cpumask, &haltedcpus);
 
 	/* Determine the maximum number of PMCs in any CPU. */
 	npmc = 0;
 	for (c = 0; c < ncpu; c++) {
 		if ((t = pmc_npmc(c)) < 0)
 			err(EX_OSERR, "Unable to determine the number of "
 			    "PMCs in CPU %d", c);
 		npmc = t > npmc ? t : npmc;
 	}
 
 	if (npmc == 0)
 		errx(EX_CONFIG, "No PMCs found");
 
 	if ((map = malloc(npmc * ncpu)) == NULL)
 		err(EX_SOFTWARE, "Out of memory");
 
 	(void) memset(map, PMCC_OP_IGNORE, npmc*ncpu);
 
 	error = 0;
 	STAILQ_FOREACH(np, op_list, op_next) {
 
 		cpu = np->op_cpu;
 		pmc = np->op_pmc;
 		op  = np->op_op;
 
 		if (cpu >= ncpu)
 			errx(EX_DATAERR, "CPU id too large: \"%d\"", cpu);
 
 		if (pmc >= npmc)
 			errx(EX_DATAERR, "PMC id too large: \"%d\"", pmc);
 
 #define MARKMAP(M,C,P,V)	do {				\
 		*((M) + (C)*npmc + (P)) = (V);			\
 } while (0)
 
 #define	SET_PMCS(C,P,V)		do {				\
 		if ((P) == PMCC_PMC_ALL) {			\
 			for (j = 0; j < npmc; j++)		\
 				MARKMAP(map, (C), j, (V));	\
 		} else						\
 			MARKMAP(map, (C), (P), (V));		\
 } while (0)
 
 #define MAP(M,C,P)	(*((M) + (C)*npmc + (P)))
 
 		if (cpu == PMCC_CPU_ALL)
 			for (i = 0; i < ncpu; i++) {
-				if ((1 << i) & cpumask)
+				if (CPU_ISSET(i, &cpumask))
 					SET_PMCS(i, pmc, op);
 			}
 		else
 			SET_PMCS(cpu, pmc, op);
 	}
 
 	/* Configure PMCS */
 	for (i = 0; i < ncpu; i++)
 		for (j = 0; j < npmc; j++) {
 			unsigned char b;
 
 			b = MAP(map, i, j);
 
 			error = 0;
 
 			if (b == PMCC_OP_ENABLE)
 				error = pmc_enable(i, j);
 			else if (b == PMCC_OP_DISABLE)
 				error = pmc_disable(i, j);
 
 			if (error < 0)
 				err(EX_OSERR, "%s of PMC %d on CPU %d failed",
 				    b == PMCC_OP_ENABLE ? "Enable" :
 				    "Disable", j, i);
 		}
 
 	return error;
 }
 
 static int
 pmcc_do_list_state(void)
 {
 	size_t dummy;
 	int c, cpu, n, npmc, ncpu;
 	unsigned int logical_cpus_mask;
 	struct pmc_info *pd;
 	struct pmc_pmcinfo *pi;
 	const struct pmc_cpuinfo *pc;
 
 	if (pmc_cpuinfo(&pc) != 0)
 		err(EX_OSERR, "Unable to determine CPU information");
 
 	printf("%d %s CPUs present, with %d PMCs per CPU\n", pc->pm_ncpu, 
 	       pmc_name_of_cputype(pc->pm_cputype),
 		pc->pm_npmc);
 
 	dummy = sizeof(logical_cpus_mask);
 	if (sysctlbyname("machdep.logical_cpus_mask", &logical_cpus_mask,
 		&dummy, NULL, 0) < 0)
 		logical_cpus_mask = 0;
 
 	ncpu = pc->pm_ncpu;
 
 	for (c = cpu = 0; cpu < ncpu; cpu++) {
 #if	defined(__i386__) || defined(__amd64__)
 		if (pc->pm_cputype == PMC_CPU_INTEL_PIV &&
 		    (logical_cpus_mask & (1 << cpu)))
 			continue; /* skip P4-style 'logical' cpus */
 #endif
 		if (pmc_pmcinfo(cpu, &pi) < 0) {
 			if (errno == ENXIO)
 				continue;
 			err(EX_OSERR, "Unable to get PMC status for CPU %d",
 			    cpu);
 		}
 
 		printf("#CPU %d:\n", c++);
 		npmc = pmc_npmc(cpu);
 		printf("#N  NAME             CLASS  STATE    ROW-DISP\n");
 
 		for (n = 0; n < npmc; n++) {
 			pd = &pi->pm_pmcs[n];
 
 			printf(" %-2d %-16s %-6s %-8s %-10s",
 			    n,
 			    pd->pm_name,
 			    pmc_name_of_class(pd->pm_class),
 			    pd->pm_enabled ? "ENABLED" : "DISABLED",
 			    pmc_name_of_disposition(pd->pm_rowdisp));
 
 			if (pd->pm_ownerpid != -1) {
 			        printf(" (pid %d)", pd->pm_ownerpid);
 				printf(" %-32s",
 				    pmc_name_of_event(pd->pm_event));
 				if (PMC_IS_SAMPLING_MODE(pd->pm_mode))
 					printf(" (reload count %jd)",
 					    pd->pm_reloadcount);
 			}
 			printf("\n");
 		}
 		free(pi);
 	}
 	return 0;
 }
 
 static int
 pmcc_do_list_events(void)
 {
 	enum pmc_class c;
 	unsigned int i, j, nevents;
 	const char **eventnamelist;
 	const struct pmc_cpuinfo *ci;
 
 	if (pmc_cpuinfo(&ci) != 0)
 		err(EX_OSERR, "Unable to determine CPU information");
 
 	eventnamelist = NULL;
 
 	for (i = 0; i < ci->pm_nclass; i++) {
 		c = ci->pm_classes[i].pm_class;
 
 		printf("%s\n", pmc_name_of_class(c));
 		if (pmc_event_names_of_class(c, &eventnamelist, &nevents) < 0)
 			err(EX_OSERR, "ERROR: Cannot find information for "
 			    "event class \"%s\"", pmc_name_of_class(c));
 
 		for (j = 0; j < nevents; j++)
 			printf("\t%s\n", eventnamelist[j]);
 
 		free(eventnamelist);
 	}
 	return 0;
 }
 
 static int
 pmcc_show_statistics(void)
 {
 
 	struct pmc_driverstats gms;
 
 	if (pmc_get_driver_stats(&gms) < 0)
 		err(EX_OSERR, "ERROR: cannot retrieve driver statistics");
 
 	/*
 	 * Print statistics.
 	 */
 
 #define	PRINT(N,V)	(void) printf("%-40s %d\n", (N), gms.pm_##V)
 	PRINT("interrupts processed:", intr_processed);
 	PRINT("non-PMC interrupts:", intr_ignored);
 	PRINT("sampling stalls due to space shortages:", intr_bufferfull);
 	PRINT("system calls:", syscalls);
 	PRINT("system calls with errors:", syscall_errors);
 	PRINT("buffer requests:", buffer_requests);
 	PRINT("buffer requests failed:", buffer_requests_failed);
 	PRINT("sampling log sweeps:", log_sweeps);
 
 	return 0;
 }
 
 /*
  * Main
  */
 
 int
 main(int argc, char **argv)
 {
 	int error, command, currentcpu, option, pmc;
 	char *dummy;
 	struct pmcc_op *p;
 
 #if DEBUG
 	pmcc_init_debug();
 #endif
 
 	/* parse args */
 
 	currentcpu = PMCC_CPU_ALL;
 	command    = PMCC_PRINT_USAGE;
 	error      = 0;
 
 	STAILQ_INIT(&head);
 
 	while ((option = getopt(argc, argv, ":c:d:e:lLs")) != -1)
 		switch (option) {
 		case 'L':
 			if (command != PMCC_PRINT_USAGE) {
 				error = 1;
 				break;
 			}
 			command = PMCC_PRINT_EVENTS;
 			break;
 
 		case 'c':
 			if (command != PMCC_PRINT_USAGE &&
 			    command != PMCC_ENABLE_DISABLE) {
 				error = 1;
 				break;
 			}
 			command = PMCC_ENABLE_DISABLE;
 
 			if (*optarg == PMCC_CPU_WILDCARD)
 				currentcpu = PMCC_CPU_ALL;
 			else {
 				currentcpu = strtoul(optarg, &dummy, 0);
 				if (*dummy != '\0' || currentcpu < 0)
 					errx(EX_DATAERR,
 					    "\"%s\" is not a valid CPU id",
 					    optarg);
 			}
 			break;
 
 		case 'd':
 		case 'e':
 			if (command != PMCC_PRINT_USAGE &&
 			    command != PMCC_ENABLE_DISABLE) {
 				error = 1;
 				break;
 			}
 			command = PMCC_ENABLE_DISABLE;
 
 			if (*optarg == PMCC_PMC_WILDCARD)
 				pmc = PMCC_PMC_ALL;
 			else {
 				pmc = strtoul(optarg, &dummy, 0);
 				if (*dummy != '\0' || pmc < 0)
 					errx(EX_DATAERR,
 					    "\"%s\" is not a valid PMC id",
 					    optarg);
 			}
 
 			if ((p = malloc(sizeof(*p))) == NULL)
 				err(EX_SOFTWARE, "Out of memory");
 
 			p->op_cpu = currentcpu;
 			p->op_pmc = pmc;
 			p->op_op  = option == 'd' ? PMCC_OP_DISABLE :
 			    PMCC_OP_ENABLE;
 
 			STAILQ_INSERT_TAIL(&head, p, op_next);
 			break;
 
 		case 'l':
 			if (command != PMCC_PRINT_USAGE) {
 				error = 1;
 				break;
 			}
 			command = PMCC_LIST_STATE;
 			break;
 
 		case 's':
 			if (command != PMCC_PRINT_USAGE) {
 				error = 1;
 				break;
 			}
 			command = PMCC_SHOW_STATISTICS;
 			break;
 
 		case ':':
 			errx(EX_USAGE,
 			    "Missing argument to option '-%c'", optopt);
 			break;
 
 		case '?':
 			warnx("Unrecognized option \"-%c\"", optopt);
 			errx(EX_USAGE, usage_message);
 			break;
 
 		default:
 			error = 1;
 			break;
 
 		}
 
 	if (command == PMCC_PRINT_USAGE)
 		(void) errx(EX_USAGE, usage_message);
 
 	if (error)
 		exit(EX_USAGE);
 
 	if (pmc_init() < 0)
 		err(EX_UNAVAILABLE,
 		    "Initialization of the pmc(3) library failed");
 
 	switch (command) {
 	case PMCC_LIST_STATE:
 		error = pmcc_do_list_state();
 		break;
 	case PMCC_PRINT_EVENTS:
 		error = pmcc_do_list_events();
 		break;
 	case PMCC_SHOW_STATISTICS:
 		error = pmcc_show_statistics();
 		break;
 	case PMCC_ENABLE_DISABLE:
 		if (STAILQ_EMPTY(&head))
 			errx(EX_USAGE, "No PMCs specified to enable or disable");
 		error = pmcc_do_enable_disable(&head);
 		break;
 	default:
 		assert(0);
 
 	}
 
 	if (error != 0)
 		err(EX_OSERR, "Command failed");
 	exit(0);
 }
Index: head/usr.sbin/zic
===================================================================
--- head/usr.sbin/zic	(revision 222812)
+++ head/usr.sbin/zic	(revision 222813)

Property changes on: head/usr.sbin/zic
___________________________________________________________________
Modified: svn:mergeinfo
## -0,0 +0,1 ##
   Merged /projects/largeSMP/usr.sbin/zic:r221273-222812
Index: head
===================================================================
--- head	(revision 222812)
+++ head	(revision 222813)

Property changes on: head
___________________________________________________________________
Modified: svn:mergeinfo
## -0,0 +0,1 ##
   Merged /projects/largeSMP:r221273-222812