Index: head/sys/amd64/sgx/sgx.c
===================================================================
--- head/sys/amd64/sgx/sgx.c (revision 349431)
+++ head/sys/amd64/sgx/sgx.c (revision 349432)
@@ -1,1220 +1,1220 @@
/*-
* Copyright (c) 2017 Ruslan Bukin
* All rights reserved.
*
* This software was developed by BAE Systems, the University of Cambridge
* Computer Laboratory, and Memorial University under DARPA/AFRL contract
* FA8650-15-C-7558 ("CADETS"), as part of the DARPA Transparent Computing
* (TC) research program.
*
* Redistribution and use in source and binary forms, with or without
* modification, are permitted provided that the following conditions
* are met:
* 1. Redistributions of source code must retain the above copyright
* notice, this list of conditions and the following disclaimer.
* 2. Redistributions in binary form must reproduce the above copyright
* notice, this list of conditions and the following disclaimer in the
* documentation and/or other materials provided with the distribution.
*
* THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
* ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
* IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
* ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
* FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
* DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
* OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
* HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
* LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
* OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
* SUCH DAMAGE.
*/
/*
* Design overview.
*
* The driver provides a character device for the mmap(2) and ioctl(2) system
* calls, allowing the user to manage isolated compartments ("enclaves") in
* user VA space.
*
* The driver's duties are EPC page management, enclave management, and user
* data validation.
*
* This driver requires Intel SGX support from hardware.
*
* /dev/sgx:
* .mmap:
* sgx_mmap_single() allocates VM object with following pager
* operations:
* a) sgx_pg_ctor():
* VM object constructor does nothing
* b) sgx_pg_dtor():
* VM object destructor destroys the SGX enclave associated
* with the object: it frees all the EPC pages allocated for
* enclave and removes the enclave.
* c) sgx_pg_fault():
* VM object fault handler does nothing
*
* .ioctl:
* sgx_ioctl():
* a) SGX_IOC_ENCLAVE_CREATE
* Adds Enclave SECS page: initial step of enclave creation.
* b) SGX_IOC_ENCLAVE_ADD_PAGE
* Adds TCS, REG pages to the enclave.
* c) SGX_IOC_ENCLAVE_INIT
* Finalizes enclave creation.
*
* Enclave lifecycle:
* .-- ECREATE -- Add SECS page
* Kernel | EADD -- Add TCS, REG pages
* space | EEXTEND -- Measure the page (take unique hash)
* ENCLS | EPA -- Allocate version array page
* '-- EINIT -- Finalize enclave creation
* User .-- EENTER -- Go to entry point of enclave
* space | EEXIT -- Exit back to main application
* ENCLU '-- ERESUME -- Resume enclave execution (e.g. after exception)
*
* Enclave lifecycle from driver point of view:
* 1) User calls mmap() on /dev/sgx: we allocate a VM object
* 2) User calls ioctl SGX_IOC_ENCLAVE_CREATE: we look for the VM object
* associated with user process created on step 1, create SECS physical
* page and store it in enclave's VM object queue by special index
* SGX_SECS_VM_OBJECT_INDEX.
* 3) User calls ioctl SGX_IOC_ENCLAVE_ADD_PAGE: we look for the enclave
* created on step 2, create a TCS or REG physical page and map it at the
* user-specified address in the enclave VM object.
* 4) User finalizes enclave creation with ioctl SGX_IOC_ENCLAVE_INIT call.
* 5) User can freely enter and exit the enclave using ENCLU instructions
* from userspace: the driver does nothing here.
* 6) User calls munmap(2) (or the process with the enclave dies):
* we destroy the enclave associated with the object.
*
* EPC page types and their indexes in VM object queue:
* - PT_SECS index is special and equals SGX_SECS_VM_OBJECT_INDEX (-1);
* - PT_TCS and PT_REG indexes are specified by user in addr field of ioctl
* request data and determined as follows:
* pidx = OFF_TO_IDX(addp->addr - vmh->base);
* - PT_VA index is special, created for PT_REG, PT_TCS and PT_SECS pages
* and determined by formula:
* va_page_idx = - SGX_VA_PAGES_OFFS - (page_idx / SGX_VA_PAGE_SLOTS);
* PT_VA page can hold versions of up to 512 pages, and slot for each
* page in PT_VA page is determined as follows:
* va_slot_idx = page_idx % SGX_VA_PAGE_SLOTS;
* - PT_TRIM is unused.
*
* Locking:
* The SGX ENCLS instructions have limitations on concurrency:
* some of them can't be executed at the same time on different CPUs.
* We use sc->mtx_encls lock around them to prevent concurrent execution.
* sc->mtx lock is used to manage list of created enclaves and the state of
* SGX driver.
*
* Eviction of EPC pages:
* Eviction support is not implemented in this driver, however the driver
* manages VA (version array) pages: it allocates a VA slot for each EPC
* page. This will be required for eviction support in future.
* VA pages and slots are currently unused.
*
* Intel® 64 and IA-32 Architectures Software Developer's Manual
* https://software.intel.com/en-us/articles/intel-sdm
*/
#include
__FBSDID("$FreeBSD$");
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#define SGX_DEBUG
#undef SGX_DEBUG
#ifdef SGX_DEBUG
#define dprintf(fmt, ...) printf(fmt, ##__VA_ARGS__)
#else
#define dprintf(fmt, ...)
#endif
static struct cdev_pager_ops sgx_pg_ops;
struct sgx_softc sgx_sc;
static int
sgx_get_epc_page(struct sgx_softc *sc, struct epc_page **epc)
{
vmem_addr_t addr;
int i;
if (vmem_alloc(sc->vmem_epc, PAGE_SIZE, M_FIRSTFIT | M_NOWAIT,
&addr) == 0) {
i = (addr - sc->epc_base) / PAGE_SIZE;
*epc = &sc->epc_pages[i];
return (0);
}
return (ENOMEM);
}
static void
sgx_put_epc_page(struct sgx_softc *sc, struct epc_page *epc)
{
vmem_addr_t addr;
if (epc == NULL)
return;
addr = (epc->index * PAGE_SIZE) + sc->epc_base;
vmem_free(sc->vmem_epc, addr, PAGE_SIZE);
}
static int
sgx_va_slot_init_by_index(struct sgx_softc *sc, vm_object_t object,
uint64_t idx)
{
struct epc_page *epc;
vm_page_t page;
vm_page_t p;
int ret;
VM_OBJECT_ASSERT_WLOCKED(object);
p = vm_page_lookup(object, idx);
if (p == NULL) {
ret = sgx_get_epc_page(sc, &epc);
if (ret) {
dprintf("%s: No free EPC pages available.\n",
__func__);
return (ret);
}
mtx_lock(&sc->mtx_encls);
sgx_epa((void *)epc->base);
mtx_unlock(&sc->mtx_encls);
page = PHYS_TO_VM_PAGE(epc->phys);
vm_page_insert(page, object, idx);
page->valid = VM_PAGE_BITS_ALL;
}
return (0);
}
static int
sgx_va_slot_init(struct sgx_softc *sc,
struct sgx_enclave *enclave,
uint64_t addr)
{
vm_pindex_t pidx;
uint64_t va_page_idx;
uint64_t idx;
vm_object_t object;
int va_slot;
int ret;
object = enclave->object;
VM_OBJECT_ASSERT_WLOCKED(object);
pidx = OFF_TO_IDX(addr);
va_slot = pidx % SGX_VA_PAGE_SLOTS;
va_page_idx = pidx / SGX_VA_PAGE_SLOTS;
idx = - SGX_VA_PAGES_OFFS - va_page_idx;
ret = sgx_va_slot_init_by_index(sc, object, idx);
return (ret);
}
static int
sgx_mem_find(struct sgx_softc *sc, uint64_t addr,
vm_map_entry_t *entry0, vm_object_t *object0)
{
vm_map_t map;
vm_map_entry_t entry;
vm_object_t object;
map = &curproc->p_vmspace->vm_map;
vm_map_lock_read(map);
if (!vm_map_lookup_entry(map, addr, &entry)) {
vm_map_unlock_read(map);
dprintf("%s: Can't find enclave.\n", __func__);
return (EINVAL);
}
object = entry->object.vm_object;
if (object == NULL || object->handle == NULL) {
vm_map_unlock_read(map);
return (EINVAL);
}
if (object->type != OBJT_MGTDEVICE ||
object->un_pager.devp.ops != &sgx_pg_ops) {
vm_map_unlock_read(map);
return (EINVAL);
}
vm_object_reference(object);
*object0 = object;
*entry0 = entry;
vm_map_unlock_read(map);
return (0);
}
static int
sgx_enclave_find(struct sgx_softc *sc, uint64_t addr,
struct sgx_enclave **encl)
{
struct sgx_vm_handle *vmh;
struct sgx_enclave *enclave;
vm_map_entry_t entry;
vm_object_t object;
int ret;
ret = sgx_mem_find(sc, addr, &entry, &object);
if (ret)
return (ret);
vmh = object->handle;
if (vmh == NULL) {
vm_object_deallocate(object);
return (EINVAL);
}
enclave = vmh->enclave;
if (enclave == NULL || enclave->object == NULL) {
vm_object_deallocate(object);
return (EINVAL);
}
*encl = enclave;
return (0);
}
static int
sgx_enclave_alloc(struct sgx_softc *sc, struct secs *secs,
struct sgx_enclave **enclave0)
{
struct sgx_enclave *enclave;
enclave = malloc(sizeof(struct sgx_enclave),
M_SGX, M_WAITOK | M_ZERO);
enclave->base = secs->base;
enclave->size = secs->size;
*enclave0 = enclave;
return (0);
}
static void
sgx_epc_page_remove(struct sgx_softc *sc,
struct epc_page *epc)
{
mtx_lock(&sc->mtx_encls);
sgx_eremove((void *)epc->base);
mtx_unlock(&sc->mtx_encls);
}
static void
sgx_page_remove(struct sgx_softc *sc, vm_page_t p)
{
struct epc_page *epc;
vm_paddr_t pa;
uint64_t offs;
vm_page_lock(p);
- vm_page_remove(p);
+ (void)vm_page_remove(p);
vm_page_unlock(p);
dprintf("%s: p->pidx %ld\n", __func__, p->pindex);
pa = VM_PAGE_TO_PHYS(p);
epc = &sc->epc_pages[0];
offs = (pa - epc->phys) / PAGE_SIZE;
epc = &sc->epc_pages[offs];
sgx_epc_page_remove(sc, epc);
sgx_put_epc_page(sc, epc);
}
static void
sgx_enclave_remove(struct sgx_softc *sc,
struct sgx_enclave *enclave)
{
vm_object_t object;
vm_page_t p, p_secs, p_next;
mtx_lock(&sc->mtx);
TAILQ_REMOVE(&sc->enclaves, enclave, next);
mtx_unlock(&sc->mtx);
object = enclave->object;
VM_OBJECT_WLOCK(object);
/*
* First remove all the pages except SECS,
* then remove SECS page.
*/
p_secs = NULL;
TAILQ_FOREACH_SAFE(p, &object->memq, listq, p_next) {
if (p->pindex == SGX_SECS_VM_OBJECT_INDEX) {
p_secs = p;
continue;
}
sgx_page_remove(sc, p);
}
/* Now remove SECS page */
if (p_secs != NULL)
sgx_page_remove(sc, p_secs);
KASSERT(TAILQ_EMPTY(&object->memq) == 1, ("not empty"));
KASSERT(object->resident_page_count == 0, ("count"));
VM_OBJECT_WUNLOCK(object);
}
static int
sgx_measure_page(struct sgx_softc *sc, struct epc_page *secs,
struct epc_page *epc, uint16_t mrmask)
{
int i, j;
int ret;
mtx_lock(&sc->mtx_encls);
for (i = 0, j = 1; i < PAGE_SIZE; i += 0x100, j <<= 1) {
if (!(j & mrmask))
continue;
ret = sgx_eextend((void *)secs->base,
(void *)(epc->base + i));
if (ret == SGX_EFAULT) {
mtx_unlock(&sc->mtx_encls);
return (ret);
}
}
mtx_unlock(&sc->mtx_encls);
return (0);
}
static int
sgx_secs_validate(struct sgx_softc *sc, struct secs *secs)
{
struct secs_attr *attr;
int i;
if (secs->size == 0)
return (EINVAL);
/* BASEADDR must be naturally aligned on an SECS.SIZE boundary. */
if (secs->base & (secs->size - 1))
return (EINVAL);
/* SECS.SIZE must be at least 2 pages. */
if (secs->size < 2 * PAGE_SIZE)
return (EINVAL);
if ((secs->size & (secs->size - 1)) != 0)
return (EINVAL);
attr = &secs->attributes;
if (attr->reserved1 != 0 ||
attr->reserved2 != 0 ||
attr->reserved3 != 0)
return (EINVAL);
for (i = 0; i < SECS_ATTR_RSV4_SIZE; i++)
if (attr->reserved4[i])
return (EINVAL);
/*
* Intel® Software Guard Extensions Programming Reference
* 6.7.2 Relevant Fields in Various Data Structures
* 6.7.2.1 SECS.ATTRIBUTES.XFRM
* XFRM[1:0] must be set to 0x3.
*/
if ((attr->xfrm & 0x3) != 0x3)
return (EINVAL);
if (!attr->mode64bit)
return (EINVAL);
if (secs->size > sc->enclave_size_max)
return (EINVAL);
for (i = 0; i < SECS_RSV1_SIZE; i++)
if (secs->reserved1[i])
return (EINVAL);
for (i = 0; i < SECS_RSV2_SIZE; i++)
if (secs->reserved2[i])
return (EINVAL);
for (i = 0; i < SECS_RSV3_SIZE; i++)
if (secs->reserved3[i])
return (EINVAL);
for (i = 0; i < SECS_RSV4_SIZE; i++)
if (secs->reserved4[i])
return (EINVAL);
return (0);
}
static int
sgx_tcs_validate(struct tcs *tcs)
{
int i;
if ((tcs->flags) ||
(tcs->ossa & (PAGE_SIZE - 1)) ||
(tcs->ofsbasgx & (PAGE_SIZE - 1)) ||
(tcs->ogsbasgx & (PAGE_SIZE - 1)) ||
((tcs->fslimit & 0xfff) != 0xfff) ||
((tcs->gslimit & 0xfff) != 0xfff))
return (EINVAL);
for (i = 0; i < nitems(tcs->reserved3); i++)
if (tcs->reserved3[i])
return (EINVAL);
return (0);
}
static void
sgx_tcs_dump(struct sgx_softc *sc, struct tcs *t)
{
dprintf("t->flags %lx\n", t->flags);
dprintf("t->ossa %lx\n", t->ossa);
dprintf("t->cssa %x\n", t->cssa);
dprintf("t->nssa %x\n", t->nssa);
dprintf("t->oentry %lx\n", t->oentry);
dprintf("t->ofsbasgx %lx\n", t->ofsbasgx);
dprintf("t->ogsbasgx %lx\n", t->ogsbasgx);
dprintf("t->fslimit %x\n", t->fslimit);
dprintf("t->gslimit %x\n", t->gslimit);
}
static int
sgx_pg_ctor(void *handle, vm_ooffset_t size, vm_prot_t prot,
vm_ooffset_t foff, struct ucred *cred, u_short *color)
{
struct sgx_vm_handle *vmh;
vmh = handle;
if (vmh == NULL) {
dprintf("%s: vmh not found.\n", __func__);
return (0);
}
dprintf("%s: vmh->base %lx foff 0x%lx size 0x%lx\n",
__func__, vmh->base, foff, size);
return (0);
}
static void
sgx_pg_dtor(void *handle)
{
struct sgx_vm_handle *vmh;
struct sgx_softc *sc;
vmh = handle;
if (vmh == NULL) {
dprintf("%s: vmh not found.\n", __func__);
return;
}
sc = vmh->sc;
if (sc == NULL) {
dprintf("%s: sc is NULL\n", __func__);
return;
}
if (vmh->enclave == NULL) {
dprintf("%s: Enclave not found.\n", __func__);
return;
}
sgx_enclave_remove(sc, vmh->enclave);
free(vmh->enclave, M_SGX);
free(vmh, M_SGX);
}
static int
sgx_pg_fault(vm_object_t object, vm_ooffset_t offset,
int prot, vm_page_t *mres)
{
/*
* The purpose of this trivial handler is to handle the race
* when user tries to access mmaped region before or during
* enclave creation ioctl calls.
*/
dprintf("%s: offset 0x%lx\n", __func__, offset);
return (VM_PAGER_FAIL);
}
static struct cdev_pager_ops sgx_pg_ops = {
.cdev_pg_ctor = sgx_pg_ctor,
.cdev_pg_dtor = sgx_pg_dtor,
.cdev_pg_fault = sgx_pg_fault,
};
static void
sgx_insert_epc_page_by_index(vm_page_t page, vm_object_t object,
vm_pindex_t pidx)
{
VM_OBJECT_ASSERT_WLOCKED(object);
vm_page_insert(page, object, pidx);
page->valid = VM_PAGE_BITS_ALL;
}
static void
sgx_insert_epc_page(struct sgx_enclave *enclave,
struct epc_page *epc, uint64_t addr)
{
vm_pindex_t pidx;
vm_page_t page;
VM_OBJECT_ASSERT_WLOCKED(enclave->object);
pidx = OFF_TO_IDX(addr);
page = PHYS_TO_VM_PAGE(epc->phys);
sgx_insert_epc_page_by_index(page, enclave->object, pidx);
}
static int
sgx_ioctl_create(struct sgx_softc *sc, struct sgx_enclave_create *param)
{
struct sgx_vm_handle *vmh;
vm_map_entry_t entry;
vm_page_t p;
struct page_info pginfo;
struct secinfo secinfo;
struct sgx_enclave *enclave;
struct epc_page *epc;
struct secs *secs;
vm_object_t object;
vm_page_t page;
int ret;
epc = NULL;
secs = NULL;
enclave = NULL;
object = NULL;
/* SGX Enclave Control Structure (SECS) */
secs = malloc(PAGE_SIZE, M_SGX, M_WAITOK | M_ZERO);
ret = copyin((void *)param->src, secs, sizeof(struct secs));
if (ret) {
dprintf("%s: Can't copy SECS.\n", __func__);
goto error;
}
ret = sgx_secs_validate(sc, secs);
if (ret) {
dprintf("%s: SECS validation failed.\n", __func__);
goto error;
}
ret = sgx_mem_find(sc, secs->base, &entry, &object);
if (ret) {
dprintf("%s: Can't find vm_map.\n", __func__);
goto error;
}
vmh = object->handle;
if (!vmh) {
dprintf("%s: Can't find vmh.\n", __func__);
ret = ENXIO;
goto error;
}
dprintf("%s: entry start %lx offset %lx\n",
__func__, entry->start, entry->offset);
vmh->base = (entry->start - entry->offset);
ret = sgx_enclave_alloc(sc, secs, &enclave);
if (ret) {
dprintf("%s: Can't alloc enclave.\n", __func__);
goto error;
}
enclave->object = object;
enclave->vmh = vmh;
memset(&secinfo, 0, sizeof(struct secinfo));
memset(&pginfo, 0, sizeof(struct page_info));
pginfo.linaddr = 0;
pginfo.srcpge = (uint64_t)secs;
pginfo.secinfo = &secinfo;
pginfo.secs = 0;
ret = sgx_get_epc_page(sc, &epc);
if (ret) {
dprintf("%s: Failed to get free epc page.\n", __func__);
goto error;
}
enclave->secs_epc_page = epc;
VM_OBJECT_WLOCK(object);
p = vm_page_lookup(object, SGX_SECS_VM_OBJECT_INDEX);
if (p) {
VM_OBJECT_WUNLOCK(object);
/* SECS page already added. */
ret = ENXIO;
goto error;
}
ret = sgx_va_slot_init_by_index(sc, object,
- SGX_VA_PAGES_OFFS - SGX_SECS_VM_OBJECT_INDEX);
if (ret) {
VM_OBJECT_WUNLOCK(object);
dprintf("%s: Can't init va slot.\n", __func__);
goto error;
}
mtx_lock(&sc->mtx);
if ((sc->state & SGX_STATE_RUNNING) == 0) {
mtx_unlock(&sc->mtx);
/* Remove VA page that was just created for SECS page. */
p = vm_page_lookup(enclave->object,
- SGX_VA_PAGES_OFFS - SGX_SECS_VM_OBJECT_INDEX);
sgx_page_remove(sc, p);
VM_OBJECT_WUNLOCK(object);
goto error;
}
mtx_lock(&sc->mtx_encls);
ret = sgx_ecreate(&pginfo, (void *)epc->base);
mtx_unlock(&sc->mtx_encls);
if (ret == SGX_EFAULT) {
dprintf("%s: gp fault\n", __func__);
mtx_unlock(&sc->mtx);
/* Remove VA page that was just created for SECS page. */
p = vm_page_lookup(enclave->object,
- SGX_VA_PAGES_OFFS - SGX_SECS_VM_OBJECT_INDEX);
sgx_page_remove(sc, p);
VM_OBJECT_WUNLOCK(object);
goto error;
}
TAILQ_INSERT_TAIL(&sc->enclaves, enclave, next);
mtx_unlock(&sc->mtx);
vmh->enclave = enclave;
page = PHYS_TO_VM_PAGE(epc->phys);
sgx_insert_epc_page_by_index(page, enclave->object,
SGX_SECS_VM_OBJECT_INDEX);
VM_OBJECT_WUNLOCK(object);
/* Release the reference. */
vm_object_deallocate(object);
free(secs, M_SGX);
return (0);
error:
free(secs, M_SGX);
sgx_put_epc_page(sc, epc);
free(enclave, M_SGX);
vm_object_deallocate(object);
return (ret);
}
static int
sgx_ioctl_add_page(struct sgx_softc *sc,
struct sgx_enclave_add_page *addp)
{
struct epc_page *secs_epc_page;
struct sgx_enclave *enclave;
struct sgx_vm_handle *vmh;
struct epc_page *epc;
struct page_info pginfo;
struct secinfo secinfo;
vm_object_t object;
void *tmp_vaddr;
uint64_t page_type;
struct tcs *t;
uint64_t addr;
uint64_t pidx;
vm_page_t p;
int ret;
tmp_vaddr = NULL;
epc = NULL;
object = NULL;
/* Find and get reference to VM object. */
ret = sgx_enclave_find(sc, addp->addr, &enclave);
if (ret) {
dprintf("%s: Failed to find enclave.\n", __func__);
goto error;
}
object = enclave->object;
KASSERT(object != NULL, ("vm object is NULL\n"));
vmh = object->handle;
ret = sgx_get_epc_page(sc, &epc);
if (ret) {
dprintf("%s: Failed to get free epc page.\n", __func__);
goto error;
}
memset(&secinfo, 0, sizeof(struct secinfo));
ret = copyin((void *)addp->secinfo, &secinfo,
sizeof(struct secinfo));
if (ret) {
dprintf("%s: Failed to copy secinfo.\n", __func__);
goto error;
}
tmp_vaddr = malloc(PAGE_SIZE, M_SGX, M_WAITOK | M_ZERO);
ret = copyin((void *)addp->src, tmp_vaddr, PAGE_SIZE);
if (ret) {
dprintf("%s: Failed to copy page.\n", __func__);
goto error;
}
page_type = (secinfo.flags & SECINFO_FLAGS_PT_M) >>
SECINFO_FLAGS_PT_S;
if (page_type != SGX_PT_TCS && page_type != SGX_PT_REG) {
dprintf("%s: page can't be added.\n", __func__);
goto error;
}
if (page_type == SGX_PT_TCS) {
t = (struct tcs *)tmp_vaddr;
ret = sgx_tcs_validate(t);
if (ret) {
dprintf("%s: TCS page validation failed.\n",
__func__);
goto error;
}
sgx_tcs_dump(sc, t);
}
addr = (addp->addr - vmh->base);
pidx = OFF_TO_IDX(addr);
VM_OBJECT_WLOCK(object);
p = vm_page_lookup(object, pidx);
if (p) {
VM_OBJECT_WUNLOCK(object);
/* Page already added. */
ret = ENXIO;
goto error;
}
ret = sgx_va_slot_init(sc, enclave, addr);
if (ret) {
VM_OBJECT_WUNLOCK(object);
dprintf("%s: Can't init va slot.\n", __func__);
goto error;
}
secs_epc_page = enclave->secs_epc_page;
memset(&pginfo, 0, sizeof(struct page_info));
pginfo.linaddr = (uint64_t)addp->addr;
pginfo.srcpge = (uint64_t)tmp_vaddr;
pginfo.secinfo = &secinfo;
pginfo.secs = (uint64_t)secs_epc_page->base;
mtx_lock(&sc->mtx_encls);
ret = sgx_eadd(&pginfo, (void *)epc->base);
if (ret == SGX_EFAULT) {
dprintf("%s: gp fault on eadd\n", __func__);
mtx_unlock(&sc->mtx_encls);
VM_OBJECT_WUNLOCK(object);
goto error;
}
mtx_unlock(&sc->mtx_encls);
ret = sgx_measure_page(sc, enclave->secs_epc_page, epc, addp->mrmask);
if (ret == SGX_EFAULT) {
dprintf("%s: gp fault on eextend\n", __func__);
sgx_epc_page_remove(sc, epc);
VM_OBJECT_WUNLOCK(object);
goto error;
}
sgx_insert_epc_page(enclave, epc, addr);
VM_OBJECT_WUNLOCK(object);
/* Release the reference. */
vm_object_deallocate(object);
free(tmp_vaddr, M_SGX);
return (0);
error:
free(tmp_vaddr, M_SGX);
sgx_put_epc_page(sc, epc);
vm_object_deallocate(object);
return (ret);
}
static int
sgx_ioctl_init(struct sgx_softc *sc, struct sgx_enclave_init *initp)
{
struct epc_page *secs_epc_page;
struct sgx_enclave *enclave;
struct thread *td;
void *tmp_vaddr;
void *einittoken;
void *sigstruct;
vm_object_t object;
int retry;
int ret;
td = curthread;
tmp_vaddr = NULL;
object = NULL;
dprintf("%s: addr %lx, sigstruct %lx, einittoken %lx\n",
__func__, initp->addr, initp->sigstruct, initp->einittoken);
/* Find and get reference to VM object. */
ret = sgx_enclave_find(sc, initp->addr, &enclave);
if (ret) {
dprintf("%s: Failed to find enclave.\n", __func__);
goto error;
}
object = enclave->object;
tmp_vaddr = malloc(PAGE_SIZE, M_SGX, M_WAITOK | M_ZERO);
sigstruct = tmp_vaddr;
einittoken = (void *)((uint64_t)sigstruct + PAGE_SIZE / 2);
ret = copyin((void *)initp->sigstruct, sigstruct,
SGX_SIGSTRUCT_SIZE);
if (ret) {
dprintf("%s: Failed to copy SIGSTRUCT page.\n", __func__);
goto error;
}
ret = copyin((void *)initp->einittoken, einittoken,
SGX_EINITTOKEN_SIZE);
if (ret) {
dprintf("%s: Failed to copy EINITTOKEN page.\n", __func__);
goto error;
}
secs_epc_page = enclave->secs_epc_page;
retry = 16;
do {
mtx_lock(&sc->mtx_encls);
ret = sgx_einit(sigstruct, (void *)secs_epc_page->base,
einittoken);
mtx_unlock(&sc->mtx_encls);
dprintf("%s: sgx_einit returned %d\n", __func__, ret);
} while (ret == SGX_UNMASKED_EVENT && retry--);
if (ret) {
dprintf("%s: Failed init enclave: %d\n", __func__, ret);
td->td_retval[0] = ret;
ret = 0;
}
error:
free(tmp_vaddr, M_SGX);
/* Release the reference. */
vm_object_deallocate(object);
return (ret);
}
static int
sgx_ioctl(struct cdev *dev, u_long cmd, caddr_t addr, int flags,
struct thread *td)
{
struct sgx_enclave_add_page *addp;
struct sgx_enclave_create *param;
struct sgx_enclave_init *initp;
struct sgx_softc *sc;
int ret;
int len;
sc = &sgx_sc;
len = IOCPARM_LEN(cmd);
dprintf("%s: cmd %lx, addr %lx, len %d\n",
__func__, cmd, (uint64_t)addr, len);
if (len > SGX_IOCTL_MAX_DATA_LEN)
return (EINVAL);
switch (cmd) {
case SGX_IOC_ENCLAVE_CREATE:
param = (struct sgx_enclave_create *)addr;
ret = sgx_ioctl_create(sc, param);
break;
case SGX_IOC_ENCLAVE_ADD_PAGE:
addp = (struct sgx_enclave_add_page *)addr;
ret = sgx_ioctl_add_page(sc, addp);
break;
case SGX_IOC_ENCLAVE_INIT:
initp = (struct sgx_enclave_init *)addr;
ret = sgx_ioctl_init(sc, initp);
break;
default:
return (EINVAL);
}
return (ret);
}
static int
sgx_mmap_single(struct cdev *cdev, vm_ooffset_t *offset,
vm_size_t mapsize, struct vm_object **objp, int nprot)
{
struct sgx_vm_handle *vmh;
struct sgx_softc *sc;
sc = &sgx_sc;
dprintf("%s: mapsize 0x%lx, offset %lx\n",
__func__, mapsize, *offset);
vmh = malloc(sizeof(struct sgx_vm_handle),
M_SGX, M_WAITOK | M_ZERO);
vmh->sc = sc;
vmh->size = mapsize;
vmh->mem = cdev_pager_allocate(vmh, OBJT_MGTDEVICE, &sgx_pg_ops,
mapsize, nprot, *offset, NULL);
if (vmh->mem == NULL) {
free(vmh, M_SGX);
return (ENOMEM);
}
VM_OBJECT_WLOCK(vmh->mem);
vm_object_set_flag(vmh->mem, OBJ_PG_DTOR);
VM_OBJECT_WUNLOCK(vmh->mem);
*objp = vmh->mem;
return (0);
}
static struct cdevsw sgx_cdevsw = {
.d_version = D_VERSION,
.d_ioctl = sgx_ioctl,
.d_mmap_single = sgx_mmap_single,
.d_name = "Intel SGX",
};
static int
sgx_get_epc_area(struct sgx_softc *sc)
{
vm_offset_t epc_base_vaddr;
u_int cp[4];
int error;
int i;
cpuid_count(SGX_CPUID, 0x2, cp);
sc->epc_base = ((uint64_t)(cp[1] & 0xfffff) << 32) +
(cp[0] & 0xfffff000);
sc->epc_size = ((uint64_t)(cp[3] & 0xfffff) << 32) +
(cp[2] & 0xfffff000);
sc->npages = sc->epc_size / SGX_PAGE_SIZE;
if (sc->epc_size == 0 || sc->epc_base == 0) {
printf("%s: Incorrect EPC data: EPC base %lx, size %lu\n",
__func__, sc->epc_base, sc->epc_size);
return (EINVAL);
}
if (cp[3] & 0xffff)
sc->enclave_size_max = (1 << ((cp[3] >> 8) & 0xff));
else
sc->enclave_size_max = SGX_ENCL_SIZE_MAX_DEF;
epc_base_vaddr = (vm_offset_t)pmap_mapdev_attr(sc->epc_base,
sc->epc_size, VM_MEMATTR_DEFAULT);
sc->epc_pages = malloc(sizeof(struct epc_page) * sc->npages,
M_DEVBUF, M_WAITOK | M_ZERO);
for (i = 0; i < sc->npages; i++) {
sc->epc_pages[i].base = epc_base_vaddr + SGX_PAGE_SIZE * i;
sc->epc_pages[i].phys = sc->epc_base + SGX_PAGE_SIZE * i;
sc->epc_pages[i].index = i;
}
sc->vmem_epc = vmem_create("SGX EPC", sc->epc_base, sc->epc_size,
PAGE_SIZE, PAGE_SIZE, M_FIRSTFIT | M_WAITOK);
if (sc->vmem_epc == NULL) {
printf("%s: Can't create vmem arena.\n", __func__);
free(sc->epc_pages, M_SGX);
return (EINVAL);
}
error = vm_phys_fictitious_reg_range(sc->epc_base,
sc->epc_base + sc->epc_size, VM_MEMATTR_DEFAULT);
if (error) {
printf("%s: Can't register fictitious space.\n", __func__);
free(sc->epc_pages, M_SGX);
return (EINVAL);
}
return (0);
}
static void
sgx_put_epc_area(struct sgx_softc *sc)
{
vm_phys_fictitious_unreg_range(sc->epc_base,
sc->epc_base + sc->epc_size);
free(sc->epc_pages, M_SGX);
}
static int
sgx_load(void)
{
struct sgx_softc *sc;
int error;
sc = &sgx_sc;
if ((cpu_stdext_feature & CPUID_STDEXT_SGX) == 0)
return (ENXIO);
error = sgx_get_epc_area(sc);
if (error) {
printf("%s: Failed to get Processor Reserved Memory area.\n",
__func__);
return (ENXIO);
}
mtx_init(&sc->mtx_encls, "SGX ENCLS", NULL, MTX_DEF);
mtx_init(&sc->mtx, "SGX driver", NULL, MTX_DEF);
TAILQ_INIT(&sc->enclaves);
sc->sgx_cdev = make_dev(&sgx_cdevsw, 0, UID_ROOT, GID_WHEEL,
0600, "isgx");
sc->state |= SGX_STATE_RUNNING;
printf("SGX initialized: EPC base 0x%lx size %ld (%d pages)\n",
sc->epc_base, sc->epc_size, sc->npages);
return (0);
}
static int
sgx_unload(void)
{
struct sgx_softc *sc;
sc = &sgx_sc;
if ((sc->state & SGX_STATE_RUNNING) == 0)
return (0);
mtx_lock(&sc->mtx);
if (!TAILQ_EMPTY(&sc->enclaves)) {
mtx_unlock(&sc->mtx);
return (EBUSY);
}
sc->state &= ~SGX_STATE_RUNNING;
mtx_unlock(&sc->mtx);
destroy_dev(sc->sgx_cdev);
vmem_destroy(sc->vmem_epc);
sgx_put_epc_area(sc);
mtx_destroy(&sc->mtx_encls);
mtx_destroy(&sc->mtx);
return (0);
}
static int
sgx_handler(module_t mod, int what, void *arg)
{
int error;
switch (what) {
case MOD_LOAD:
error = sgx_load();
break;
case MOD_UNLOAD:
error = sgx_unload();
break;
default:
error = 0;
break;
}
return (error);
}
static moduledata_t sgx_kmod = {
"sgx",
sgx_handler,
NULL
};
DECLARE_MODULE(sgx, sgx_kmod, SI_SUB_LAST, SI_ORDER_ANY);
MODULE_VERSION(sgx, 1);
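The design overview comment in sgx.c above gives two formulas for locating the version-array (VA) page and slot that track a given enclave page. The following is a minimal standalone sketch of that arithmetic, not code from the driver. The `EX_`-prefixed names are hypothetical, the 512-slot figure comes from the overview comment, and the value used for the VA page offset is an assumption because its definition is not part of this diff.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative constants; the real ones live in the driver's headers. */
#define EX_PAGE_SHIFT		12	/* 4 KiB pages, as on amd64 */
#define EX_VA_PAGE_SLOTS	512	/* overview: one VA page tracks 512 pages */
#define EX_VA_PAGES_OFFS	512	/* ASSUMPTION: value not shown in this diff */

struct ex_va_loc {
	uint64_t va_page_idx;	/* index of the VA page in the VM object queue */
	int	 va_slot;	/* slot inside that VA page */
};

/*
 * Map an offset inside the enclave (address minus enclave base) to the
 * VA page index and slot, following the formulas in the overview:
 *   va_page_idx = - SGX_VA_PAGES_OFFS - (page_idx / SGX_VA_PAGE_SLOTS)
 *   va_slot_idx = page_idx % SGX_VA_PAGE_SLOTS
 * The index is kept in an unsigned 64-bit value, so the negative result
 * wraps around to the top of the index space, far from ordinary pages.
 */
static struct ex_va_loc
ex_va_locate(uint64_t enclave_offset)
{
	struct ex_va_loc loc;
	uint64_t pidx;

	pidx = enclave_offset >> EX_PAGE_SHIFT;	/* same role as OFF_TO_IDX() */
	loc.va_slot = (int)(pidx % EX_VA_PAGE_SLOTS);
	loc.va_page_idx = -(uint64_t)EX_VA_PAGES_OFFS -
	    (pidx / EX_VA_PAGE_SLOTS);
	return (loc);
}
```

Under these assumed constants, the page at enclave offset 513 * 4096 has page index 513, so it lands in slot 1 of the VA page whose index reads as -513 when interpreted as a signed value. Keeping VA pages at such wrapped-around indexes means they can share one VM object with TCS and REG pages without colliding with user-chosen page indexes.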
Index: head/sys/dev/drm2/ttm/ttm_bo_vm.c
===================================================================
--- head/sys/dev/drm2/ttm/ttm_bo_vm.c (revision 349431)
+++ head/sys/dev/drm2/ttm/ttm_bo_vm.c (revision 349432)
@@ -1,563 +1,563 @@
/**************************************************************************
*
* Copyright (c) 2006-2009 VMware, Inc., Palo Alto, CA., USA
* All Rights Reserved.
*
* Permission is hereby granted, free of charge, to any person obtaining a
* copy of this software and associated documentation files (the
* "Software"), to deal in the Software without restriction, including
* without limitation the rights to use, copy, modify, merge, publish,
* distribute, sub license, and/or sell copies of the Software, and to
* permit persons to whom the Software is furnished to do so, subject to
* the following conditions:
*
* The above copyright notice and this permission notice (including the
* next paragraph) shall be included in all copies or substantial portions
* of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
* IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
* FITNESS FOR A PARTICULAR PURPOSE AND NON-INFRINGEMENT. IN NO EVENT SHALL
* THE COPYRIGHT HOLDERS, AUTHORS AND/OR ITS SUPPLIERS BE LIABLE FOR ANY CLAIM,
* DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR
* OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE
* USE OR OTHER DEALINGS IN THE SOFTWARE.
*
**************************************************************************/
/*
* Authors: Thomas Hellstrom
*/
/*
* Copyright (c) 2013 The FreeBSD Foundation
* All rights reserved.
*
* Portions of this software were developed by Konstantin Belousov
* under sponsorship from the FreeBSD Foundation.
*/
#include
__FBSDID("$FreeBSD$");
#include "opt_vm.h"
#include
#include
#include
#include
#include
#include
#include
#define TTM_BO_VM_NUM_PREFAULT 16
RB_GENERATE(ttm_bo_device_buffer_objects, ttm_buffer_object, vm_rb,
ttm_bo_cmp_rb_tree_items);
int
ttm_bo_cmp_rb_tree_items(struct ttm_buffer_object *a,
struct ttm_buffer_object *b)
{
if (a->vm_node->start < b->vm_node->start) {
return (-1);
} else if (a->vm_node->start > b->vm_node->start) {
return (1);
} else {
return (0);
}
}
static struct ttm_buffer_object *ttm_bo_vm_lookup_rb(struct ttm_bo_device *bdev,
unsigned long page_start,
unsigned long num_pages)
{
unsigned long cur_offset;
struct ttm_buffer_object *bo;
struct ttm_buffer_object *best_bo = NULL;
bo = RB_ROOT(&bdev->addr_space_rb);
while (bo != NULL) {
cur_offset = bo->vm_node->start;
if (page_start >= cur_offset) {
best_bo = bo;
if (page_start == cur_offset)
break;
bo = RB_RIGHT(bo, vm_rb);
} else
bo = RB_LEFT(bo, vm_rb);
}
if (unlikely(best_bo == NULL))
return NULL;
if (unlikely((best_bo->vm_node->start + best_bo->num_pages) <
(page_start + num_pages)))
return NULL;
return best_bo;
}
static int
ttm_bo_vm_fault(vm_object_t vm_obj, vm_ooffset_t offset,
int prot, vm_page_t *mres)
{
struct ttm_buffer_object *bo = vm_obj->handle;
struct ttm_bo_device *bdev = bo->bdev;
struct ttm_tt *ttm = NULL;
vm_page_t m, m1;
int ret;
int retval = VM_PAGER_OK;
struct ttm_mem_type_manager *man =
&bdev->man[bo->mem.mem_type];
vm_object_pip_add(vm_obj, 1);
if (*mres != NULL) {
vm_page_lock(*mres);
- vm_page_remove(*mres);
+ (void)vm_page_remove(*mres);
vm_page_unlock(*mres);
}
retry:
VM_OBJECT_WUNLOCK(vm_obj);
m = NULL;
reserve:
ret = ttm_bo_reserve(bo, false, false, false, 0);
if (unlikely(ret != 0)) {
if (ret == -EBUSY) {
kern_yield(PRI_USER);
goto reserve;
}
}
if (bdev->driver->fault_reserve_notify) {
ret = bdev->driver->fault_reserve_notify(bo);
switch (ret) {
case 0:
break;
case -EBUSY:
case -ERESTARTSYS:
case -EINTR:
kern_yield(PRI_USER);
goto reserve;
default:
retval = VM_PAGER_ERROR;
goto out_unlock;
}
}
/*
* Wait for buffer data in transit, due to a pipelined
* move.
*/
mtx_lock(&bdev->fence_lock);
if (test_bit(TTM_BO_PRIV_FLAG_MOVING, &bo->priv_flags)) {
/*
* Here, the behavior differs between Linux and FreeBSD.
*
* On Linux, the wait is interruptible (3rd argument to
* ttm_bo_wait). There must be some mechanism to resume
* page fault handling, once the signal is processed.
*
* On FreeBSD, the wait is uninterruptible. This is not a
* problem as we can't end up with an unkillable process
* here, because the wait will eventually time out.
*
* An example of this situation is the Xorg process
* which uses SIGALRM internally. The signal could
* interrupt the wait, causing the page fault to fail
* and the process to receive SIGSEGV.
*/
ret = ttm_bo_wait(bo, false, false, false);
mtx_unlock(&bdev->fence_lock);
if (unlikely(ret != 0)) {
retval = VM_PAGER_ERROR;
goto out_unlock;
}
} else
mtx_unlock(&bdev->fence_lock);
ret = ttm_mem_io_lock(man, true);
if (unlikely(ret != 0)) {
retval = VM_PAGER_ERROR;
goto out_unlock;
}
ret = ttm_mem_io_reserve_vm(bo);
if (unlikely(ret != 0)) {
retval = VM_PAGER_ERROR;
goto out_io_unlock;
}
/*
* Strictly, we're not allowed to modify vma->vm_page_prot here,
* since the mmap_sem is only held in read mode. However, we
* modify only the caching bits of vma->vm_page_prot and
* consider those bits protected by
* the bo->mutex, as we should be the only writers.
* There shouldn't really be any readers of these bits except
* within vm_insert_mixed()? fork?
*
* TODO: Add a list of vmas to the bo, and change the
* vma->vm_page_prot when the object changes caching policy, with
* the correct locks held.
*/
if (!bo->mem.bus.is_iomem) {
/* Allocate all pages at once, the most common usage */
ttm = bo->ttm;
if (ttm->bdev->driver->ttm_tt_populate(ttm)) {
retval = VM_PAGER_ERROR;
goto out_io_unlock;
}
}
if (bo->mem.bus.is_iomem) {
m = PHYS_TO_VM_PAGE(bo->mem.bus.base + bo->mem.bus.offset +
offset);
KASSERT((m->flags & PG_FICTITIOUS) != 0,
("physical address %#jx not fictitious",
(uintmax_t)(bo->mem.bus.base + bo->mem.bus.offset
+ offset)));
pmap_page_set_memattr(m, ttm_io_prot(bo->mem.placement));
} else {
ttm = bo->ttm;
m = ttm->pages[OFF_TO_IDX(offset)];
if (unlikely(!m)) {
retval = VM_PAGER_ERROR;
goto out_io_unlock;
}
pmap_page_set_memattr(m,
(bo->mem.placement & TTM_PL_FLAG_CACHED) ?
VM_MEMATTR_WRITE_BACK : ttm_io_prot(bo->mem.placement));
}
VM_OBJECT_WLOCK(vm_obj);
if (vm_page_busied(m)) {
vm_page_lock(m);
VM_OBJECT_WUNLOCK(vm_obj);
vm_page_busy_sleep(m, "ttmpbs", false);
VM_OBJECT_WLOCK(vm_obj);
ttm_mem_io_unlock(man);
ttm_bo_unreserve(bo);
goto retry;
}
m1 = vm_page_lookup(vm_obj, OFF_TO_IDX(offset));
if (m1 == NULL) {
if (vm_page_insert(m, vm_obj, OFF_TO_IDX(offset))) {
VM_OBJECT_WUNLOCK(vm_obj);
vm_wait(vm_obj);
VM_OBJECT_WLOCK(vm_obj);
ttm_mem_io_unlock(man);
ttm_bo_unreserve(bo);
goto retry;
}
} else {
KASSERT(m == m1,
("inconsistent insert bo %p m %p m1 %p offset %jx",
bo, m, m1, (uintmax_t)offset));
}
m->valid = VM_PAGE_BITS_ALL;
vm_page_xbusy(m);
if (*mres != NULL) {
KASSERT(*mres != m, ("losing %p %p", *mres, m));
vm_page_lock(*mres);
vm_page_free(*mres);
vm_page_unlock(*mres);
}
*mres = m;
out_io_unlock1:
ttm_mem_io_unlock(man);
out_unlock1:
ttm_bo_unreserve(bo);
vm_object_pip_wakeup(vm_obj);
return (retval);
out_io_unlock:
VM_OBJECT_WLOCK(vm_obj);
goto out_io_unlock1;
out_unlock:
VM_OBJECT_WLOCK(vm_obj);
goto out_unlock1;
}
static int
ttm_bo_vm_ctor(void *handle, vm_ooffset_t size, vm_prot_t prot,
vm_ooffset_t foff, struct ucred *cred, u_short *color)
{
/*
* On Linux, a reference to the buffer object is acquired here.
* The reason is that this function is not called when the
* mmap() is initialized, but only when a process forks for
* instance. Therefore on Linux, the reference on the bo is
* acquired either in ttm_bo_mmap() or ttm_bo_vm_open(). It's
* then released in ttm_bo_vm_close().
*
* Here, this function is called during mmap() initialization.
* Thus, the reference acquired in ttm_bo_mmap_single() is
* sufficient.
*/
*color = 0;
return (0);
}
static void
ttm_bo_vm_dtor(void *handle)
{
struct ttm_buffer_object *bo = handle;
ttm_bo_unref(&bo);
}
static struct cdev_pager_ops ttm_pager_ops = {
.cdev_pg_fault = ttm_bo_vm_fault,
.cdev_pg_ctor = ttm_bo_vm_ctor,
.cdev_pg_dtor = ttm_bo_vm_dtor
};
int
ttm_bo_mmap_single(struct ttm_bo_device *bdev, vm_ooffset_t *offset, vm_size_t size,
struct vm_object **obj_res, int nprot)
{
struct ttm_bo_driver *driver;
struct ttm_buffer_object *bo;
struct vm_object *vm_obj;
int ret;
rw_wlock(&bdev->vm_lock);
bo = ttm_bo_vm_lookup_rb(bdev, OFF_TO_IDX(*offset), OFF_TO_IDX(size));
if (likely(bo != NULL))
refcount_acquire(&bo->kref);
rw_wunlock(&bdev->vm_lock);
if (unlikely(bo == NULL)) {
printf("[TTM] Could not find buffer object to map\n");
return (-EINVAL);
}
driver = bo->bdev->driver;
if (unlikely(!driver->verify_access)) {
ret = -EPERM;
goto out_unref;
}
ret = driver->verify_access(bo);
if (unlikely(ret != 0))
goto out_unref;
vm_obj = cdev_pager_allocate(bo, OBJT_MGTDEVICE, &ttm_pager_ops,
size, nprot, 0, curthread->td_ucred);
if (vm_obj == NULL) {
ret = -EINVAL;
goto out_unref;
}
/*
* Note: We're transferring the bo reference to vm_obj->handle here.
*/
*offset = 0;
*obj_res = vm_obj;
return 0;
out_unref:
ttm_bo_unref(&bo);
return ret;
}
void
ttm_bo_release_mmap(struct ttm_buffer_object *bo)
{
vm_object_t vm_obj;
vm_page_t m;
int i;
vm_obj = cdev_pager_lookup(bo);
if (vm_obj == NULL)
return;
VM_OBJECT_WLOCK(vm_obj);
retry:
for (i = 0; i < bo->num_pages; i++) {
m = vm_page_lookup(vm_obj, i);
if (m == NULL)
continue;
if (vm_page_sleep_if_busy(m, "ttm_unm"))
goto retry;
cdev_pager_free_page(vm_obj, m);
}
VM_OBJECT_WUNLOCK(vm_obj);
vm_object_deallocate(vm_obj);
}
#if 0
int ttm_fbdev_mmap(struct vm_area_struct *vma, struct ttm_buffer_object *bo)
{
if (vma->vm_pgoff != 0)
return -EACCES;
vma->vm_ops = &ttm_bo_vm_ops;
vma->vm_private_data = ttm_bo_reference(bo);
vma->vm_flags |= VM_IO | VM_MIXEDMAP | VM_DONTEXPAND;
return 0;
}
ssize_t ttm_bo_io(struct ttm_bo_device *bdev, struct file *filp,
const char __user *wbuf, char __user *rbuf, size_t count,
loff_t *f_pos, bool write)
{
struct ttm_buffer_object *bo;
struct ttm_bo_driver *driver;
struct ttm_bo_kmap_obj map;
unsigned long dev_offset = (*f_pos >> PAGE_SHIFT);
unsigned long kmap_offset;
unsigned long kmap_end;
unsigned long kmap_num;
size_t io_size;
unsigned int page_offset;
char *virtual;
int ret;
bool no_wait = false;
bool dummy;
read_lock(&bdev->vm_lock);
bo = ttm_bo_vm_lookup_rb(bdev, dev_offset, 1);
if (likely(bo != NULL))
ttm_bo_reference(bo);
read_unlock(&bdev->vm_lock);
if (unlikely(bo == NULL))
return -EFAULT;
driver = bo->bdev->driver;
if (unlikely(!driver->verify_access)) {
ret = -EPERM;
goto out_unref;
}
ret = driver->verify_access(bo, filp);
if (unlikely(ret != 0))
goto out_unref;
kmap_offset = dev_offset - bo->vm_node->start;
if (unlikely(kmap_offset >= bo->num_pages)) {
ret = -EFBIG;
goto out_unref;
}
page_offset = *f_pos & ~PAGE_MASK;
io_size = bo->num_pages - kmap_offset;
io_size = (io_size << PAGE_SHIFT) - page_offset;
if (count < io_size)
io_size = count;
kmap_end = (*f_pos + count - 1) >> PAGE_SHIFT;
kmap_num = kmap_end - kmap_offset + 1;
ret = ttm_bo_reserve(bo, true, no_wait, false, 0);
switch (ret) {
case 0:
break;
case -EBUSY:
ret = -EAGAIN;
goto out_unref;
default:
goto out_unref;
}
ret = ttm_bo_kmap(bo, kmap_offset, kmap_num, &map);
if (unlikely(ret != 0)) {
ttm_bo_unreserve(bo);
goto out_unref;
}
virtual = ttm_kmap_obj_virtual(&map, &dummy);
virtual += page_offset;
if (write)
ret = copy_from_user(virtual, wbuf, io_size);
else
ret = copy_to_user(rbuf, virtual, io_size);
ttm_bo_kunmap(&map);
ttm_bo_unreserve(bo);
ttm_bo_unref(&bo);
if (unlikely(ret != 0))
return -EFBIG;
*f_pos += io_size;
return io_size;
out_unref:
ttm_bo_unref(&bo);
return ret;
}
ssize_t ttm_bo_fbdev_io(struct ttm_buffer_object *bo, const char __user *wbuf,
char __user *rbuf, size_t count, loff_t *f_pos,
bool write)
{
struct ttm_bo_kmap_obj map;
unsigned long kmap_offset;
unsigned long kmap_end;
unsigned long kmap_num;
size_t io_size;
unsigned int page_offset;
char *virtual;
int ret;
bool no_wait = false;
bool dummy;
kmap_offset = (*f_pos >> PAGE_SHIFT);
if (unlikely(kmap_offset >= bo->num_pages))
return -EFBIG;
page_offset = *f_pos & ~PAGE_MASK;
io_size = bo->num_pages - kmap_offset;
io_size = (io_size << PAGE_SHIFT) - page_offset;
if (count < io_size)
io_size = count;
kmap_end = (*f_pos + count - 1) >> PAGE_SHIFT;
kmap_num = kmap_end - kmap_offset + 1;
ret = ttm_bo_reserve(bo, true, no_wait, false, 0);
switch (ret) {
case 0:
break;
case -EBUSY:
return -EAGAIN;
default:
return ret;
}
ret = ttm_bo_kmap(bo, kmap_offset, kmap_num, &map);
if (unlikely(ret != 0)) {
ttm_bo_unreserve(bo);
return ret;
}
virtual = ttm_kmap_obj_virtual(&map, &dummy);
virtual += page_offset;
if (write)
ret = copy_from_user(virtual, wbuf, io_size);
else
ret = copy_to_user(rbuf, virtual, io_size);
ttm_bo_kunmap(&map);
ttm_bo_unreserve(bo);
ttm_bo_unref(&bo);
if (unlikely(ret != 0))
return ret;
*f_pos += io_size;
return io_size;
}
#endif
Index: head/sys/vm/device_pager.c
===================================================================
--- head/sys/vm/device_pager.c (revision 349431)
+++ head/sys/vm/device_pager.c (revision 349432)
@@ -1,471 +1,471 @@
/*-
* SPDX-License-Identifier: BSD-3-Clause
*
* Copyright (c) 1990 University of Utah.
* Copyright (c) 1991, 1993
* The Regents of the University of California. All rights reserved.
*
* This code is derived from software contributed to Berkeley by
* the Systems Programming Group of the University of Utah Computer
* Science Department.
*
* Redistribution and use in source and binary forms, with or without
* modification, are permitted provided that the following conditions
* are met:
* 1. Redistributions of source code must retain the above copyright
* notice, this list of conditions and the following disclaimer.
* 2. Redistributions in binary form must reproduce the above copyright
* notice, this list of conditions and the following disclaimer in the
* documentation and/or other materials provided with the distribution.
* 3. Neither the name of the University nor the names of its contributors
* may be used to endorse or promote products derived from this software
* without specific prior written permission.
*
* THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
* ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
* IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
* ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
* FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
* DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
* OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
* HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
* LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
* OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
* SUCH DAMAGE.
*
* @(#)device_pager.c 8.1 (Berkeley) 6/11/93
*/
#include
__FBSDID("$FreeBSD$");
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
static void dev_pager_init(void);
static vm_object_t dev_pager_alloc(void *, vm_ooffset_t, vm_prot_t,
vm_ooffset_t, struct ucred *);
static void dev_pager_dealloc(vm_object_t);
static int dev_pager_getpages(vm_object_t, vm_page_t *, int, int *, int *);
static void dev_pager_putpages(vm_object_t, vm_page_t *, int, int, int *);
static boolean_t dev_pager_haspage(vm_object_t, vm_pindex_t, int *, int *);
static void dev_pager_free_page(vm_object_t object, vm_page_t m);
static int dev_pager_populate(vm_object_t object, vm_pindex_t pidx,
int fault_type, vm_prot_t, vm_pindex_t *first, vm_pindex_t *last);
/* list of device pager objects */
static struct pagerlst dev_pager_object_list;
/* protect list manipulation */
static struct mtx dev_pager_mtx;
struct pagerops devicepagerops = {
.pgo_init = dev_pager_init,
.pgo_alloc = dev_pager_alloc,
.pgo_dealloc = dev_pager_dealloc,
.pgo_getpages = dev_pager_getpages,
.pgo_putpages = dev_pager_putpages,
.pgo_haspage = dev_pager_haspage,
};
struct pagerops mgtdevicepagerops = {
.pgo_alloc = dev_pager_alloc,
.pgo_dealloc = dev_pager_dealloc,
.pgo_getpages = dev_pager_getpages,
.pgo_putpages = dev_pager_putpages,
.pgo_haspage = dev_pager_haspage,
.pgo_populate = dev_pager_populate,
};
static int old_dev_pager_ctor(void *handle, vm_ooffset_t size, vm_prot_t prot,
vm_ooffset_t foff, struct ucred *cred, u_short *color);
static void old_dev_pager_dtor(void *handle);
static int old_dev_pager_fault(vm_object_t object, vm_ooffset_t offset,
int prot, vm_page_t *mres);
static struct cdev_pager_ops old_dev_pager_ops = {
.cdev_pg_ctor = old_dev_pager_ctor,
.cdev_pg_dtor = old_dev_pager_dtor,
.cdev_pg_fault = old_dev_pager_fault
};
static void
dev_pager_init(void)
{
TAILQ_INIT(&dev_pager_object_list);
mtx_init(&dev_pager_mtx, "dev_pager list", NULL, MTX_DEF);
}
vm_object_t
cdev_pager_lookup(void *handle)
{
vm_object_t object;
mtx_lock(&dev_pager_mtx);
object = vm_pager_object_lookup(&dev_pager_object_list, handle);
mtx_unlock(&dev_pager_mtx);
return (object);
}
vm_object_t
cdev_pager_allocate(void *handle, enum obj_type tp, struct cdev_pager_ops *ops,
vm_ooffset_t size, vm_prot_t prot, vm_ooffset_t foff, struct ucred *cred)
{
vm_object_t object, object1;
vm_pindex_t pindex;
u_short color;
if (tp != OBJT_DEVICE && tp != OBJT_MGTDEVICE)
return (NULL);
KASSERT(tp == OBJT_MGTDEVICE || ops->cdev_pg_populate == NULL,
("populate on unmanaged device pager"));
/*
* Offset should be page aligned.
*/
if (foff & PAGE_MASK)
return (NULL);
/*
* Treat the mmap(2) file offset as an unsigned value for a
* device mapping. This, in effect, allows a user to pass all
* possible off_t values as the mapping cookie to the driver. At
* this point, we know that both foff and size are a multiple
* of the page size. Do a check to avoid wrap.
*/
size = round_page(size);
pindex = OFF_TO_IDX(foff) + OFF_TO_IDX(size);
if (pindex > OBJ_MAX_SIZE || pindex < OFF_TO_IDX(foff) ||
pindex < OFF_TO_IDX(size))
return (NULL);
if (ops->cdev_pg_ctor(handle, size, prot, foff, cred, &color) != 0)
return (NULL);
mtx_lock(&dev_pager_mtx);
/*
* Look up pager, creating as necessary.
*/
object1 = NULL;
object = vm_pager_object_lookup(&dev_pager_object_list, handle);
if (object == NULL) {
/*
* Allocate object and associate it with the pager. Initialize
* the object's pg_color based upon the physical address of the
* device's memory.
*/
mtx_unlock(&dev_pager_mtx);
object1 = vm_object_allocate(tp, pindex);
object1->flags |= OBJ_COLORED;
object1->pg_color = color;
object1->handle = handle;
object1->un_pager.devp.ops = ops;
object1->un_pager.devp.dev = handle;
TAILQ_INIT(&object1->un_pager.devp.devp_pglist);
mtx_lock(&dev_pager_mtx);
object = vm_pager_object_lookup(&dev_pager_object_list, handle);
if (object != NULL) {
/*
* We raced with another thread while allocating the object.
*/
if (pindex > object->size)
object->size = pindex;
KASSERT(object->type == tp,
("Inconsistent device pager type %p %d",
object, tp));
KASSERT(object->un_pager.devp.ops == ops,
("Inconsistent devops %p %p", object, ops));
} else {
object = object1;
object1 = NULL;
object->handle = handle;
TAILQ_INSERT_TAIL(&dev_pager_object_list, object,
pager_object_list);
if (ops->cdev_pg_populate != NULL)
vm_object_set_flag(object, OBJ_POPULATE);
}
} else {
if (pindex > object->size)
object->size = pindex;
KASSERT(object->type == tp,
("Inconsistent device pager type %p %d", object, tp));
}
mtx_unlock(&dev_pager_mtx);
if (object1 != NULL) {
object1->handle = object1;
mtx_lock(&dev_pager_mtx);
TAILQ_INSERT_TAIL(&dev_pager_object_list, object1,
pager_object_list);
mtx_unlock(&dev_pager_mtx);
vm_object_deallocate(object1);
}
return (object);
}
static vm_object_t
dev_pager_alloc(void *handle, vm_ooffset_t size, vm_prot_t prot,
vm_ooffset_t foff, struct ucred *cred)
{
return (cdev_pager_allocate(handle, OBJT_DEVICE, &old_dev_pager_ops,
size, prot, foff, cred));
}
void
cdev_pager_free_page(vm_object_t object, vm_page_t m)
{
VM_OBJECT_ASSERT_WLOCKED(object);
if (object->type == OBJT_MGTDEVICE) {
KASSERT((m->oflags & VPO_UNMANAGED) == 0, ("unmanaged %p", m));
pmap_remove_all(m);
vm_page_lock(m);
- vm_page_remove(m);
+ (void)vm_page_remove(m);
vm_page_unlock(m);
} else if (object->type == OBJT_DEVICE)
dev_pager_free_page(object, m);
}
static void
dev_pager_free_page(vm_object_t object, vm_page_t m)
{
VM_OBJECT_ASSERT_WLOCKED(object);
KASSERT((object->type == OBJT_DEVICE &&
(m->oflags & VPO_UNMANAGED) != 0),
("Managed device or page obj %p m %p", object, m));
TAILQ_REMOVE(&object->un_pager.devp.devp_pglist, m, plinks.q);
vm_page_putfake(m);
}
static void
dev_pager_dealloc(vm_object_t object)
{
vm_page_t m;
VM_OBJECT_WUNLOCK(object);
object->un_pager.devp.ops->cdev_pg_dtor(object->un_pager.devp.dev);
mtx_lock(&dev_pager_mtx);
TAILQ_REMOVE(&dev_pager_object_list, object, pager_object_list);
mtx_unlock(&dev_pager_mtx);
VM_OBJECT_WLOCK(object);
if (object->type == OBJT_DEVICE) {
/*
* Free up our fake pages.
*/
while ((m = TAILQ_FIRST(&object->un_pager.devp.devp_pglist))
!= NULL)
dev_pager_free_page(object, m);
}
object->handle = NULL;
object->type = OBJT_DEAD;
}
static int
dev_pager_getpages(vm_object_t object, vm_page_t *ma, int count, int *rbehind,
int *rahead)
{
int error;
/* Since our haspage reports zero after/before, the count is 1. */
KASSERT(count == 1, ("%s: count %d", __func__, count));
VM_OBJECT_ASSERT_WLOCKED(object);
if (object->un_pager.devp.ops->cdev_pg_fault == NULL)
return (VM_PAGER_FAIL);
error = object->un_pager.devp.ops->cdev_pg_fault(object,
IDX_TO_OFF(ma[0]->pindex), PROT_READ, &ma[0]);
VM_OBJECT_ASSERT_WLOCKED(object);
if (error == VM_PAGER_OK) {
KASSERT((object->type == OBJT_DEVICE &&
(ma[0]->oflags & VPO_UNMANAGED) != 0) ||
(object->type == OBJT_MGTDEVICE &&
(ma[0]->oflags & VPO_UNMANAGED) == 0),
("Wrong page type %p %p", ma[0], object));
if (object->type == OBJT_DEVICE) {
TAILQ_INSERT_TAIL(&object->un_pager.devp.devp_pglist,
ma[0], plinks.q);
}
if (rbehind)
*rbehind = 0;
if (rahead)
*rahead = 0;
}
return (error);
}
static int
dev_pager_populate(vm_object_t object, vm_pindex_t pidx, int fault_type,
vm_prot_t max_prot, vm_pindex_t *first, vm_pindex_t *last)
{
VM_OBJECT_ASSERT_WLOCKED(object);
if (object->un_pager.devp.ops->cdev_pg_populate == NULL)
return (VM_PAGER_FAIL);
return (object->un_pager.devp.ops->cdev_pg_populate(object, pidx,
fault_type, max_prot, first, last));
}
static int
old_dev_pager_fault(vm_object_t object, vm_ooffset_t offset, int prot,
vm_page_t *mres)
{
vm_paddr_t paddr;
vm_page_t m_paddr, page;
struct cdev *dev;
struct cdevsw *csw;
struct file *fpop;
struct thread *td;
vm_memattr_t memattr, memattr1;
int ref, ret;
memattr = object->memattr;
VM_OBJECT_WUNLOCK(object);
dev = object->handle;
csw = dev_refthread(dev, &ref);
if (csw == NULL) {
VM_OBJECT_WLOCK(object);
return (VM_PAGER_FAIL);
}
td = curthread;
fpop = td->td_fpop;
td->td_fpop = NULL;
ret = csw->d_mmap(dev, offset, &paddr, prot, &memattr);
td->td_fpop = fpop;
dev_relthread(dev, ref);
if (ret != 0) {
printf(
"WARNING: dev_pager_getpage: map function returns error %d", ret);
VM_OBJECT_WLOCK(object);
return (VM_PAGER_FAIL);
}
/* If "paddr" is a real page, perform a sanity check on "memattr". */
if ((m_paddr = vm_phys_paddr_to_vm_page(paddr)) != NULL &&
(memattr1 = pmap_page_get_memattr(m_paddr)) != memattr) {
/*
* For the /dev/mem d_mmap routine to return the
* correct memattr, pmap_page_get_memattr() needs to
* be called, which we do there.
*/
if ((csw->d_flags & D_MEM) == 0) {
printf("WARNING: Device driver %s has set "
"\"memattr\" inconsistently (drv %u pmap %u).\n",
csw->d_name, memattr, memattr1);
}
memattr = memattr1;
}
if (((*mres)->flags & PG_FICTITIOUS) != 0) {
/*
* If the passed in result page is a fake page, update it with
* the new physical address.
*/
page = *mres;
VM_OBJECT_WLOCK(object);
vm_page_updatefake(page, paddr, memattr);
} else {
/*
* Replace the passed in reqpage page with our own fake page and
* free up all of the original pages.
*/
page = vm_page_getfake(paddr, memattr);
VM_OBJECT_WLOCK(object);
vm_page_replace_checked(page, object, (*mres)->pindex, *mres);
vm_page_lock(*mres);
vm_page_free(*mres);
vm_page_unlock(*mres);
*mres = page;
}
page->valid = VM_PAGE_BITS_ALL;
return (VM_PAGER_OK);
}
static void
dev_pager_putpages(vm_object_t object, vm_page_t *m, int count, int flags,
int *rtvals)
{
panic("dev_pager_putpage called");
}
static boolean_t
dev_pager_haspage(vm_object_t object, vm_pindex_t pindex, int *before,
int *after)
{
if (before != NULL)
*before = 0;
if (after != NULL)
*after = 0;
return (TRUE);
}
static int
old_dev_pager_ctor(void *handle, vm_ooffset_t size, vm_prot_t prot,
vm_ooffset_t foff, struct ucred *cred, u_short *color)
{
struct cdev *dev;
struct cdevsw *csw;
vm_memattr_t dummy;
vm_ooffset_t off;
vm_paddr_t paddr;
unsigned int npages;
int ref;
/*
* Make sure this device can be mapped.
*/
dev = handle;
csw = dev_refthread(dev, &ref);
if (csw == NULL)
return (ENXIO);
/*
* Check that the specified range of the device allows the desired
* protection.
*
* XXX assumes VM_PROT_* == PROT_*
*/
npages = OFF_TO_IDX(size);
paddr = 0; /* Make paddr initialized for the case of size == 0. */
for (off = foff; npages--; off += PAGE_SIZE) {
if (csw->d_mmap(dev, off, &paddr, (int)prot, &dummy) != 0) {
dev_relthread(dev, ref);
return (EINVAL);
}
}
dev_ref(dev);
dev_relthread(dev, ref);
*color = atop(paddr) - OFF_TO_IDX(off - PAGE_SIZE);
return (0);
}
static void
old_dev_pager_dtor(void *handle)
{
dev_rel(handle);
}
Index: head/sys/vm/vm_fault.c
===================================================================
--- head/sys/vm/vm_fault.c (revision 349431)
+++ head/sys/vm/vm_fault.c (revision 349432)
@@ -1,1843 +1,1843 @@
/*-
* SPDX-License-Identifier: (BSD-4-Clause AND MIT-CMU)
*
* Copyright (c) 1991, 1993
* The Regents of the University of California. All rights reserved.
* Copyright (c) 1994 John S. Dyson
* All rights reserved.
* Copyright (c) 1994 David Greenman
* All rights reserved.
*
*
* This code is derived from software contributed to Berkeley by
* The Mach Operating System project at Carnegie-Mellon University.
*
* Redistribution and use in source and binary forms, with or without
* modification, are permitted provided that the following conditions
* are met:
* 1. Redistributions of source code must retain the above copyright
* notice, this list of conditions and the following disclaimer.
* 2. Redistributions in binary form must reproduce the above copyright
* notice, this list of conditions and the following disclaimer in the
* documentation and/or other materials provided with the distribution.
* 3. All advertising materials mentioning features or use of this software
* must display the following acknowledgement:
* This product includes software developed by the University of
* California, Berkeley and its contributors.
* 4. Neither the name of the University nor the names of its contributors
* may be used to endorse or promote products derived from this software
* without specific prior written permission.
*
* THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
* ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
* IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
* ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
* FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
* DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
* OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
* HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
* LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
* OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
* SUCH DAMAGE.
*
* from: @(#)vm_fault.c 8.4 (Berkeley) 1/12/94
*
*
* Copyright (c) 1987, 1990 Carnegie-Mellon University.
* All rights reserved.
*
* Authors: Avadis Tevanian, Jr., Michael Wayne Young
*
* Permission to use, copy, modify and distribute this software and
* its documentation is hereby granted, provided that both the copyright
* notice and this permission notice appear in all copies of the
* software, derivative works or modified versions, and any portions
* thereof, and that both notices appear in supporting documentation.
*
* CARNEGIE MELLON ALLOWS FREE USE OF THIS SOFTWARE IN ITS "AS IS"
* CONDITION. CARNEGIE MELLON DISCLAIMS ANY LIABILITY OF ANY KIND
* FOR ANY DAMAGES WHATSOEVER RESULTING FROM THE USE OF THIS SOFTWARE.
*
* Carnegie Mellon requests users of this software to return to
*
* Software Distribution Coordinator or Software.Distribution@CS.CMU.EDU
* School of Computer Science
* Carnegie Mellon University
* Pittsburgh PA 15213-3890
*
* any improvements or extensions that they make and grant Carnegie the
* rights to redistribute these changes.
*/
/*
* Page fault handling module.
*/
#include
__FBSDID("$FreeBSD$");
#include "opt_ktrace.h"
#include "opt_vm.h"
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#ifdef KTRACE
#include
#endif
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#define PFBAK 4
#define PFFOR 4
#define VM_FAULT_READ_DEFAULT (1 + VM_FAULT_READ_AHEAD_INIT)
#define VM_FAULT_READ_MAX (1 + VM_FAULT_READ_AHEAD_MAX)
#define VM_FAULT_DONTNEED_MIN 1048576
struct faultstate {
vm_page_t m;
vm_object_t object;
vm_pindex_t pindex;
vm_page_t first_m;
vm_object_t first_object;
vm_pindex_t first_pindex;
vm_map_t map;
vm_map_entry_t entry;
int map_generation;
bool lookup_still_valid;
struct vnode *vp;
};
static void vm_fault_dontneed(const struct faultstate *fs, vm_offset_t vaddr,
int ahead);
static void vm_fault_prefault(const struct faultstate *fs, vm_offset_t addra,
int backward, int forward, bool obj_locked);
static inline void
release_page(struct faultstate *fs)
{
vm_page_xunbusy(fs->m);
vm_page_lock(fs->m);
vm_page_deactivate(fs->m);
vm_page_unlock(fs->m);
fs->m = NULL;
}
static inline void
unlock_map(struct faultstate *fs)
{
if (fs->lookup_still_valid) {
vm_map_lookup_done(fs->map, fs->entry);
fs->lookup_still_valid = false;
}
}
static void
unlock_vp(struct faultstate *fs)
{
if (fs->vp != NULL) {
vput(fs->vp);
fs->vp = NULL;
}
}
static void
unlock_and_deallocate(struct faultstate *fs)
{
vm_object_pip_wakeup(fs->object);
VM_OBJECT_WUNLOCK(fs->object);
if (fs->object != fs->first_object) {
VM_OBJECT_WLOCK(fs->first_object);
vm_page_lock(fs->first_m);
vm_page_free(fs->first_m);
vm_page_unlock(fs->first_m);
vm_object_pip_wakeup(fs->first_object);
VM_OBJECT_WUNLOCK(fs->first_object);
fs->first_m = NULL;
}
vm_object_deallocate(fs->first_object);
unlock_map(fs);
unlock_vp(fs);
}
static void
vm_fault_dirty(vm_map_entry_t entry, vm_page_t m, vm_prot_t prot,
vm_prot_t fault_type, int fault_flags, bool set_wd)
{
bool need_dirty;
if (((prot & VM_PROT_WRITE) == 0 &&
(fault_flags & VM_FAULT_DIRTY) == 0) ||
(m->oflags & VPO_UNMANAGED) != 0)
return;
VM_OBJECT_ASSERT_LOCKED(m->object);
need_dirty = ((fault_type & VM_PROT_WRITE) != 0 &&
(fault_flags & VM_FAULT_WIRE) == 0) ||
(fault_flags & VM_FAULT_DIRTY) != 0;
if (set_wd)
vm_object_set_writeable_dirty(m->object);
else
/*
* If two callers of vm_fault_dirty() with set_wd ==
* FALSE, one for the map entry with MAP_ENTRY_NOSYNC
* flag set, other with flag clear, race, it is
* possible for the no-NOSYNC thread to see m->dirty
* != 0 and not clear VPO_NOSYNC. Take vm_page lock
* around manipulation of VPO_NOSYNC and
* vm_page_dirty() call, to avoid the race and keep
* m->oflags consistent.
*/
vm_page_lock(m);
/*
* If this is a NOSYNC mmap we do not want to set VPO_NOSYNC
* if the page is already dirty to prevent data written with
* the expectation of being synced from not being synced.
* Likewise if this entry does not request NOSYNC then make
* sure the page isn't marked NOSYNC. Applications sharing
* data should use the same flags to avoid ping ponging.
*/
if ((entry->eflags & MAP_ENTRY_NOSYNC) != 0) {
if (m->dirty == 0) {
m->oflags |= VPO_NOSYNC;
}
} else {
m->oflags &= ~VPO_NOSYNC;
}
/*
* If the fault is a write, we know that this page is being
* written NOW so dirty it explicitly to save on
* pmap_is_modified() calls later.
*
* Also, since the page is now dirty, we can possibly tell
* the pager to release any swap backing the page. Calling
* the pager requires a write lock on the object.
*/
if (need_dirty)
vm_page_dirty(m);
if (!set_wd)
vm_page_unlock(m);
else if (need_dirty)
vm_pager_page_unswapped(m);
}
static void
vm_fault_fill_hold(vm_page_t *m_hold, vm_page_t m)
{
if (m_hold != NULL) {
*m_hold = m;
vm_page_lock(m);
vm_page_hold(m);
vm_page_unlock(m);
}
}
/*
* Unlocks fs.first_object and fs.map on success.
*/
static int
vm_fault_soft_fast(struct faultstate *fs, vm_offset_t vaddr, vm_prot_t prot,
int fault_type, int fault_flags, boolean_t wired, vm_page_t *m_hold)
{
vm_page_t m, m_map;
#if (defined(__aarch64__) || defined(__amd64__) || (defined(__arm__) && \
__ARM_ARCH >= 6) || defined(__i386__) || defined(__riscv)) && \
VM_NRESERVLEVEL > 0
vm_page_t m_super;
int flags;
#endif
int psind, rv;
MPASS(fs->vp == NULL);
m = vm_page_lookup(fs->first_object, fs->first_pindex);
/* A busy page can be mapped for read|execute access. */
if (m == NULL || ((prot & VM_PROT_WRITE) != 0 &&
vm_page_busied(m)) || m->valid != VM_PAGE_BITS_ALL)
return (KERN_FAILURE);
m_map = m;
psind = 0;
#if (defined(__aarch64__) || defined(__amd64__) || (defined(__arm__) && \
__ARM_ARCH >= 6) || defined(__i386__) || defined(__riscv)) && \
VM_NRESERVLEVEL > 0
if ((m->flags & PG_FICTITIOUS) == 0 &&
(m_super = vm_reserv_to_superpage(m)) != NULL &&
rounddown2(vaddr, pagesizes[m_super->psind]) >= fs->entry->start &&
roundup2(vaddr + 1, pagesizes[m_super->psind]) <= fs->entry->end &&
(vaddr & (pagesizes[m_super->psind] - 1)) == (VM_PAGE_TO_PHYS(m) &
(pagesizes[m_super->psind] - 1)) && !wired &&
pmap_ps_enabled(fs->map->pmap)) {
flags = PS_ALL_VALID;
if ((prot & VM_PROT_WRITE) != 0) {
/*
* Create a superpage mapping allowing write access
* only if none of the constituent pages are busy and
* all of them are already dirty (except possibly for
* the page that was faulted on).
*/
flags |= PS_NONE_BUSY;
if ((fs->first_object->flags & OBJ_UNMANAGED) == 0)
flags |= PS_ALL_DIRTY;
}
if (vm_page_ps_test(m_super, flags, m)) {
m_map = m_super;
psind = m_super->psind;
vaddr = rounddown2(vaddr, pagesizes[psind]);
/* Preset the modified bit for dirty superpages. */
if ((flags & PS_ALL_DIRTY) != 0)
fault_type |= VM_PROT_WRITE;
}
}
#endif
rv = pmap_enter(fs->map->pmap, vaddr, m_map, prot, fault_type |
PMAP_ENTER_NOSLEEP | (wired ? PMAP_ENTER_WIRED : 0), psind);
if (rv != KERN_SUCCESS)
return (rv);
vm_fault_fill_hold(m_hold, m);
vm_fault_dirty(fs->entry, m, prot, fault_type, fault_flags, false);
if (psind == 0 && !wired)
vm_fault_prefault(fs, vaddr, PFBAK, PFFOR, true);
VM_OBJECT_RUNLOCK(fs->first_object);
vm_map_lookup_done(fs->map, fs->entry);
curthread->td_ru.ru_minflt++;
return (KERN_SUCCESS);
}
static void
vm_fault_restore_map_lock(struct faultstate *fs)
{
VM_OBJECT_ASSERT_WLOCKED(fs->first_object);
MPASS(fs->first_object->paging_in_progress > 0);
if (!vm_map_trylock_read(fs->map)) {
VM_OBJECT_WUNLOCK(fs->first_object);
vm_map_lock_read(fs->map);
VM_OBJECT_WLOCK(fs->first_object);
}
fs->lookup_still_valid = true;
}
static void
vm_fault_populate_check_page(vm_page_t m)
{
/*
* Check each page to ensure that the pager is obeying the
* interface: the page must be installed in the object, fully
* valid, and exclusively busied.
*/
MPASS(m != NULL);
MPASS(m->valid == VM_PAGE_BITS_ALL);
MPASS(vm_page_xbusied(m));
}
static void
vm_fault_populate_cleanup(vm_object_t object, vm_pindex_t first,
vm_pindex_t last)
{
vm_page_t m;
vm_pindex_t pidx;
VM_OBJECT_ASSERT_WLOCKED(object);
MPASS(first <= last);
for (pidx = first, m = vm_page_lookup(object, pidx);
pidx <= last; pidx++, m = vm_page_next(m)) {
vm_fault_populate_check_page(m);
vm_page_lock(m);
vm_page_deactivate(m);
vm_page_unlock(m);
vm_page_xunbusy(m);
}
}
static int
vm_fault_populate(struct faultstate *fs, vm_prot_t prot, int fault_type,
int fault_flags, boolean_t wired, vm_page_t *m_hold)
{
struct mtx *m_mtx;
vm_offset_t vaddr;
vm_page_t m;
vm_pindex_t map_first, map_last, pager_first, pager_last, pidx;
int i, npages, psind, rv;
MPASS(fs->object == fs->first_object);
VM_OBJECT_ASSERT_WLOCKED(fs->first_object);
MPASS(fs->first_object->paging_in_progress > 0);
MPASS(fs->first_object->backing_object == NULL);
MPASS(fs->lookup_still_valid);
pager_first = OFF_TO_IDX(fs->entry->offset);
pager_last = pager_first + atop(fs->entry->end - fs->entry->start) - 1;
unlock_map(fs);
unlock_vp(fs);
/*
* Call the pager (driver) populate() method.
*
* There is no guarantee that the method will be called again
* if the current fault is for read, and a future fault is
* for write. Report the entry's maximum allowed protection
* to the driver.
*/
rv = vm_pager_populate(fs->first_object, fs->first_pindex,
fault_type, fs->entry->max_protection, &pager_first, &pager_last);
VM_OBJECT_ASSERT_WLOCKED(fs->first_object);
if (rv == VM_PAGER_BAD) {
/*
* VM_PAGER_BAD is the backdoor for a pager to request
* normal fault handling.
*/
vm_fault_restore_map_lock(fs);
if (fs->map->timestamp != fs->map_generation)
return (KERN_RESOURCE_SHORTAGE); /* RetryFault */
return (KERN_NOT_RECEIVER);
}
if (rv != VM_PAGER_OK)
return (KERN_FAILURE); /* AKA SIGSEGV */
/* Ensure that the driver is obeying the interface. */
MPASS(pager_first <= pager_last);
MPASS(fs->first_pindex <= pager_last);
MPASS(fs->first_pindex >= pager_first);
MPASS(pager_last < fs->first_object->size);
vm_fault_restore_map_lock(fs);
if (fs->map->timestamp != fs->map_generation) {
vm_fault_populate_cleanup(fs->first_object, pager_first,
pager_last);
return (KERN_RESOURCE_SHORTAGE); /* RetryFault */
}
/*
* The map is unchanged after our last unlock. Process the fault.
*
* The range [pager_first, pager_last] that is given to the
* pager is only a hint. The pager may populate any range
* within the object that includes the requested page index.
* In case the pager expanded the range, clip it to fit into
* the map entry.
*/
map_first = OFF_TO_IDX(fs->entry->offset);
if (map_first > pager_first) {
vm_fault_populate_cleanup(fs->first_object, pager_first,
map_first - 1);
pager_first = map_first;
}
map_last = map_first + atop(fs->entry->end - fs->entry->start) - 1;
if (map_last < pager_last) {
vm_fault_populate_cleanup(fs->first_object, map_last + 1,
pager_last);
pager_last = map_last;
}
for (pidx = pager_first, m = vm_page_lookup(fs->first_object, pidx);
pidx <= pager_last;
pidx += npages, m = vm_page_next(&m[npages - 1])) {
vaddr = fs->entry->start + IDX_TO_OFF(pidx) - fs->entry->offset;
#if defined(__aarch64__) || defined(__amd64__) || (defined(__arm__) && \
__ARM_ARCH >= 6) || defined(__i386__) || defined(__riscv)
psind = m->psind;
if (psind > 0 && ((vaddr & (pagesizes[psind] - 1)) != 0 ||
pidx + OFF_TO_IDX(pagesizes[psind]) - 1 > pager_last ||
!pmap_ps_enabled(fs->map->pmap) || wired))
psind = 0;
#else
psind = 0;
#endif
npages = atop(pagesizes[psind]);
for (i = 0; i < npages; i++) {
vm_fault_populate_check_page(&m[i]);
vm_fault_dirty(fs->entry, &m[i], prot, fault_type,
fault_flags, true);
}
VM_OBJECT_WUNLOCK(fs->first_object);
rv = pmap_enter(fs->map->pmap, vaddr, m, prot, fault_type |
(wired ? PMAP_ENTER_WIRED : 0), psind);
#if defined(__amd64__)
if (psind > 0 && rv == KERN_FAILURE) {
for (i = 0; i < npages; i++) {
rv = pmap_enter(fs->map->pmap, vaddr + ptoa(i),
&m[i], prot, fault_type |
(wired ? PMAP_ENTER_WIRED : 0), 0);
MPASS(rv == KERN_SUCCESS);
}
}
#else
MPASS(rv == KERN_SUCCESS);
#endif
VM_OBJECT_WLOCK(fs->first_object);
m_mtx = NULL;
for (i = 0; i < npages; i++) {
vm_page_change_lock(&m[i], &m_mtx);
if ((fault_flags & VM_FAULT_WIRE) != 0)
vm_page_wire(&m[i]);
else
vm_page_activate(&m[i]);
if (m_hold != NULL && m[i].pindex == fs->first_pindex) {
*m_hold = &m[i];
vm_page_hold(&m[i]);
}
vm_page_xunbusy_maybelocked(&m[i]);
}
if (m_mtx != NULL)
mtx_unlock(m_mtx);
}
curthread->td_ru.ru_majflt++;
return (KERN_SUCCESS);
}
/*
* vm_fault:
*
* Handle a page fault occurring at the given address,
* requiring the given permissions, in the map specified.
* If successful, the page is inserted into the
* associated physical map.
*
* NOTE: the given address should be truncated to the
* proper page address.
*
* KERN_SUCCESS is returned if the page fault is handled; otherwise,
* a standard error specifying why the fault is fatal is returned.
*
* The map in question must be referenced, and remains so.
* Caller may hold no locks.
*/
int
vm_fault(vm_map_t map, vm_offset_t vaddr, vm_prot_t fault_type,
int fault_flags)
{
struct thread *td;
int result;
td = curthread;
if ((td->td_pflags & TDP_NOFAULTING) != 0)
return (KERN_PROTECTION_FAILURE);
#ifdef KTRACE
if (map != kernel_map && KTRPOINT(td, KTR_FAULT))
ktrfault(vaddr, fault_type);
#endif
result = vm_fault_hold(map, trunc_page(vaddr), fault_type, fault_flags,
NULL);
#ifdef KTRACE
if (map != kernel_map && KTRPOINT(td, KTR_FAULTEND))
ktrfaultend(result);
#endif
return (result);
}
int
vm_fault_hold(vm_map_t map, vm_offset_t vaddr, vm_prot_t fault_type,
int fault_flags, vm_page_t *m_hold)
{
struct faultstate fs;
struct vnode *vp;
struct domainset *dset;
vm_object_t next_object, retry_object;
vm_offset_t e_end, e_start;
vm_pindex_t retry_pindex;
vm_prot_t prot, retry_prot;
int ahead, alloc_req, behind, cluster_offset, error, era, faultcount;
int locked, nera, result, rv;
u_char behavior;
boolean_t wired; /* Passed by reference. */
bool dead, hardfault, is_first_object_locked;
VM_CNT_INC(v_vm_faults);
fs.vp = NULL;
faultcount = 0;
nera = -1;
hardfault = false;
RetryFault:;
/*
* Find the backing store object and offset into it to begin the
* search.
*/
fs.map = map;
result = vm_map_lookup(&fs.map, vaddr, fault_type |
VM_PROT_FAULT_LOOKUP, &fs.entry, &fs.first_object,
&fs.first_pindex, &prot, &wired);
if (result != KERN_SUCCESS) {
unlock_vp(&fs);
return (result);
}
fs.map_generation = fs.map->timestamp;
if (fs.entry->eflags & MAP_ENTRY_NOFAULT) {
panic("%s: fault on nofault entry, addr: %#lx",
__func__, (u_long)vaddr);
}
if (fs.entry->eflags & MAP_ENTRY_IN_TRANSITION &&
fs.entry->wiring_thread != curthread) {
vm_map_unlock_read(fs.map);
vm_map_lock(fs.map);
if (vm_map_lookup_entry(fs.map, vaddr, &fs.entry) &&
(fs.entry->eflags & MAP_ENTRY_IN_TRANSITION)) {
unlock_vp(&fs);
fs.entry->eflags |= MAP_ENTRY_NEEDS_WAKEUP;
vm_map_unlock_and_wait(fs.map, 0);
} else
vm_map_unlock(fs.map);
goto RetryFault;
}
MPASS((fs.entry->eflags & MAP_ENTRY_GUARD) == 0);
if (wired)
fault_type = prot | (fault_type & VM_PROT_COPY);
else
KASSERT((fault_flags & VM_FAULT_WIRE) == 0,
("!wired && VM_FAULT_WIRE"));
/*
* Try to avoid lock contention on the top-level object through
* special-case handling of some types of page faults, specifically,
* those that are both (1) mapping an existing page from the top-
* level object and (2) not having to mark that object as containing
* dirty pages. Under these conditions, a read lock on the top-level
* object suffices, allowing multiple page faults of a similar type to
* run in parallel on the same top-level object.
*/
if (fs.vp == NULL /* avoid locked vnode leak */ &&
(fault_flags & (VM_FAULT_WIRE | VM_FAULT_DIRTY)) == 0 &&
/* avoid calling vm_object_set_writeable_dirty() */
((prot & VM_PROT_WRITE) == 0 ||
(fs.first_object->type != OBJT_VNODE &&
(fs.first_object->flags & OBJ_TMPFS_NODE) == 0) ||
(fs.first_object->flags & OBJ_MIGHTBEDIRTY) != 0)) {
VM_OBJECT_RLOCK(fs.first_object);
if ((prot & VM_PROT_WRITE) == 0 ||
(fs.first_object->type != OBJT_VNODE &&
(fs.first_object->flags & OBJ_TMPFS_NODE) == 0) ||
(fs.first_object->flags & OBJ_MIGHTBEDIRTY) != 0) {
rv = vm_fault_soft_fast(&fs, vaddr, prot, fault_type,
fault_flags, wired, m_hold);
if (rv == KERN_SUCCESS)
return (rv);
}
if (!VM_OBJECT_TRYUPGRADE(fs.first_object)) {
VM_OBJECT_RUNLOCK(fs.first_object);
VM_OBJECT_WLOCK(fs.first_object);
}
} else {
VM_OBJECT_WLOCK(fs.first_object);
}
/*
* Make a reference to this object to prevent its disposal while we
* are messing with it. Once we have the reference, the map is free
* to be diddled. Since objects reference their shadows (and copies),
* they will stay around as well.
*
* Bump the paging-in-progress count to prevent size changes (e.g.
* truncation operations) during I/O.
*/
vm_object_reference_locked(fs.first_object);
vm_object_pip_add(fs.first_object, 1);
fs.lookup_still_valid = true;
fs.first_m = NULL;
/*
* Search for the page at object/offset.
*/
fs.object = fs.first_object;
fs.pindex = fs.first_pindex;
while (TRUE) {
/*
* If the object is marked for imminent termination,
* we retry here, since the collapse pass has raced
* with us. Otherwise, if we see a terminally dead
* object, return failure.
*/
if ((fs.object->flags & OBJ_DEAD) != 0) {
dead = fs.object->type == OBJT_DEAD;
unlock_and_deallocate(&fs);
if (dead)
return (KERN_PROTECTION_FAILURE);
pause("vmf_de", 1);
goto RetryFault;
}
/*
* See if page is resident
*/
fs.m = vm_page_lookup(fs.object, fs.pindex);
if (fs.m != NULL) {
/*
* Wait/Retry if the page is busy. We have to do this
* if the page is either exclusive or shared busy
* because the vm_pager may be using read busy for
* pageouts (and even pageins if it is the vnode
* pager), and we could end up trying to pagein and
* pageout the same page simultaneously.
*
* We can theoretically allow the busy case on a read
* fault if the page is marked valid, but since such
* pages are typically already pmap'd, putting that
* special case in might be more effort than it is
* worth. We cannot under any circumstances mess
* around with a shared busied page except, perhaps,
* to pmap it.
*/
if (vm_page_busied(fs.m)) {
/*
* Reference the page before unlocking and
* sleeping so that the page daemon is less
* likely to reclaim it.
*/
vm_page_aflag_set(fs.m, PGA_REFERENCED);
if (fs.object != fs.first_object) {
if (!VM_OBJECT_TRYWLOCK(
fs.first_object)) {
VM_OBJECT_WUNLOCK(fs.object);
VM_OBJECT_WLOCK(fs.first_object);
VM_OBJECT_WLOCK(fs.object);
}
vm_page_lock(fs.first_m);
vm_page_free(fs.first_m);
vm_page_unlock(fs.first_m);
vm_object_pip_wakeup(fs.first_object);
VM_OBJECT_WUNLOCK(fs.first_object);
fs.first_m = NULL;
}
unlock_map(&fs);
if (fs.m == vm_page_lookup(fs.object,
fs.pindex)) {
vm_page_sleep_if_busy(fs.m, "vmpfw");
}
vm_object_pip_wakeup(fs.object);
VM_OBJECT_WUNLOCK(fs.object);
VM_CNT_INC(v_intrans);
vm_object_deallocate(fs.first_object);
goto RetryFault;
}
/*
* Mark page busy for other processes, and the
* pagedaemon. If it still isn't completely valid
* (readable), jump to readrest, else break out (we
* found the page).
*/
vm_page_xbusy(fs.m);
if (fs.m->valid != VM_PAGE_BITS_ALL)
goto readrest;
break; /* break to PAGE HAS BEEN FOUND */
}
KASSERT(fs.m == NULL, ("fs.m should be NULL, not %p", fs.m));
/*
* Page is not resident. If the pager might contain the page
* or this is the beginning of the search, allocate a new
* page. (Default objects are zero-fill, so there is no real
* pager for them.)
*/
if (fs.object->type != OBJT_DEFAULT ||
fs.object == fs.first_object) {
if (fs.pindex >= fs.object->size) {
unlock_and_deallocate(&fs);
return (KERN_PROTECTION_FAILURE);
}
if (fs.object == fs.first_object &&
(fs.first_object->flags & OBJ_POPULATE) != 0 &&
fs.first_object->shadow_count == 0) {
rv = vm_fault_populate(&fs, prot, fault_type,
fault_flags, wired, m_hold);
switch (rv) {
case KERN_SUCCESS:
case KERN_FAILURE:
unlock_and_deallocate(&fs);
return (rv);
case KERN_RESOURCE_SHORTAGE:
unlock_and_deallocate(&fs);
goto RetryFault;
case KERN_NOT_RECEIVER:
/*
* Pager's populate() method
* returned VM_PAGER_BAD.
*/
break;
default:
panic("inconsistent return codes");
}
}
/*
* Allocate a new page for this object/offset pair.
*
* An unlocked read of p_flag is harmless. At worst,
* P_KILLED might not be observed here and the
* allocation can fail, causing a restart and a fresh
* read of p_flag.
*/
dset = fs.object->domain.dr_policy;
if (dset == NULL)
dset = curthread->td_domain.dr_policy;
if (!vm_page_count_severe_set(&dset->ds_mask) ||
P_KILLED(curproc)) {
#if VM_NRESERVLEVEL > 0
vm_object_color(fs.object, atop(vaddr) -
fs.pindex);
#endif
alloc_req = P_KILLED(curproc) ?
VM_ALLOC_SYSTEM : VM_ALLOC_NORMAL;
if (fs.object->type != OBJT_VNODE &&
fs.object->backing_object == NULL)
alloc_req |= VM_ALLOC_ZERO;
fs.m = vm_page_alloc(fs.object, fs.pindex,
alloc_req);
}
if (fs.m == NULL) {
unlock_and_deallocate(&fs);
vm_waitpfault(dset);
goto RetryFault;
}
}
readrest:
/*
* At this point, we have either allocated a new page or found
* an existing page that is only partially valid.
*
* We hold a reference on the current object and the page is
* exclusive busied.
*/
/*
* If the pager for the current object might have the page,
* then determine the number of additional pages to read and
* potentially reprioritize previously read pages for earlier
* reclamation. These operations should only be performed
* once per page fault. Even if the current pager doesn't
* have the page, the number of additional pages to read will
* apply to subsequent objects in the shadow chain.
*/
if (fs.object->type != OBJT_DEFAULT && nera == -1 &&
!P_KILLED(curproc)) {
KASSERT(fs.lookup_still_valid, ("map unlocked"));
era = fs.entry->read_ahead;
behavior = vm_map_entry_behavior(fs.entry);
if (behavior == MAP_ENTRY_BEHAV_RANDOM) {
nera = 0;
} else if (behavior == MAP_ENTRY_BEHAV_SEQUENTIAL) {
nera = VM_FAULT_READ_AHEAD_MAX;
if (vaddr == fs.entry->next_read)
vm_fault_dontneed(&fs, vaddr, nera);
} else if (vaddr == fs.entry->next_read) {
/*
* This is a sequential fault. Arithmetically
* increase the requested number of pages in
* the read-ahead window. The requested
* number of pages is "# of sequential faults
* x (read ahead min + 1) + read ahead min"
*/
nera = VM_FAULT_READ_AHEAD_MIN;
if (era > 0) {
nera += era + 1;
if (nera > VM_FAULT_READ_AHEAD_MAX)
nera = VM_FAULT_READ_AHEAD_MAX;
}
if (era == VM_FAULT_READ_AHEAD_MAX)
vm_fault_dontneed(&fs, vaddr, nera);
} else {
/*
* This is a non-sequential fault.
*/
nera = 0;
}
if (era != nera) {
/*
* A read lock on the map suffices to update
* the read ahead count safely.
*/
fs.entry->read_ahead = nera;
}
/*
* Prepare for unlocking the map. Save the map
* entry's start and end addresses, which are used to
* optimize the size of the pager operation below.
* Even if the map entry's addresses change after
* unlocking the map, using the saved addresses is
* safe.
*/
e_start = fs.entry->start;
e_end = fs.entry->end;
}
/*
* Call the pager to retrieve the page if there is a chance
* that the pager has it, and potentially retrieve additional
* pages at the same time.
*/
if (fs.object->type != OBJT_DEFAULT) {
/*
* Release the map lock before locking the vnode or
* sleeping in the pager. (If the current object has
* a shadow, then an earlier iteration of this loop
* may have already unlocked the map.)
*/
unlock_map(&fs);
if (fs.object->type == OBJT_VNODE &&
(vp = fs.object->handle) != fs.vp) {
/*
* Perform an unlock in case the desired vnode
* changed while the map was unlocked during a
* retry.
*/
unlock_vp(&fs);
locked = VOP_ISLOCKED(vp);
if (locked != LK_EXCLUSIVE)
locked = LK_SHARED;
/*
* We must not sleep acquiring the vnode lock
* while we have the page exclusive busied or
* the object's paging-in-progress count
* incremented. Otherwise, we could deadlock.
*/
error = vget(vp, locked | LK_CANRECURSE |
LK_NOWAIT, curthread);
if (error != 0) {
vhold(vp);
release_page(&fs);
unlock_and_deallocate(&fs);
error = vget(vp, locked | LK_RETRY |
LK_CANRECURSE, curthread);
vdrop(vp);
fs.vp = vp;
KASSERT(error == 0,
("vm_fault: vget failed"));
goto RetryFault;
}
fs.vp = vp;
}
KASSERT(fs.vp == NULL || !fs.map->system_map,
("vm_fault: vnode-backed object mapped by system map"));
/*
* Page in the requested page and hint to the pager
* that it may bring in surrounding pages.
*/
if (nera == -1 || behavior == MAP_ENTRY_BEHAV_RANDOM ||
P_KILLED(curproc)) {
behind = 0;
ahead = 0;
} else {
/* Is this a sequential fault? */
if (nera > 0) {
behind = 0;
ahead = nera;
} else {
/*
* Request a cluster of pages that is
* aligned to a VM_FAULT_READ_DEFAULT
* page offset boundary within the
* object. Alignment to a page offset
* boundary is more likely to coincide
* with the underlying file system
* block than alignment to a virtual
* address boundary.
*/
cluster_offset = fs.pindex %
VM_FAULT_READ_DEFAULT;
behind = ulmin(cluster_offset,
atop(vaddr - e_start));
ahead = VM_FAULT_READ_DEFAULT - 1 -
cluster_offset;
}
ahead = ulmin(ahead, atop(e_end - vaddr) - 1);
}
rv = vm_pager_get_pages(fs.object, &fs.m, 1,
&behind, &ahead);
if (rv == VM_PAGER_OK) {
faultcount = behind + 1 + ahead;
hardfault = true;
break; /* break to PAGE HAS BEEN FOUND */
}
if (rv == VM_PAGER_ERROR)
printf("vm_fault: pager read error, pid %d (%s)\n",
curproc->p_pid, curproc->p_comm);
/*
* If an I/O error occurred or the requested page was
* outside the range of the pager, clean up and return
* an error.
*/
if (rv == VM_PAGER_ERROR || rv == VM_PAGER_BAD) {
vm_page_lock(fs.m);
if (!vm_page_wired(fs.m))
vm_page_free(fs.m);
else
vm_page_xunbusy_maybelocked(fs.m);
vm_page_unlock(fs.m);
fs.m = NULL;
unlock_and_deallocate(&fs);
return (rv == VM_PAGER_ERROR ? KERN_FAILURE :
KERN_PROTECTION_FAILURE);
}
/*
* The requested page does not exist at this object/
* offset. Remove the invalid page from the object,
* waking up anyone waiting for it, and continue on to
* the next object. However, if this is the top-level
* object, we must leave the busy page in place to
* prevent another process from rushing past us, and
* inserting the page in that object at the same time
* that we are.
*/
if (fs.object != fs.first_object) {
vm_page_lock(fs.m);
if (!vm_page_wired(fs.m))
vm_page_free(fs.m);
else
vm_page_xunbusy_maybelocked(fs.m);
vm_page_unlock(fs.m);
fs.m = NULL;
}
}
/*
* We get here if the object has a default pager (or unwiring)
* or the pager doesn't have the page.
*/
if (fs.object == fs.first_object)
fs.first_m = fs.m;
/*
* Move on to the next object. Lock the next object before
* unlocking the current one.
*/
next_object = fs.object->backing_object;
if (next_object == NULL) {
/*
* If there's no object left, fill the page in the top
* object with zeros.
*/
if (fs.object != fs.first_object) {
vm_object_pip_wakeup(fs.object);
VM_OBJECT_WUNLOCK(fs.object);
fs.object = fs.first_object;
fs.pindex = fs.first_pindex;
fs.m = fs.first_m;
VM_OBJECT_WLOCK(fs.object);
}
fs.first_m = NULL;
/*
* Zero the page if necessary and mark it valid.
*/
if ((fs.m->flags & PG_ZERO) == 0) {
pmap_zero_page(fs.m);
} else {
VM_CNT_INC(v_ozfod);
}
VM_CNT_INC(v_zfod);
fs.m->valid = VM_PAGE_BITS_ALL;
/* Don't try to prefault neighboring pages. */
faultcount = 1;
break; /* break to PAGE HAS BEEN FOUND */
} else {
KASSERT(fs.object != next_object,
("object loop %p", next_object));
VM_OBJECT_WLOCK(next_object);
vm_object_pip_add(next_object, 1);
if (fs.object != fs.first_object)
vm_object_pip_wakeup(fs.object);
fs.pindex +=
OFF_TO_IDX(fs.object->backing_object_offset);
VM_OBJECT_WUNLOCK(fs.object);
fs.object = next_object;
}
}
vm_page_assert_xbusied(fs.m);
/*
* PAGE HAS BEEN FOUND. [Loop invariant still holds -- the object lock
* is held.]
*/
/*
* If the page is being written, but isn't already owned by the
* top-level object, we have to copy it into a new page owned by the
* top-level object.
*/
if (fs.object != fs.first_object) {
/*
* We only really need to copy if we want to write it.
*/
if ((fault_type & (VM_PROT_COPY | VM_PROT_WRITE)) != 0) {
/*
* This allows pages to be virtually copied from a
* backing_object into the first_object, where the
* backing object has no other refs to it, and cannot
* gain any more refs. Instead of a bcopy, we just
* move the page from the backing object to the
* first object. Note that we must mark the page
* dirty in the first object so that it will go out
* to swap when needed.
*/
is_first_object_locked = false;
if (
/*
* Only one shadow object
*/
(fs.object->shadow_count == 1) &&
/*
* No COW refs, except us
*/
(fs.object->ref_count == 1) &&
/*
* No one else can look this object up
*/
(fs.object->handle == NULL) &&
/*
* No other ways to look the object up
*/
((fs.object->type == OBJT_DEFAULT) ||
(fs.object->type == OBJT_SWAP)) &&
(is_first_object_locked = VM_OBJECT_TRYWLOCK(fs.first_object)) &&
/*
* We don't chase down the shadow chain
*/
fs.object == fs.first_object->backing_object) {
vm_page_lock(fs.m);
vm_page_dequeue(fs.m);
- vm_page_remove(fs.m);
+ (void)vm_page_remove(fs.m);
vm_page_unlock(fs.m);
vm_page_lock(fs.first_m);
vm_page_replace_checked(fs.m, fs.first_object,
fs.first_pindex, fs.first_m);
vm_page_free(fs.first_m);
vm_page_unlock(fs.first_m);
vm_page_dirty(fs.m);
#if VM_NRESERVLEVEL > 0
/*
* Rename the reservation.
*/
vm_reserv_rename(fs.m, fs.first_object,
fs.object, OFF_TO_IDX(
fs.first_object->backing_object_offset));
#endif
/*
* Removing the page from the backing object
* unbusied it.
*/
vm_page_xbusy(fs.m);
fs.first_m = fs.m;
fs.m = NULL;
VM_CNT_INC(v_cow_optim);
} else {
/*
* Oh, well, let's copy it.
*/
pmap_copy_page(fs.m, fs.first_m);
fs.first_m->valid = VM_PAGE_BITS_ALL;
if (wired && (fault_flags &
VM_FAULT_WIRE) == 0) {
vm_page_lock(fs.first_m);
vm_page_wire(fs.first_m);
vm_page_unlock(fs.first_m);
vm_page_lock(fs.m);
vm_page_unwire(fs.m, PQ_INACTIVE);
vm_page_unlock(fs.m);
}
/*
* We no longer need the old page or object.
*/
release_page(&fs);
}
/*
* fs.object != fs.first_object due to above
* conditional
*/
vm_object_pip_wakeup(fs.object);
VM_OBJECT_WUNLOCK(fs.object);
/*
* We only try to prefault read-only mappings to the
* neighboring pages when this copy-on-write fault is
* a hard fault. In other cases, trying to prefault
* is typically wasted effort.
*/
if (faultcount == 0)
faultcount = 1;
/*
* Only use the new page below...
*/
fs.object = fs.first_object;
fs.pindex = fs.first_pindex;
fs.m = fs.first_m;
if (!is_first_object_locked)
VM_OBJECT_WLOCK(fs.object);
VM_CNT_INC(v_cow_faults);
curthread->td_cow++;
} else {
prot &= ~VM_PROT_WRITE;
}
}
/*
* We must verify that the maps have not changed since our last
* lookup.
*/
if (!fs.lookup_still_valid) {
if (!vm_map_trylock_read(fs.map)) {
release_page(&fs);
unlock_and_deallocate(&fs);
goto RetryFault;
}
fs.lookup_still_valid = true;
if (fs.map->timestamp != fs.map_generation) {
result = vm_map_lookup_locked(&fs.map, vaddr, fault_type,
&fs.entry, &retry_object, &retry_pindex, &retry_prot, &wired);
/*
* If we don't need the page any longer, put it on the inactive
* list (the easiest thing to do here). If no one needs it,
* pageout will grab it eventually.
*/
if (result != KERN_SUCCESS) {
release_page(&fs);
unlock_and_deallocate(&fs);
/*
* If retry of map lookup would have blocked then
* retry fault from start.
*/
if (result == KERN_FAILURE)
goto RetryFault;
return (result);
}
if ((retry_object != fs.first_object) ||
(retry_pindex != fs.first_pindex)) {
release_page(&fs);
unlock_and_deallocate(&fs);
goto RetryFault;
}
/*
* Check whether the protection has changed or the object has
* been copied while we left the map unlocked. Changing from
* read to write permission is OK - we leave the page
* write-protected, and catch the write fault. Changing from
* write to read permission means that we can't mark the page
* write-enabled after all.
*/
prot &= retry_prot;
fault_type &= retry_prot;
if (prot == 0) {
release_page(&fs);
unlock_and_deallocate(&fs);
goto RetryFault;
}
/* Reassert because wired may have changed. */
KASSERT(wired || (fault_flags & VM_FAULT_WIRE) == 0,
("!wired && VM_FAULT_WIRE"));
}
}
/*
* If the page was filled by a pager, save the virtual address that
* should be faulted on next under a sequential access pattern to the
* map entry. A read lock on the map suffices to update this address
* safely.
*/
if (hardfault)
fs.entry->next_read = vaddr + ptoa(ahead) + PAGE_SIZE;
vm_fault_dirty(fs.entry, fs.m, prot, fault_type, fault_flags, true);
vm_page_assert_xbusied(fs.m);
/*
* Page must be completely valid or it is not fit to
* map into user space. vm_pager_get_pages() ensures this.
*/
KASSERT(fs.m->valid == VM_PAGE_BITS_ALL,
("vm_fault: page %p partially invalid", fs.m));
VM_OBJECT_WUNLOCK(fs.object);
/*
* Put this page into the physical map. We had to do the unlock above
* because pmap_enter() may sleep. We don't put the page
* back on the active queue until later so that the pageout daemon
* won't find it (yet).
*/
pmap_enter(fs.map->pmap, vaddr, fs.m, prot,
fault_type | (wired ? PMAP_ENTER_WIRED : 0), 0);
if (faultcount != 1 && (fault_flags & VM_FAULT_WIRE) == 0 &&
wired == 0)
vm_fault_prefault(&fs, vaddr,
faultcount > 0 ? behind : PFBAK,
faultcount > 0 ? ahead : PFFOR, false);
VM_OBJECT_WLOCK(fs.object);
vm_page_lock(fs.m);
/*
* If the page is not wired down, then put it where the pageout daemon
* can find it.
*/
if ((fault_flags & VM_FAULT_WIRE) != 0)
vm_page_wire(fs.m);
else
vm_page_activate(fs.m);
if (m_hold != NULL) {
*m_hold = fs.m;
vm_page_hold(fs.m);
}
vm_page_unlock(fs.m);
vm_page_xunbusy(fs.m);
/*
* Unlock everything, and return
*/
unlock_and_deallocate(&fs);
if (hardfault) {
VM_CNT_INC(v_io_faults);
curthread->td_ru.ru_majflt++;
#ifdef RACCT
if (racct_enable && fs.object->type == OBJT_VNODE) {
PROC_LOCK(curproc);
if ((fault_type & (VM_PROT_COPY | VM_PROT_WRITE)) != 0) {
racct_add_force(curproc, RACCT_WRITEBPS,
PAGE_SIZE + behind * PAGE_SIZE);
racct_add_force(curproc, RACCT_WRITEIOPS, 1);
} else {
racct_add_force(curproc, RACCT_READBPS,
PAGE_SIZE + ahead * PAGE_SIZE);
racct_add_force(curproc, RACCT_READIOPS, 1);
}
PROC_UNLOCK(curproc);
}
#endif
} else
curthread->td_ru.ru_minflt++;
return (KERN_SUCCESS);
}
/*
* Speed up the reclamation of pages that precede the faulting pindex within
* the first object of the shadow chain. Essentially, perform the equivalent
* of madvise(..., MADV_DONTNEED) on a large cluster of pages that precedes
* the faulting pindex by the cluster size when the pages read by vm_fault()
* cross a cluster-size boundary. The cluster size is the greater of the
* smallest superpage size and VM_FAULT_DONTNEED_MIN.
*
* When "fs->first_object" is a shadow object, the pages in the backing object
* that precede the faulting pindex are deactivated by vm_fault(). So, this
* function must only be concerned with pages in the first object.
*/
static void
vm_fault_dontneed(const struct faultstate *fs, vm_offset_t vaddr, int ahead)
{
vm_map_entry_t entry;
vm_object_t first_object, object;
vm_offset_t end, start;
vm_page_t m, m_next;
vm_pindex_t pend, pstart;
vm_size_t size;
object = fs->object;
VM_OBJECT_ASSERT_WLOCKED(object);
first_object = fs->first_object;
if (first_object != object) {
if (!VM_OBJECT_TRYWLOCK(first_object)) {
VM_OBJECT_WUNLOCK(object);
VM_OBJECT_WLOCK(first_object);
VM_OBJECT_WLOCK(object);
}
}
/* Neither fictitious nor unmanaged pages can be reclaimed. */
if ((first_object->flags & (OBJ_FICTITIOUS | OBJ_UNMANAGED)) == 0) {
size = VM_FAULT_DONTNEED_MIN;
if (MAXPAGESIZES > 1 && size < pagesizes[1])
size = pagesizes[1];
end = rounddown2(vaddr, size);
if (vaddr - end >= size - PAGE_SIZE - ptoa(ahead) &&
(entry = fs->entry)->start < end) {
if (end - entry->start < size)
start = entry->start;
else
start = end - size;
pmap_advise(fs->map->pmap, start, end, MADV_DONTNEED);
pstart = OFF_TO_IDX(entry->offset) + atop(start -
entry->start);
m_next = vm_page_find_least(first_object, pstart);
pend = OFF_TO_IDX(entry->offset) + atop(end -
entry->start);
while ((m = m_next) != NULL && m->pindex < pend) {
m_next = TAILQ_NEXT(m, listq);
if (m->valid != VM_PAGE_BITS_ALL ||
vm_page_busied(m))
continue;
/*
* Don't clear PGA_REFERENCED, since it would
* likely represent a reference by a different
* process.
*
* Typically, at this point, prefetched pages
* are still in the inactive queue. Only
* pages that triggered page faults are in the
* active queue.
*/
vm_page_lock(m);
if (!vm_page_inactive(m))
vm_page_deactivate(m);
vm_page_unlock(m);
}
}
}
if (first_object != object)
VM_OBJECT_WUNLOCK(first_object);
}
/*
* vm_fault_prefault provides a quick way of clustering
* page faults into a process's address space. It is a "cousin"
* of vm_map_pmap_enter, except it runs at page fault time instead
* of mmap time.
*/
static void
vm_fault_prefault(const struct faultstate *fs, vm_offset_t addra,
int backward, int forward, bool obj_locked)
{
pmap_t pmap;
vm_map_entry_t entry;
vm_object_t backing_object, lobject;
vm_offset_t addr, starta;
vm_pindex_t pindex;
vm_page_t m;
int i;
pmap = fs->map->pmap;
if (pmap != vmspace_pmap(curthread->td_proc->p_vmspace))
return;
entry = fs->entry;
if (addra < backward * PAGE_SIZE) {
starta = entry->start;
} else {
starta = addra - backward * PAGE_SIZE;
if (starta < entry->start)
starta = entry->start;
}
/*
* Generate the sequence of virtual addresses that are candidates for
* prefaulting in an outward spiral from the faulting virtual address,
* "addra". Specifically, the sequence is "addra - PAGE_SIZE", "addra
* + PAGE_SIZE", "addra - 2 * PAGE_SIZE", "addra + 2 * PAGE_SIZE", ...
* If the candidate address doesn't have a backing physical page, then
* the loop immediately terminates.
*/
for (i = 0; i < 2 * imax(backward, forward); i++) {
addr = addra + ((i >> 1) + 1) * ((i & 1) == 0 ? -PAGE_SIZE :
PAGE_SIZE);
if (addr > addra + forward * PAGE_SIZE)
addr = 0;
if (addr < starta || addr >= entry->end)
continue;
if (!pmap_is_prefaultable(pmap, addr))
continue;
pindex = ((addr - entry->start) + entry->offset) >> PAGE_SHIFT;
lobject = entry->object.vm_object;
if (!obj_locked)
VM_OBJECT_RLOCK(lobject);
while ((m = vm_page_lookup(lobject, pindex)) == NULL &&
lobject->type == OBJT_DEFAULT &&
(backing_object = lobject->backing_object) != NULL) {
KASSERT((lobject->backing_object_offset & PAGE_MASK) ==
0, ("vm_fault_prefault: unaligned object offset"));
pindex += lobject->backing_object_offset >> PAGE_SHIFT;
VM_OBJECT_RLOCK(backing_object);
if (!obj_locked || lobject != entry->object.vm_object)
VM_OBJECT_RUNLOCK(lobject);
lobject = backing_object;
}
if (m == NULL) {
if (!obj_locked || lobject != entry->object.vm_object)
VM_OBJECT_RUNLOCK(lobject);
break;
}
if (m->valid == VM_PAGE_BITS_ALL &&
(m->flags & PG_FICTITIOUS) == 0)
pmap_enter_quick(pmap, addr, m, entry->protection);
if (!obj_locked || lobject != entry->object.vm_object)
VM_OBJECT_RUNLOCK(lobject);
}
}
/*
* Hold each of the physical pages that are mapped by the specified range of
* virtual addresses, ["addr", "addr" + "len"), if those mappings are valid
* and allow the specified types of access, "prot". If all of the implied
* pages are successfully held, then the number of held pages is returned
* together with pointers to those pages in the array "ma". However, if any
* of the pages cannot be held, -1 is returned.
*/
int
vm_fault_quick_hold_pages(vm_map_t map, vm_offset_t addr, vm_size_t len,
vm_prot_t prot, vm_page_t *ma, int max_count)
{
vm_offset_t end, va;
vm_page_t *mp;
int count;
boolean_t pmap_failed;
if (len == 0)
return (0);
end = round_page(addr + len);
addr = trunc_page(addr);
/*
* Check for illegal addresses.
*/
if (addr < vm_map_min(map) || addr > end || end > vm_map_max(map))
return (-1);
if (atop(end - addr) > max_count)
panic("vm_fault_quick_hold_pages: count > max_count");
count = atop(end - addr);
/*
* Most likely, the physical pages are resident in the pmap, so it is
* faster to try pmap_extract_and_hold() first.
*/
pmap_failed = FALSE;
for (mp = ma, va = addr; va < end; mp++, va += PAGE_SIZE) {
*mp = pmap_extract_and_hold(map->pmap, va, prot);
if (*mp == NULL)
pmap_failed = TRUE;
else if ((prot & VM_PROT_WRITE) != 0 &&
(*mp)->dirty != VM_PAGE_BITS_ALL) {
/*
* Explicitly dirty the physical page. Otherwise, the
* caller's changes may go unnoticed because they are
* performed through an unmanaged mapping or by a DMA
* operation.
*
* The object lock is not held here.
* See vm_page_clear_dirty_mask().
*/
vm_page_dirty(*mp);
}
}
if (pmap_failed) {
/*
* One or more pages could not be held by the pmap. Either no
* page was mapped at the specified virtual address or that
* mapping had insufficient permissions. Attempt to fault in
* and hold these pages.
*
* If vm_fault_disable_pagefaults() was called,
* i.e., TDP_NOFAULTING is set, we must not sleep nor
* acquire MD VM locks, which means we must not call
* vm_fault_hold(). Some (out of tree) callers mark
* too wide a code area with vm_fault_disable_pagefaults()
* already; use the VM_PROT_QUICK_NOFAULT flag to request
* the proper behaviour explicitly.
*/
if ((prot & VM_PROT_QUICK_NOFAULT) != 0 &&
(curthread->td_pflags & TDP_NOFAULTING) != 0)
goto error;
for (mp = ma, va = addr; va < end; mp++, va += PAGE_SIZE)
if (*mp == NULL && vm_fault_hold(map, va, prot,
VM_FAULT_NORMAL, mp) != KERN_SUCCESS)
goto error;
}
return (count);
error:
for (mp = ma; mp < ma + count; mp++)
if (*mp != NULL) {
vm_page_lock(*mp);
vm_page_unhold(*mp);
vm_page_unlock(*mp);
}
return (-1);
}
/*
* Routine:
* vm_fault_copy_entry
* Function:
* Create a new shadow object backing dst_entry with a private copy
* of all underlying pages. When src_entry is equal to dst_entry,
* the function implements COW for a wired-down map entry. Otherwise,
* it forks the wired entry into dst_map.
*
* In/out conditions:
* The source and destination maps must be locked for write.
* The source map entry must be wired down (or be a sharing map
* entry corresponding to a main map entry that is wired down).
*/
void
vm_fault_copy_entry(vm_map_t dst_map, vm_map_t src_map,
vm_map_entry_t dst_entry, vm_map_entry_t src_entry,
vm_ooffset_t *fork_charge)
{
vm_object_t backing_object, dst_object, object, src_object;
vm_pindex_t dst_pindex, pindex, src_pindex;
vm_prot_t access, prot;
vm_offset_t vaddr;
vm_page_t dst_m;
vm_page_t src_m;
boolean_t upgrade;
#ifdef lint
src_map++;
#endif /* lint */
upgrade = src_entry == dst_entry;
access = prot = dst_entry->protection;
src_object = src_entry->object.vm_object;
src_pindex = OFF_TO_IDX(src_entry->offset);
if (upgrade && (dst_entry->eflags & MAP_ENTRY_NEEDS_COPY) == 0) {
dst_object = src_object;
vm_object_reference(dst_object);
} else {
/*
* Create the top-level object for the destination entry. (Doesn't
* actually shadow anything - we copy the pages directly.)
*/
dst_object = vm_object_allocate(OBJT_DEFAULT,
atop(dst_entry->end - dst_entry->start));
#if VM_NRESERVLEVEL > 0
dst_object->flags |= OBJ_COLORED;
dst_object->pg_color = atop(dst_entry->start);
#endif
dst_object->domain = src_object->domain;
dst_object->charge = dst_entry->end - dst_entry->start;
}
VM_OBJECT_WLOCK(dst_object);
KASSERT(upgrade || dst_entry->object.vm_object == NULL,
("vm_fault_copy_entry: vm_object not NULL"));
if (src_object != dst_object) {
dst_entry->object.vm_object = dst_object;
dst_entry->offset = 0;
dst_entry->eflags &= ~MAP_ENTRY_VN_EXEC;
}
if (fork_charge != NULL) {
KASSERT(dst_entry->cred == NULL,
("vm_fault_copy_entry: leaked swp charge"));
dst_object->cred = curthread->td_ucred;
crhold(dst_object->cred);
*fork_charge += dst_object->charge;
} else if ((dst_object->type == OBJT_DEFAULT ||
dst_object->type == OBJT_SWAP) &&
dst_object->cred == NULL) {
KASSERT(dst_entry->cred != NULL, ("no cred for entry %p",
dst_entry));
dst_object->cred = dst_entry->cred;
dst_entry->cred = NULL;
}
/*
* If not an upgrade, then enter the mappings in the pmap as
* read and/or execute accesses. Otherwise, enter them as
* write accesses.
*
* A writeable large page mapping is only created if all of
* the constituent small page mappings are modified. Marking
* PTEs as modified on inception allows promotion to happen
* without taking a potentially large number of soft faults.
*/
if (!upgrade)
access &= ~VM_PROT_WRITE;
/*
* Loop through all of the virtual pages within the entry's
* range, copying each page from the source object to the
* destination object. Since the source is wired, those pages
* must exist. In contrast, the destination is pageable.
* Since the destination object doesn't share any backing storage
* with the source object, all of its pages must be dirtied,
* regardless of whether they can be written.
*/
for (vaddr = dst_entry->start, dst_pindex = 0;
vaddr < dst_entry->end;
vaddr += PAGE_SIZE, dst_pindex++) {
again:
/*
* Find the page in the source object, and copy it in.
* Because the source is wired down, the page will be
* in memory.
*/
if (src_object != dst_object)
VM_OBJECT_RLOCK(src_object);
object = src_object;
pindex = src_pindex + dst_pindex;
while ((src_m = vm_page_lookup(object, pindex)) == NULL &&
(backing_object = object->backing_object) != NULL) {
/*
* Unless the source mapping is read-only or
* it is presently being upgraded from
* read-only, the first object in the shadow
* chain should provide all of the pages. In
* other words, this loop body should never be
* executed when the source mapping is already
* read/write.
*/
KASSERT((src_entry->protection & VM_PROT_WRITE) == 0 ||
upgrade,
("vm_fault_copy_entry: main object missing page"));
VM_OBJECT_RLOCK(backing_object);
pindex += OFF_TO_IDX(object->backing_object_offset);
if (object != dst_object)
VM_OBJECT_RUNLOCK(object);
object = backing_object;
}
KASSERT(src_m != NULL, ("vm_fault_copy_entry: page missing"));
if (object != dst_object) {
/*
* Allocate a page in the destination object.
*/
dst_m = vm_page_alloc(dst_object, (src_object ==
dst_object ? src_pindex : 0) + dst_pindex,
VM_ALLOC_NORMAL);
if (dst_m == NULL) {
VM_OBJECT_WUNLOCK(dst_object);
VM_OBJECT_RUNLOCK(object);
vm_wait(dst_object);
VM_OBJECT_WLOCK(dst_object);
goto again;
}
pmap_copy_page(src_m, dst_m);
VM_OBJECT_RUNLOCK(object);
dst_m->dirty = dst_m->valid = src_m->valid;
} else {
dst_m = src_m;
if (vm_page_sleep_if_busy(dst_m, "fltupg"))
goto again;
if (dst_m->pindex >= dst_object->size)
/*
* We are upgrading. The index can be
* out of bounds if the object type is
* vnode and the file was truncated.
*/
break;
vm_page_xbusy(dst_m);
}
VM_OBJECT_WUNLOCK(dst_object);
/*
* Enter it in the pmap. If a wired, copy-on-write
* mapping is being replaced by a write-enabled
* mapping, then wire that new mapping.
*
* The page can be invalid if the user called
* msync(MS_INVALIDATE) or truncated the backing vnode
* or shared memory object. In this case, do not
* insert it into pmap, but still do the copy so that
* all copies of the wired map entry have similar
* backing pages.
*/
if (dst_m->valid == VM_PAGE_BITS_ALL) {
pmap_enter(dst_map->pmap, vaddr, dst_m, prot,
access | (upgrade ? PMAP_ENTER_WIRED : 0), 0);
}
/*
* Mark it no longer busy, and put it on the active list.
*/
VM_OBJECT_WLOCK(dst_object);
if (upgrade) {
if (src_m != dst_m) {
vm_page_lock(src_m);
vm_page_unwire(src_m, PQ_INACTIVE);
vm_page_unlock(src_m);
vm_page_lock(dst_m);
vm_page_wire(dst_m);
vm_page_unlock(dst_m);
} else {
KASSERT(vm_page_wired(dst_m),
("dst_m %p is not wired", dst_m));
}
} else {
vm_page_lock(dst_m);
vm_page_activate(dst_m);
vm_page_unlock(dst_m);
}
vm_page_xunbusy(dst_m);
}
VM_OBJECT_WUNLOCK(dst_object);
if (upgrade) {
dst_entry->eflags &= ~(MAP_ENTRY_COW | MAP_ENTRY_NEEDS_COPY);
vm_object_deallocate(src_object);
}
}
/*
* Block entry into the machine-independent layer's page fault handler by
* the calling thread. Subsequent calls to vm_fault() by that thread will
* return KERN_PROTECTION_FAILURE. Enable machine-dependent handling of
* spurious page faults.
*/
int
vm_fault_disable_pagefaults(void)
{
return (curthread_pflags_set(TDP_NOFAULTING | TDP_RESETSPUR));
}
void
vm_fault_enable_pagefaults(int save)
{
curthread_pflags_restore(save);
}
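Note on the save/restore pair above: vm_fault_disable_pagefaults() returns
a value that must later be handed back to vm_fault_enable_pagefaults(), so
nested disable/enable sections unwind correctly. The following is a minimal
userland sketch of that pattern, not the kernel code. It assumes the saved
value is a mask that is ANDed back into the flag word; the names
model_pflags, model_pflags_set(), model_pflags_restore() and the MODEL_*
bit values are hypothetical stand-ins chosen for illustration.

```c
#include <assert.h>

/* Hypothetical stand-ins for TDP_NOFAULTING and TDP_RESETSPUR. */
#define	MODEL_NOFAULTING	0x1
#define	MODEL_RESETSPUR		0x2

/* Models the per-thread flag word (curthread->td_pflags in the kernel). */
static int model_pflags;

/*
 * Set the requested bits and return a restore mask.  The mask keeps
 * every bit outside "flags", plus those bits in "flags" that were
 * already set before this call, so restoring never clears a bit that
 * an outer caller had set.
 */
static int
model_pflags_set(int flags)
{
	int save;

	save = ~flags | (model_pflags & flags);
	model_pflags |= flags;
	return (save);
}

/* Clear only the bits that the matching model_pflags_set() newly set. */
static void
model_pflags_restore(int save)
{
	model_pflags &= save;
}

/* Userland analogue of vm_fault_disable_pagefaults(). */
static int
model_disable_pagefaults(void)
{
	return (model_pflags_set(MODEL_NOFAULTING | MODEL_RESETSPUR));
}

/* Userland analogue of vm_fault_enable_pagefaults(). */
static void
model_enable_pagefaults(int save)
{
	model_pflags_restore(save);
}
```

With this shape, an inner disable/enable pair leaves the flags set while an
outer section is still active, and only the outermost restore clears them.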
Index: head/sys/vm/vm_object.c
===================================================================
--- head/sys/vm/vm_object.c (revision 349431)
+++ head/sys/vm/vm_object.c (revision 349432)
@@ -1,2691 +1,2687 @@
/*-
* SPDX-License-Identifier: (BSD-3-Clause AND MIT-CMU)
*
* Copyright (c) 1991, 1993
* The Regents of the University of California. All rights reserved.
*
* This code is derived from software contributed to Berkeley by
* The Mach Operating System project at Carnegie-Mellon University.
*
* Redistribution and use in source and binary forms, with or without
* modification, are permitted provided that the following conditions
* are met:
* 1. Redistributions of source code must retain the above copyright
* notice, this list of conditions and the following disclaimer.
* 2. Redistributions in binary form must reproduce the above copyright
* notice, this list of conditions and the following disclaimer in the
* documentation and/or other materials provided with the distribution.
* 3. Neither the name of the University nor the names of its contributors
* may be used to endorse or promote products derived from this software
* without specific prior written permission.
*
* THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
* ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
* IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
* ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
* FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
* DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
* OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
* HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
* LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
* OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
* SUCH DAMAGE.
*
* from: @(#)vm_object.c 8.5 (Berkeley) 3/22/94
*
*
* Copyright (c) 1987, 1990 Carnegie-Mellon University.
* All rights reserved.
*
* Authors: Avadis Tevanian, Jr., Michael Wayne Young
*
* Permission to use, copy, modify and distribute this software and
* its documentation is hereby granted, provided that both the copyright
* notice and this permission notice appear in all copies of the
* software, derivative works or modified versions, and any portions
* thereof, and that both notices appear in supporting documentation.
*
* CARNEGIE MELLON ALLOWS FREE USE OF THIS SOFTWARE IN ITS "AS IS"
* CONDITION. CARNEGIE MELLON DISCLAIMS ANY LIABILITY OF ANY KIND
* FOR ANY DAMAGES WHATSOEVER RESULTING FROM THE USE OF THIS SOFTWARE.
*
* Carnegie Mellon requests users of this software to return to
*
* Software Distribution Coordinator or Software.Distribution@CS.CMU.EDU
* School of Computer Science
* Carnegie Mellon University
* Pittsburgh PA 15213-3890
*
* any improvements or extensions that they make and grant Carnegie the
* rights to redistribute these changes.
*/
/*
* Virtual memory object module.
*/
#include
__FBSDID("$FreeBSD$");
#include "opt_vm.h"
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include /* for curproc, pageproc */
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
static int old_msync;
SYSCTL_INT(_vm, OID_AUTO, old_msync, CTLFLAG_RW, &old_msync, 0,
"Use old (insecure) msync behavior");
static int vm_object_page_collect_flush(vm_object_t object, vm_page_t p,
int pagerflags, int flags, boolean_t *clearobjflags,
boolean_t *eio);
static boolean_t vm_object_page_remove_write(vm_page_t p, int flags,
boolean_t *clearobjflags);
static void vm_object_qcollapse(vm_object_t object);
static void vm_object_vndeallocate(vm_object_t object);
/*
* Virtual memory objects maintain the actual data
* associated with allocated virtual memory. A given
* page of memory exists within exactly one object.
*
* An object is only deallocated when all "references"
* are given up. Only one "reference" to a given
* region of an object should be writeable.
*
* Associated with each object is a list of all resident
* memory pages belonging to that object; this list is
* maintained by the "vm_page" module, and locked by the object's
* lock.
*
* Each object also records a "pager" routine which is
* used to retrieve (and store) pages to the proper backing
* storage. In addition, objects may be backed by other
* objects from which they were virtual-copied.
*
* The only items within the object structure which are
* modified after time of creation are:
* reference count locked by object's lock
* pager routine locked by object's lock
*
*/
struct object_q vm_object_list;
struct mtx vm_object_list_mtx; /* lock for object list and count */
struct vm_object kernel_object_store;
static SYSCTL_NODE(_vm_stats, OID_AUTO, object, CTLFLAG_RD, 0,
"VM object stats");
static counter_u64_t object_collapses = EARLY_COUNTER;
SYSCTL_COUNTER_U64(_vm_stats_object, OID_AUTO, collapses, CTLFLAG_RD,
&object_collapses,
"VM object collapses");
static counter_u64_t object_bypasses = EARLY_COUNTER;
SYSCTL_COUNTER_U64(_vm_stats_object, OID_AUTO, bypasses, CTLFLAG_RD,
&object_bypasses,
"VM object bypasses");
static void
counter_startup(void)
{
object_collapses = counter_u64_alloc(M_WAITOK);
object_bypasses = counter_u64_alloc(M_WAITOK);
}
SYSINIT(object_counters, SI_SUB_CPU, SI_ORDER_ANY, counter_startup, NULL);
static uma_zone_t obj_zone;
static int vm_object_zinit(void *mem, int size, int flags);
#ifdef INVARIANTS
static void vm_object_zdtor(void *mem, int size, void *arg);
static void
vm_object_zdtor(void *mem, int size, void *arg)
{
vm_object_t object;
object = (vm_object_t)mem;
KASSERT(object->ref_count == 0,
("object %p ref_count = %d", object, object->ref_count));
KASSERT(TAILQ_EMPTY(&object->memq),
("object %p has resident pages in its memq", object));
KASSERT(vm_radix_is_empty(&object->rtree),
("object %p has resident pages in its trie", object));
#if VM_NRESERVLEVEL > 0
KASSERT(LIST_EMPTY(&object->rvq),
("object %p has reservations",
object));
#endif
KASSERT(object->paging_in_progress == 0,
("object %p paging_in_progress = %d",
object, object->paging_in_progress));
KASSERT(object->resident_page_count == 0,
("object %p resident_page_count = %d",
object, object->resident_page_count));
KASSERT(object->shadow_count == 0,
("object %p shadow_count = %d",
object, object->shadow_count));
KASSERT(object->type == OBJT_DEAD,
("object %p has non-dead type %d",
object, object->type));
}
#endif
static int
vm_object_zinit(void *mem, int size, int flags)
{
vm_object_t object;
object = (vm_object_t)mem;
rw_init_flags(&object->lock, "vm object", RW_DUPOK | RW_NEW);
/* These are true for any object that has been freed */
object->type = OBJT_DEAD;
object->ref_count = 0;
vm_radix_init(&object->rtree);
object->paging_in_progress = 0;
object->resident_page_count = 0;
object->shadow_count = 0;
object->flags = OBJ_DEAD;
mtx_lock(&vm_object_list_mtx);
TAILQ_INSERT_TAIL(&vm_object_list, object, object_list);
mtx_unlock(&vm_object_list_mtx);
return (0);
}
static void
_vm_object_allocate(objtype_t type, vm_pindex_t size, vm_object_t object)
{
TAILQ_INIT(&object->memq);
LIST_INIT(&object->shadow_head);
object->type = type;
if (type == OBJT_SWAP)
pctrie_init(&object->un_pager.swp.swp_blks);
/*
* Ensure that swap_pager_swapoff()'s iteration over object_list
* sees an up-to-date type and pctrie head if it observed a
* non-dead object.
*/
atomic_thread_fence_rel();
switch (type) {
case OBJT_DEAD:
panic("_vm_object_allocate: can't create OBJT_DEAD");
case OBJT_DEFAULT:
case OBJT_SWAP:
object->flags = OBJ_ONEMAPPING;
break;
case OBJT_DEVICE:
case OBJT_SG:
object->flags = OBJ_FICTITIOUS | OBJ_UNMANAGED;
break;
case OBJT_MGTDEVICE:
object->flags = OBJ_FICTITIOUS;
break;
case OBJT_PHYS:
object->flags = OBJ_UNMANAGED;
break;
case OBJT_VNODE:
object->flags = 0;
break;
default:
panic("_vm_object_allocate: type %d is undefined", type);
}
object->size = size;
object->domain.dr_policy = NULL;
object->generation = 1;
object->ref_count = 1;
object->memattr = VM_MEMATTR_DEFAULT;
object->cred = NULL;
object->charge = 0;
object->handle = NULL;
object->backing_object = NULL;
object->backing_object_offset = (vm_ooffset_t) 0;
#if VM_NRESERVLEVEL > 0
LIST_INIT(&object->rvq);
#endif
umtx_shm_object_init(object);
}
/*
* vm_object_init:
*
* Initialize the VM objects module.
*/
void
vm_object_init(void)
{
TAILQ_INIT(&vm_object_list);
mtx_init(&vm_object_list_mtx, "vm object_list", NULL, MTX_DEF);
rw_init(&kernel_object->lock, "kernel vm object");
_vm_object_allocate(OBJT_PHYS, atop(VM_MAX_KERNEL_ADDRESS -
VM_MIN_KERNEL_ADDRESS), kernel_object);
#if VM_NRESERVLEVEL > 0
kernel_object->flags |= OBJ_COLORED;
kernel_object->pg_color = (u_short)atop(VM_MIN_KERNEL_ADDRESS);
#endif
/*
* The lock portion of struct vm_object must be type stable due
* to vm_pageout_fallback_object_lock locking a vm object
* without holding any references to it.
*/
obj_zone = uma_zcreate("VM OBJECT", sizeof (struct vm_object), NULL,
#ifdef INVARIANTS
vm_object_zdtor,
#else
NULL,
#endif
vm_object_zinit, NULL, UMA_ALIGN_PTR, UMA_ZONE_NOFREE);
vm_radix_zinit();
}
void
vm_object_clear_flag(vm_object_t object, u_short bits)
{
VM_OBJECT_ASSERT_WLOCKED(object);
object->flags &= ~bits;
}
/*
* Sets the default memory attribute for the specified object. Pages
* that are allocated to this object are by default assigned this memory
* attribute.
*
* Presently, this function must be called before any pages are allocated
* to the object. In the future, this requirement may be relaxed for
* "default" and "swap" objects.
*/
int
vm_object_set_memattr(vm_object_t object, vm_memattr_t memattr)
{
VM_OBJECT_ASSERT_WLOCKED(object);
switch (object->type) {
case OBJT_DEFAULT:
case OBJT_DEVICE:
case OBJT_MGTDEVICE:
case OBJT_PHYS:
case OBJT_SG:
case OBJT_SWAP:
case OBJT_VNODE:
if (!TAILQ_EMPTY(&object->memq))
return (KERN_FAILURE);
break;
case OBJT_DEAD:
return (KERN_INVALID_ARGUMENT);
default:
panic("vm_object_set_memattr: object %p is of undefined type",
object);
}
object->memattr = memattr;
return (KERN_SUCCESS);
}
void
vm_object_pip_add(vm_object_t object, short i)
{
VM_OBJECT_ASSERT_WLOCKED(object);
object->paging_in_progress += i;
}
void
vm_object_pip_subtract(vm_object_t object, short i)
{
VM_OBJECT_ASSERT_WLOCKED(object);
object->paging_in_progress -= i;
}
void
vm_object_pip_wakeup(vm_object_t object)
{
VM_OBJECT_ASSERT_WLOCKED(object);
object->paging_in_progress--;
if ((object->flags & OBJ_PIPWNT) && object->paging_in_progress == 0) {
vm_object_clear_flag(object, OBJ_PIPWNT);
wakeup(object);
}
}
void
vm_object_pip_wakeupn(vm_object_t object, short i)
{
VM_OBJECT_ASSERT_WLOCKED(object);
if (i)
object->paging_in_progress -= i;
if ((object->flags & OBJ_PIPWNT) && object->paging_in_progress == 0) {
vm_object_clear_flag(object, OBJ_PIPWNT);
wakeup(object);
}
}
void
vm_object_pip_wait(vm_object_t object, char *waitid)
{
VM_OBJECT_ASSERT_WLOCKED(object);
while (object->paging_in_progress) {
object->flags |= OBJ_PIPWNT;
VM_OBJECT_SLEEP(object, object, PVM, waitid, 0);
}
}
/*
* vm_object_allocate:
*
* Returns a new object with the given size.
*/
vm_object_t
vm_object_allocate(objtype_t type, vm_pindex_t size)
{
vm_object_t object;
object = (vm_object_t)uma_zalloc(obj_zone, M_WAITOK);
_vm_object_allocate(type, size, object);
return (object);
}
/*
* vm_object_reference:
*
* Gets another reference to the given object. Note: OBJ_DEAD
* objects can be referenced during final cleaning.
*/
void
vm_object_reference(vm_object_t object)
{
if (object == NULL)
return;
VM_OBJECT_WLOCK(object);
vm_object_reference_locked(object);
VM_OBJECT_WUNLOCK(object);
}
/*
* vm_object_reference_locked:
*
* Gets another reference to the given object.
*
* The object must be locked.
*/
void
vm_object_reference_locked(vm_object_t object)
{
struct vnode *vp;
VM_OBJECT_ASSERT_WLOCKED(object);
object->ref_count++;
if (object->type == OBJT_VNODE) {
vp = object->handle;
vref(vp);
}
}
/*
* Handle deallocating an object of type OBJT_VNODE.
*/
static void
vm_object_vndeallocate(vm_object_t object)
{
struct vnode *vp = (struct vnode *) object->handle;
VM_OBJECT_ASSERT_WLOCKED(object);
KASSERT(object->type == OBJT_VNODE,
("vm_object_vndeallocate: not a vnode object"));
KASSERT(vp != NULL, ("vm_object_vndeallocate: missing vp"));
#ifdef INVARIANTS
if (object->ref_count == 0) {
vn_printf(vp, "vm_object_vndeallocate ");
panic("vm_object_vndeallocate: bad object reference count");
}
#endif
if (!umtx_shm_vnobj_persistent && object->ref_count == 1)
umtx_shm_object_terminated(object);
object->ref_count--;
/* vrele may need the vnode lock. */
VM_OBJECT_WUNLOCK(object);
vrele(vp);
}
/*
* vm_object_deallocate:
*
* Release a reference to the specified object,
* gained either through a vm_object_allocate
* or a vm_object_reference call. When all references
* are gone, storage associated with this object
* may be relinquished.
*
* No object may be locked.
*/
void
vm_object_deallocate(vm_object_t object)
{
vm_object_t temp;
struct vnode *vp;
while (object != NULL) {
VM_OBJECT_WLOCK(object);
if (object->type == OBJT_VNODE) {
vm_object_vndeallocate(object);
return;
}
KASSERT(object->ref_count != 0,
("vm_object_deallocate: object deallocated too many times: %d", object->type));
/*
* If the reference count goes to 0 we start calling
* vm_object_terminate() on the object chain.
* A ref count of 1 may be a special case depending on the
* shadow count being 0 or 1.
*/
object->ref_count--;
if (object->ref_count > 1) {
VM_OBJECT_WUNLOCK(object);
return;
} else if (object->ref_count == 1) {
if (object->type == OBJT_SWAP &&
(object->flags & OBJ_TMPFS) != 0) {
vp = object->un_pager.swp.swp_tmpfs;
vhold(vp);
VM_OBJECT_WUNLOCK(object);
vn_lock(vp, LK_EXCLUSIVE | LK_RETRY);
VM_OBJECT_WLOCK(object);
if (object->type == OBJT_DEAD ||
object->ref_count != 1) {
VM_OBJECT_WUNLOCK(object);
VOP_UNLOCK(vp, 0);
vdrop(vp);
return;
}
if ((object->flags & OBJ_TMPFS) != 0)
VOP_UNSET_TEXT(vp);
VOP_UNLOCK(vp, 0);
vdrop(vp);
}
if (object->shadow_count == 0 &&
object->handle == NULL &&
(object->type == OBJT_DEFAULT ||
(object->type == OBJT_SWAP &&
(object->flags & OBJ_TMPFS_NODE) == 0))) {
vm_object_set_flag(object, OBJ_ONEMAPPING);
} else if ((object->shadow_count == 1) &&
(object->handle == NULL) &&
(object->type == OBJT_DEFAULT ||
object->type == OBJT_SWAP)) {
vm_object_t robject;
robject = LIST_FIRST(&object->shadow_head);
KASSERT(robject != NULL,
("vm_object_deallocate: ref_count: %d, shadow_count: %d",
object->ref_count,
object->shadow_count));
KASSERT((robject->flags & OBJ_TMPFS_NODE) == 0,
("shadowed tmpfs v_object %p", object));
if (!VM_OBJECT_TRYWLOCK(robject)) {
/*
* Avoid a potential deadlock.
*/
object->ref_count++;
VM_OBJECT_WUNLOCK(object);
/*
* More likely than not the thread
* holding robject's lock has lower
* priority than the current thread.
* Let the lower priority thread run.
*/
pause("vmo_de", 1);
continue;
}
/*
* Collapse object into its shadow unless its
* shadow is dead. In that case, object will
* be deallocated by the thread that is
* deallocating its shadow.
*/
if ((robject->flags & OBJ_DEAD) == 0 &&
(robject->handle == NULL) &&
(robject->type == OBJT_DEFAULT ||
robject->type == OBJT_SWAP)) {
robject->ref_count++;
retry:
if (robject->paging_in_progress) {
VM_OBJECT_WUNLOCK(object);
vm_object_pip_wait(robject,
"objde1");
temp = robject->backing_object;
if (object == temp) {
VM_OBJECT_WLOCK(object);
goto retry;
}
} else if (object->paging_in_progress) {
VM_OBJECT_WUNLOCK(robject);
object->flags |= OBJ_PIPWNT;
VM_OBJECT_SLEEP(object, object,
PDROP | PVM, "objde2", 0);
VM_OBJECT_WLOCK(robject);
temp = robject->backing_object;
if (object == temp) {
VM_OBJECT_WLOCK(object);
goto retry;
}
} else
VM_OBJECT_WUNLOCK(object);
if (robject->ref_count == 1) {
robject->ref_count--;
object = robject;
goto doterm;
}
object = robject;
vm_object_collapse(object);
VM_OBJECT_WUNLOCK(object);
continue;
}
VM_OBJECT_WUNLOCK(robject);
}
VM_OBJECT_WUNLOCK(object);
return;
}
doterm:
umtx_shm_object_terminated(object);
temp = object->backing_object;
if (temp != NULL) {
KASSERT((object->flags & OBJ_TMPFS_NODE) == 0,
("shadowed tmpfs v_object 2 %p", object));
VM_OBJECT_WLOCK(temp);
LIST_REMOVE(object, shadow_list);
temp->shadow_count--;
VM_OBJECT_WUNLOCK(temp);
object->backing_object = NULL;
}
/*
* Don't double-terminate, we could be in a termination
* recursion due to the terminate having to sync data
* to disk.
*/
if ((object->flags & OBJ_DEAD) == 0)
vm_object_terminate(object);
else
VM_OBJECT_WUNLOCK(object);
object = temp;
}
}
/*
* vm_object_destroy removes the object from the global object list
* and frees the space for the object.
*/
void
vm_object_destroy(vm_object_t object)
{
/*
* Release the allocation charge.
*/
if (object->cred != NULL) {
swap_release_by_cred(object->charge, object->cred);
object->charge = 0;
crfree(object->cred);
object->cred = NULL;
}
/*
* Free the space for the object.
*/
uma_zfree(obj_zone, object);
}
/*
* vm_object_terminate_pages removes any remaining pageable pages
* from the object and resets the object to an empty state.
*/
static void
vm_object_terminate_pages(vm_object_t object)
{
vm_page_t p, p_next;
struct mtx *mtx;
VM_OBJECT_ASSERT_WLOCKED(object);
mtx = NULL;
/*
* Free any remaining pageable pages. This also removes them from the
* paging queues. However, don't free wired pages, just remove them
* from the object. Rather than incrementally removing each page from
* the object, the page and object are reset to an empty state.
*/
TAILQ_FOREACH_SAFE(p, &object->memq, listq, p_next) {
vm_page_assert_unbusied(p);
if ((object->flags & OBJ_UNMANAGED) == 0)
/*
* vm_page_free_prep() only needs the page
* lock for managed pages.
*/
vm_page_change_lock(p, &mtx);
p->object = NULL;
if (vm_page_wired(p))
continue;
VM_CNT_INC(v_pfree);
vm_page_free(p);
}
if (mtx != NULL)
mtx_unlock(mtx);
/*
* If the object contained any pages, then reset it to an empty state.
* None of the object's fields, including "resident_page_count", were
* modified by the preceding loop.
*/
if (object->resident_page_count != 0) {
vm_radix_reclaim_allnodes(&object->rtree);
TAILQ_INIT(&object->memq);
object->resident_page_count = 0;
if (object->type == OBJT_VNODE)
vdrop(object->handle);
}
}
/*
* vm_object_terminate actually destroys the specified object, freeing
* up all previously used resources.
*
* The object must be locked.
* This routine may block.
*/
void
vm_object_terminate(vm_object_t object)
{
VM_OBJECT_ASSERT_WLOCKED(object);
/*
* Make sure no one uses us.
*/
vm_object_set_flag(object, OBJ_DEAD);
/*
* wait for the pageout daemon to be done with the object
*/
vm_object_pip_wait(object, "objtrm");
KASSERT(!object->paging_in_progress,
("vm_object_terminate: pageout in progress"));
/*
* Clean and free the pages, as appropriate. All references to the
* object are gone, so we don't need to lock it.
*/
if (object->type == OBJT_VNODE) {
struct vnode *vp = (struct vnode *)object->handle;
/*
* Clean pages and flush buffers.
*/
vm_object_page_clean(object, 0, 0, OBJPC_SYNC);
VM_OBJECT_WUNLOCK(object);
vinvalbuf(vp, V_SAVE, 0, 0);
BO_LOCK(&vp->v_bufobj);
vp->v_bufobj.bo_flag |= BO_DEAD;
BO_UNLOCK(&vp->v_bufobj);
VM_OBJECT_WLOCK(object);
}
KASSERT(object->ref_count == 0,
("vm_object_terminate: object with references, ref_count=%d",
object->ref_count));
if ((object->flags & OBJ_PG_DTOR) == 0)
vm_object_terminate_pages(object);
#if VM_NRESERVLEVEL > 0
if (__predict_false(!LIST_EMPTY(&object->rvq)))
vm_reserv_break_all(object);
#endif
KASSERT(object->cred == NULL || object->type == OBJT_DEFAULT ||
object->type == OBJT_SWAP,
("%s: non-swap obj %p has cred", __func__, object));
/*
* Let the pager know object is dead.
*/
vm_pager_deallocate(object);
VM_OBJECT_WUNLOCK(object);
vm_object_destroy(object);
}
/*
* Make the page read-only so that we can clear the object flags. However, if
* this is a nosync mmap then the object is likely to stay dirty so do not
* mess with the page and do not clear the object flags. Returns TRUE if the
* page should be flushed, and FALSE otherwise.
*/
static boolean_t
vm_object_page_remove_write(vm_page_t p, int flags, boolean_t *clearobjflags)
{
/*
* If we have been asked to skip nosync pages and this is a
* nosync page, skip it. Note that the object flags were not
* cleared in this case so we do not have to set them.
*/
if ((flags & OBJPC_NOSYNC) != 0 && (p->oflags & VPO_NOSYNC) != 0) {
*clearobjflags = FALSE;
return (FALSE);
} else {
pmap_remove_write(p);
return (p->dirty != 0);
}
}
/*
* vm_object_page_clean
*
* Clean all dirty pages in the specified range of object. Leaves page
* on whatever queue it is currently on. If NOSYNC is set then do not
* write out pages with VPO_NOSYNC set (originally comes from MAP_NOSYNC),
* leaving the object dirty.
*
* When stuffing pages asynchronously, allow clustering. XXX we need a
* synchronous clustering mode implementation.
*
* Odd semantics: if start == end, we clean everything.
*
* The object must be locked.
*
* Returns FALSE if some page from the range was not written, as
* reported by the pager, and TRUE otherwise.
*/
boolean_t
vm_object_page_clean(vm_object_t object, vm_ooffset_t start, vm_ooffset_t end,
int flags)
{
vm_page_t np, p;
vm_pindex_t pi, tend, tstart;
int curgeneration, n, pagerflags;
boolean_t clearobjflags, eio, res;
VM_OBJECT_ASSERT_WLOCKED(object);
/*
* The OBJ_MIGHTBEDIRTY flag is only set for OBJT_VNODE
* objects. The check below prevents the function from
* operating on non-vnode objects.
*/
if ((object->flags & OBJ_MIGHTBEDIRTY) == 0 ||
object->resident_page_count == 0)
return (TRUE);
pagerflags = (flags & (OBJPC_SYNC | OBJPC_INVAL)) != 0 ?
VM_PAGER_PUT_SYNC : VM_PAGER_CLUSTER_OK;
pagerflags |= (flags & OBJPC_INVAL) != 0 ? VM_PAGER_PUT_INVAL : 0;
tstart = OFF_TO_IDX(start);
tend = (end == 0) ? object->size : OFF_TO_IDX(end + PAGE_MASK);
clearobjflags = tstart == 0 && tend >= object->size;
res = TRUE;
rescan:
curgeneration = object->generation;
for (p = vm_page_find_least(object, tstart); p != NULL; p = np) {
pi = p->pindex;
if (pi >= tend)
break;
np = TAILQ_NEXT(p, listq);
if (p->valid == 0)
continue;
if (vm_page_sleep_if_busy(p, "vpcwai")) {
if (object->generation != curgeneration) {
if ((flags & OBJPC_SYNC) != 0)
goto rescan;
else
clearobjflags = FALSE;
}
np = vm_page_find_least(object, pi);
continue;
}
if (!vm_object_page_remove_write(p, flags, &clearobjflags))
continue;
n = vm_object_page_collect_flush(object, p, pagerflags,
flags, &clearobjflags, &eio);
if (eio) {
res = FALSE;
clearobjflags = FALSE;
}
if (object->generation != curgeneration) {
if ((flags & OBJPC_SYNC) != 0)
goto rescan;
else
clearobjflags = FALSE;
}
/*
* If VOP_PUTPAGES() did a truncated write, so
* that even the first page of the run is not fully
* written, vm_pageout_flush() returns 0 as the run
* length. Since the condition that caused the truncated
* write may be permanent, e.g. exhausted free space,
* accepting n == 0 would cause an infinite loop.
*
* Forwarding the iterator leaves the unwritten page
* behind, but there is not much we can do if the
* filesystem refuses to write it.
*/
if (n == 0) {
n = 1;
clearobjflags = FALSE;
}
np = vm_page_find_least(object, pi + n);
}
#if 0
VOP_FSYNC(vp, (pagerflags & VM_PAGER_PUT_SYNC) ? MNT_WAIT : 0);
#endif
if (clearobjflags)
vm_object_clear_flag(object, OBJ_MIGHTBEDIRTY);
return (res);
}
static int
vm_object_page_collect_flush(vm_object_t object, vm_page_t p, int pagerflags,
int flags, boolean_t *clearobjflags, boolean_t *eio)
{
vm_page_t ma[vm_pageout_page_count], p_first, tp;
int count, i, mreq, runlen;
vm_page_lock_assert(p, MA_NOTOWNED);
VM_OBJECT_ASSERT_WLOCKED(object);
count = 1;
mreq = 0;
for (tp = p; count < vm_pageout_page_count; count++) {
tp = vm_page_next(tp);
if (tp == NULL || vm_page_busied(tp))
break;
if (!vm_object_page_remove_write(tp, flags, clearobjflags))
break;
}
for (p_first = p; count < vm_pageout_page_count; count++) {
tp = vm_page_prev(p_first);
if (tp == NULL || vm_page_busied(tp))
break;
if (!vm_object_page_remove_write(tp, flags, clearobjflags))
break;
p_first = tp;
mreq++;
}
for (tp = p_first, i = 0; i < count; tp = TAILQ_NEXT(tp, listq), i++)
ma[i] = tp;
vm_pageout_flush(ma, count, pagerflags, mreq, &runlen, eio);
return (runlen);
}
/*
* Note that there is absolutely no sense in writing out
* anonymous objects, so we track down the vnode object
* to write out.
* We invalidate (remove) all pages from the address space
* for semantic correctness.
*
* If the backing object is a device object with unmanaged pages, then any
* mappings to the specified range of pages must be removed before this
* function is called.
*
* Note: certain anonymous maps, such as MAP_NOSYNC maps,
* may start out with a NULL object.
*/
boolean_t
vm_object_sync(vm_object_t object, vm_ooffset_t offset, vm_size_t size,
boolean_t syncio, boolean_t invalidate)
{
vm_object_t backing_object;
struct vnode *vp;
struct mount *mp;
int error, flags, fsync_after;
boolean_t res;
if (object == NULL)
return (TRUE);
res = TRUE;
error = 0;
VM_OBJECT_WLOCK(object);
while ((backing_object = object->backing_object) != NULL) {
VM_OBJECT_WLOCK(backing_object);
offset += object->backing_object_offset;
VM_OBJECT_WUNLOCK(object);
object = backing_object;
if (object->size < OFF_TO_IDX(offset + size))
size = IDX_TO_OFF(object->size) - offset;
}
/*
* Flush pages if writing is allowed, invalidate them
* if invalidation is requested. Pages undergoing I/O
* will be ignored by vm_object_page_remove().
*
* We cannot lock the vnode and then wait for paging
* to complete without deadlocking against vm_fault.
* Instead we simply call vm_object_page_remove() and
* allow it to block internally on a page-by-page
* basis when it encounters pages undergoing async
* I/O.
*/
if (object->type == OBJT_VNODE &&
(object->flags & OBJ_MIGHTBEDIRTY) != 0 &&
((vp = object->handle)->v_vflag & VV_NOSYNC) == 0) {
VM_OBJECT_WUNLOCK(object);
(void) vn_start_write(vp, &mp, V_WAIT);
vn_lock(vp, LK_EXCLUSIVE | LK_RETRY);
if (syncio && !invalidate && offset == 0 &&
atop(size) == object->size) {
/*
* If syncing the whole mapping of the file,
* it is faster to schedule all the writes in
* async mode, also allowing the clustering,
* and then wait for i/o to complete.
*/
flags = 0;
fsync_after = TRUE;
} else {
flags = (syncio || invalidate) ? OBJPC_SYNC : 0;
flags |= invalidate ? (OBJPC_SYNC | OBJPC_INVAL) : 0;
fsync_after = FALSE;
}
VM_OBJECT_WLOCK(object);
res = vm_object_page_clean(object, offset, offset + size,
flags);
VM_OBJECT_WUNLOCK(object);
if (fsync_after)
error = VOP_FSYNC(vp, MNT_WAIT, curthread);
VOP_UNLOCK(vp, 0);
vn_finished_write(mp);
if (error != 0)
res = FALSE;
VM_OBJECT_WLOCK(object);
}
if ((object->type == OBJT_VNODE ||
object->type == OBJT_DEVICE) && invalidate) {
if (object->type == OBJT_DEVICE)
/*
* The option OBJPR_NOTMAPPED must be passed here
* because vm_object_page_remove() cannot remove
* unmanaged mappings.
*/
flags = OBJPR_NOTMAPPED;
else if (old_msync)
flags = 0;
else
flags = OBJPR_CLEANONLY;
vm_object_page_remove(object, OFF_TO_IDX(offset),
OFF_TO_IDX(offset + size + PAGE_MASK), flags);
}
VM_OBJECT_WUNLOCK(object);
return (res);
}
/*
* Determine whether the given advice can be applied to the object. Advice is
* not applied to unmanaged pages since they never belong to page queues, and
* since MADV_FREE is destructive, it can apply only to anonymous pages that
* have been mapped at most once.
*/
static bool
vm_object_advice_applies(vm_object_t object, int advice)
{
if ((object->flags & OBJ_UNMANAGED) != 0)
return (false);
if (advice != MADV_FREE)
return (true);
return ((object->type == OBJT_DEFAULT || object->type == OBJT_SWAP) &&
(object->flags & OBJ_ONEMAPPING) != 0);
}
static void
vm_object_madvise_freespace(vm_object_t object, int advice, vm_pindex_t pindex,
vm_size_t size)
{
if (advice == MADV_FREE && object->type == OBJT_SWAP)
swap_pager_freespace(object, pindex, size);
}
/*
* vm_object_madvise:
*
* Implements the madvise function at the object/page level.
*
* MADV_WILLNEED (any object)
*
* Activate the specified pages if they are resident.
*
* MADV_DONTNEED (any object)
*
* Deactivate the specified pages if they are resident.
*
* MADV_FREE (OBJT_DEFAULT/OBJT_SWAP objects,
* OBJ_ONEMAPPING only)
*
* Deactivate and clean the specified pages if they are
* resident. This permits the process to reuse the pages
* without faulting or the kernel to reclaim the pages
* without I/O.
*/
void
vm_object_madvise(vm_object_t object, vm_pindex_t pindex, vm_pindex_t end,
int advice)
{
vm_pindex_t tpindex;
vm_object_t backing_object, tobject;
vm_page_t m, tm;
if (object == NULL)
return;
relookup:
VM_OBJECT_WLOCK(object);
if (!vm_object_advice_applies(object, advice)) {
VM_OBJECT_WUNLOCK(object);
return;
}
for (m = vm_page_find_least(object, pindex); pindex < end; pindex++) {
tobject = object;
/*
* If the next page isn't resident in the top-level object, we
* need to search the shadow chain. When applying MADV_FREE, we
* take care to release any swap space used to store
* non-resident pages.
*/
if (m == NULL || pindex < m->pindex) {
/*
* Optimize a common case: if the top-level object has
* no backing object, we can skip over the non-resident
* range in constant time.
*/
if (object->backing_object == NULL) {
tpindex = (m != NULL && m->pindex < end) ?
m->pindex : end;
vm_object_madvise_freespace(object, advice,
pindex, tpindex - pindex);
if ((pindex = tpindex) == end)
break;
goto next_page;
}
tpindex = pindex;
do {
vm_object_madvise_freespace(tobject, advice,
tpindex, 1);
/*
* Prepare to search the next object in the
* chain.
*/
backing_object = tobject->backing_object;
if (backing_object == NULL)
goto next_pindex;
VM_OBJECT_WLOCK(backing_object);
tpindex +=
OFF_TO_IDX(tobject->backing_object_offset);
if (tobject != object)
VM_OBJECT_WUNLOCK(tobject);
tobject = backing_object;
if (!vm_object_advice_applies(tobject, advice))
goto next_pindex;
} while ((tm = vm_page_lookup(tobject, tpindex)) ==
NULL);
} else {
next_page:
tm = m;
m = TAILQ_NEXT(m, listq);
}
/*
* If the page is not in a normal state, skip it.
*/
if (tm->valid != VM_PAGE_BITS_ALL)
goto next_pindex;
vm_page_lock(tm);
if (vm_page_held(tm)) {
vm_page_unlock(tm);
goto next_pindex;
}
KASSERT((tm->flags & PG_FICTITIOUS) == 0,
("vm_object_madvise: page %p is fictitious", tm));
KASSERT((tm->oflags & VPO_UNMANAGED) == 0,
("vm_object_madvise: page %p is not managed", tm));
if (vm_page_busied(tm)) {
if (object != tobject)
VM_OBJECT_WUNLOCK(tobject);
VM_OBJECT_WUNLOCK(object);
if (advice == MADV_WILLNEED) {
/*
* Reference the page before unlocking and
* sleeping so that the page daemon is less
* likely to reclaim it.
*/
vm_page_aflag_set(tm, PGA_REFERENCED);
}
vm_page_busy_sleep(tm, "madvpo", false);
goto relookup;
}
vm_page_advise(tm, advice);
vm_page_unlock(tm);
vm_object_madvise_freespace(tobject, advice, tm->pindex, 1);
next_pindex:
if (tobject != object)
VM_OBJECT_WUNLOCK(tobject);
}
VM_OBJECT_WUNLOCK(object);
}
/*
* vm_object_shadow:
*
* Create a new object which is backed by the
* specified existing object range. The source
* object reference is deallocated.
*
* The new object and offset into that object
* are returned in the source parameters.
*/
void
vm_object_shadow(
vm_object_t *object, /* IN/OUT */
vm_ooffset_t *offset, /* IN/OUT */
vm_size_t length)
{
vm_object_t source;
vm_object_t result;
source = *object;
/*
* Don't create the new object if the old object isn't shared.
*/
if (source != NULL) {
VM_OBJECT_WLOCK(source);
if (source->ref_count == 1 &&
source->handle == NULL &&
(source->type == OBJT_DEFAULT ||
source->type == OBJT_SWAP)) {
VM_OBJECT_WUNLOCK(source);
return;
}
VM_OBJECT_WUNLOCK(source);
}
/*
* Allocate a new object with the given length.
*/
result = vm_object_allocate(OBJT_DEFAULT, atop(length));
/*
* The new object shadows the source object, adding a reference to it.
* Our caller changes his reference to point to the new object,
* removing a reference to the source object. Net result: no change
* of reference count.
*
* Try to optimize the result object's page color when shadowing
* in order to maintain page coloring consistency in the combined
* shadowed object.
*/
result->backing_object = source;
/*
* Store the offset into the source object, and fix up the offset into
* the new object.
*/
result->backing_object_offset = *offset;
if (source != NULL) {
VM_OBJECT_WLOCK(source);
result->domain = source->domain;
LIST_INSERT_HEAD(&source->shadow_head, result, shadow_list);
source->shadow_count++;
#if VM_NRESERVLEVEL > 0
result->flags |= source->flags & OBJ_COLORED;
result->pg_color = (source->pg_color + OFF_TO_IDX(*offset)) &
((1 << (VM_NFREEORDER - 1)) - 1);
#endif
VM_OBJECT_WUNLOCK(source);
}
/*
* Return the new object and offset.
*/
*offset = 0;
*object = result;
}
/*
* vm_object_split:
*
* Split the pages in a map entry into a new object. This affords
* easier removal of unused pages, and keeps object inheritance from
* having a negative impact on memory usage.
*/
void
vm_object_split(vm_map_entry_t entry)
{
vm_page_t m, m_next;
vm_object_t orig_object, new_object, source;
vm_pindex_t idx, offidxstart;
vm_size_t size;
orig_object = entry->object.vm_object;
if (orig_object->type != OBJT_DEFAULT && orig_object->type != OBJT_SWAP)
return;
if (orig_object->ref_count <= 1)
return;
VM_OBJECT_WUNLOCK(orig_object);
offidxstart = OFF_TO_IDX(entry->offset);
size = atop(entry->end - entry->start);
/*
* If swap_pager_copy() is later called, it will convert new_object
* into a swap object.
*/
new_object = vm_object_allocate(OBJT_DEFAULT, size);
/*
* At this point, the new object is still private, so the order in
* which the original and new objects are locked does not matter.
*/
VM_OBJECT_WLOCK(new_object);
VM_OBJECT_WLOCK(orig_object);
new_object->domain = orig_object->domain;
source = orig_object->backing_object;
if (source != NULL) {
VM_OBJECT_WLOCK(source);
if ((source->flags & OBJ_DEAD) != 0) {
VM_OBJECT_WUNLOCK(source);
VM_OBJECT_WUNLOCK(orig_object);
VM_OBJECT_WUNLOCK(new_object);
vm_object_deallocate(new_object);
VM_OBJECT_WLOCK(orig_object);
return;
}
LIST_INSERT_HEAD(&source->shadow_head,
new_object, shadow_list);
source->shadow_count++;
vm_object_reference_locked(source); /* for new_object */
vm_object_clear_flag(source, OBJ_ONEMAPPING);
VM_OBJECT_WUNLOCK(source);
new_object->backing_object_offset =
orig_object->backing_object_offset + entry->offset;
new_object->backing_object = source;
}
if (orig_object->cred != NULL) {
new_object->cred = orig_object->cred;
crhold(orig_object->cred);
new_object->charge = ptoa(size);
KASSERT(orig_object->charge >= ptoa(size),
("orig_object->charge < 0"));
orig_object->charge -= ptoa(size);
}
retry:
m = vm_page_find_least(orig_object, offidxstart);
for (; m != NULL && (idx = m->pindex - offidxstart) < size;
m = m_next) {
m_next = TAILQ_NEXT(m, listq);
/*
* We must wait for pending I/O to complete before we can
* rename the page.
*
* We do not have to VM_PROT_NONE the page as mappings should
* not be changed by this operation.
*/
if (vm_page_busied(m)) {
VM_OBJECT_WUNLOCK(new_object);
vm_page_lock(m);
VM_OBJECT_WUNLOCK(orig_object);
vm_page_busy_sleep(m, "spltwt", false);
VM_OBJECT_WLOCK(orig_object);
VM_OBJECT_WLOCK(new_object);
goto retry;
}
/* vm_page_rename() will dirty the page. */
if (vm_page_rename(m, new_object, idx)) {
VM_OBJECT_WUNLOCK(new_object);
VM_OBJECT_WUNLOCK(orig_object);
vm_radix_wait();
VM_OBJECT_WLOCK(orig_object);
VM_OBJECT_WLOCK(new_object);
goto retry;
}
#if VM_NRESERVLEVEL > 0
/*
* If some of the reservation's allocated pages remain with
* the original object, then transferring the reservation to
* the new object is neither particularly beneficial nor
* particularly harmful as compared to leaving the reservation
* with the original object. If, however, all of the
* reservation's allocated pages are transferred to the new
* object, then transferring the reservation is typically
* beneficial. Determining which of these two cases applies
* would be more costly than unconditionally renaming the
* reservation.
*/
vm_reserv_rename(m, new_object, orig_object, offidxstart);
#endif
if (orig_object->type == OBJT_SWAP)
vm_page_xbusy(m);
}
if (orig_object->type == OBJT_SWAP) {
/*
* swap_pager_copy() can sleep, in which case the orig_object's
* and new_object's locks are released and reacquired.
*/
swap_pager_copy(orig_object, new_object, offidxstart, 0);
TAILQ_FOREACH(m, &new_object->memq, listq)
vm_page_xunbusy(m);
}
VM_OBJECT_WUNLOCK(orig_object);
VM_OBJECT_WUNLOCK(new_object);
entry->object.vm_object = new_object;
entry->offset = 0LL;
vm_object_deallocate(orig_object);
VM_OBJECT_WLOCK(new_object);
}
#define OBSC_COLLAPSE_NOWAIT 0x0002
#define OBSC_COLLAPSE_WAIT 0x0004
static vm_page_t
vm_object_collapse_scan_wait(vm_object_t object, vm_page_t p, vm_page_t next,
int op)
{
vm_object_t backing_object;
VM_OBJECT_ASSERT_WLOCKED(object);
backing_object = object->backing_object;
VM_OBJECT_ASSERT_WLOCKED(backing_object);
KASSERT(p == NULL || vm_page_busied(p), ("unbusy page %p", p));
KASSERT(p == NULL || p->object == object || p->object == backing_object,
("invalid ownership %p %p %p", p, object, backing_object));
if ((op & OBSC_COLLAPSE_NOWAIT) != 0)
return (next);
if (p != NULL)
vm_page_lock(p);
VM_OBJECT_WUNLOCK(object);
VM_OBJECT_WUNLOCK(backing_object);
/* The page is only NULL when rename fails. */
if (p == NULL)
vm_radix_wait();
else
vm_page_busy_sleep(p, "vmocol", false);
VM_OBJECT_WLOCK(object);
VM_OBJECT_WLOCK(backing_object);
return (TAILQ_FIRST(&backing_object->memq));
}
static bool
vm_object_scan_all_shadowed(vm_object_t object)
{
vm_object_t backing_object;
vm_page_t p, pp;
vm_pindex_t backing_offset_index, new_pindex, pi, ps;
VM_OBJECT_ASSERT_WLOCKED(object);
VM_OBJECT_ASSERT_WLOCKED(object->backing_object);
backing_object = object->backing_object;
if (backing_object->type != OBJT_DEFAULT &&
backing_object->type != OBJT_SWAP)
return (false);
pi = backing_offset_index = OFF_TO_IDX(object->backing_object_offset);
p = vm_page_find_least(backing_object, pi);
ps = swap_pager_find_least(backing_object, pi);
/*
* Only check pages inside the parent object's range and
* inside the parent object's mapping of the backing object.
*/
for (;; pi++) {
if (p != NULL && p->pindex < pi)
p = TAILQ_NEXT(p, listq);
if (ps < pi)
ps = swap_pager_find_least(backing_object, pi);
if (p == NULL && ps >= backing_object->size)
break;
else if (p == NULL)
pi = ps;
else
pi = MIN(p->pindex, ps);
new_pindex = pi - backing_offset_index;
if (new_pindex >= object->size)
break;
/*
* See if the parent has the page or if the parent's object
* pager has the page. If the parent has the page but the page
* is not valid, the parent's object pager must have the page.
*
* If this fails, the parent does not completely shadow the
* object and we might as well give up now.
*/
pp = vm_page_lookup(object, new_pindex);
if ((pp == NULL || pp->valid == 0) &&
!vm_pager_has_page(object, new_pindex, NULL, NULL))
return (false);
}
return (true);
}
static bool
vm_object_collapse_scan(vm_object_t object, int op)
{
vm_object_t backing_object;
vm_page_t next, p, pp;
vm_pindex_t backing_offset_index, new_pindex;
VM_OBJECT_ASSERT_WLOCKED(object);
VM_OBJECT_ASSERT_WLOCKED(object->backing_object);
backing_object = object->backing_object;
backing_offset_index = OFF_TO_IDX(object->backing_object_offset);
/*
* Initial conditions
*/
if ((op & OBSC_COLLAPSE_WAIT) != 0)
vm_object_set_flag(backing_object, OBJ_DEAD);
/*
* Our scan
*/
for (p = TAILQ_FIRST(&backing_object->memq); p != NULL; p = next) {
next = TAILQ_NEXT(p, listq);
new_pindex = p->pindex - backing_offset_index;
/*
* Check for busy page
*/
if (vm_page_busied(p)) {
next = vm_object_collapse_scan_wait(object, p, next, op);
continue;
}
KASSERT(p->object == backing_object,
("vm_object_collapse_scan: object mismatch"));
if (p->pindex < backing_offset_index ||
new_pindex >= object->size) {
if (backing_object->type == OBJT_SWAP)
swap_pager_freespace(backing_object, p->pindex,
1);
/*
* Page is out of the parent object's range, so we can
* simply destroy it.
*/
vm_page_lock(p);
KASSERT(!pmap_page_is_mapped(p),
("freeing mapped page %p", p));
- if (!vm_page_wired(p))
+ if (vm_page_remove(p))
vm_page_free(p);
- else
- vm_page_remove(p);
vm_page_unlock(p);
continue;
}
pp = vm_page_lookup(object, new_pindex);
if (pp != NULL && vm_page_busied(pp)) {
/*
* The page in the parent is busy and possibly not
* (yet) valid. Until its state is finalized by the
* busy bit owner, we can't tell whether it shadows the
* original page. Therefore, we must either skip it
* and the original (backing_object) page or wait for
* its state to be finalized.
*
* This is due to a race with vm_fault() where we must
* unbusy the original (backing_obj) page before we can
* (re)lock the parent. Hence we can get here.
*/
next = vm_object_collapse_scan_wait(object, pp, next,
op);
continue;
}
KASSERT(pp == NULL || pp->valid != 0,
("unbusy invalid page %p", pp));
if (pp != NULL || vm_pager_has_page(object, new_pindex, NULL,
NULL)) {
/*
* The page already exists in the parent OR swap exists
* for this location in the parent. Leave the parent's
* page alone. Destroy the original page from the
* backing object.
*/
if (backing_object->type == OBJT_SWAP)
swap_pager_freespace(backing_object, p->pindex,
1);
vm_page_lock(p);
KASSERT(!pmap_page_is_mapped(p),
("freeing mapped page %p", p));
- if (!vm_page_wired(p))
+ if (vm_page_remove(p))
vm_page_free(p);
- else
- vm_page_remove(p);
vm_page_unlock(p);
continue;
}
/*
* Page does not exist in parent; rename the page from the
* backing object to the main object.
*
* If the page was mapped to a process, it can remain mapped
* through the rename. vm_page_rename() will dirty the page.
*/
if (vm_page_rename(p, object, new_pindex)) {
next = vm_object_collapse_scan_wait(object, NULL, next,
op);
continue;
}
/* Use the old pindex to free the right page. */
if (backing_object->type == OBJT_SWAP)
swap_pager_freespace(backing_object,
new_pindex + backing_offset_index, 1);
#if VM_NRESERVLEVEL > 0
/*
* Rename the reservation.
*/
vm_reserv_rename(p, object, backing_object,
backing_offset_index);
#endif
}
return (true);
}
/*
* This version of collapse allows the operation to occur earlier and
* when paging_in_progress is nonzero for an object. This is not a complete
* operation, but should plug 99.9% of the rest of the leaks.
*/
static void
vm_object_qcollapse(vm_object_t object)
{
vm_object_t backing_object = object->backing_object;
VM_OBJECT_ASSERT_WLOCKED(object);
VM_OBJECT_ASSERT_WLOCKED(backing_object);
if (backing_object->ref_count != 1)
return;
vm_object_collapse_scan(object, OBSC_COLLAPSE_NOWAIT);
}
/*
* vm_object_collapse:
*
* Collapse an object with the object backing it.
* Pages in the backing object are moved into the
* parent, and the backing object is deallocated.
*/
void
vm_object_collapse(vm_object_t object)
{
vm_object_t backing_object, new_backing_object;
VM_OBJECT_ASSERT_WLOCKED(object);
while (TRUE) {
/*
* Verify that the conditions are right for collapse:
*
* The object exists and the backing object exists.
*/
if ((backing_object = object->backing_object) == NULL)
break;
/*
* We check the backing object first, because it is most likely
* not collapsible.
*/
VM_OBJECT_WLOCK(backing_object);
if (backing_object->handle != NULL ||
(backing_object->type != OBJT_DEFAULT &&
backing_object->type != OBJT_SWAP) ||
(backing_object->flags & (OBJ_DEAD | OBJ_NOSPLIT)) != 0 ||
object->handle != NULL ||
(object->type != OBJT_DEFAULT &&
object->type != OBJT_SWAP) ||
(object->flags & OBJ_DEAD)) {
VM_OBJECT_WUNLOCK(backing_object);
break;
}
if (object->paging_in_progress != 0 ||
backing_object->paging_in_progress != 0) {
vm_object_qcollapse(object);
VM_OBJECT_WUNLOCK(backing_object);
break;
}
/*
* We know that we can either collapse the backing object (if
* the parent is the only reference to it) or (perhaps) have
* the parent bypass the object if the parent happens to shadow
* all the resident pages in the entire backing object.
*
* This is ignoring pager-backed pages such as swap pages.
* vm_object_collapse_scan fails the shadowing test in this
* case.
*/
if (backing_object->ref_count == 1) {
vm_object_pip_add(object, 1);
vm_object_pip_add(backing_object, 1);
/*
* If there is exactly one reference to the backing
* object, we can collapse it into the parent.
*/
vm_object_collapse_scan(object, OBSC_COLLAPSE_WAIT);
#if VM_NRESERVLEVEL > 0
/*
* Break any reservations from backing_object.
*/
if (__predict_false(!LIST_EMPTY(&backing_object->rvq)))
vm_reserv_break_all(backing_object);
#endif
/*
* Move the pager from backing_object to object.
*/
if (backing_object->type == OBJT_SWAP) {
/*
* swap_pager_copy() can sleep, in which case
* the backing_object's and object's locks are
* released and reacquired.
* Since swap_pager_copy() is being asked to
* destroy the source, it will change the
* backing_object's type to OBJT_DEFAULT.
*/
swap_pager_copy(
backing_object,
object,
OFF_TO_IDX(object->backing_object_offset), TRUE);
}
/*
* Object now shadows whatever backing_object did.
* Note that the reference to
* backing_object->backing_object moves from within
* backing_object to within object.
*/
LIST_REMOVE(object, shadow_list);
backing_object->shadow_count--;
if (backing_object->backing_object) {
VM_OBJECT_WLOCK(backing_object->backing_object);
LIST_REMOVE(backing_object, shadow_list);
LIST_INSERT_HEAD(
&backing_object->backing_object->shadow_head,
object, shadow_list);
/*
* The shadow_count has not changed.
*/
VM_OBJECT_WUNLOCK(backing_object->backing_object);
}
object->backing_object = backing_object->backing_object;
object->backing_object_offset +=
backing_object->backing_object_offset;
/*
* Discard backing_object.
*
* Since the backing object has no pages, no pager left,
* and no object references within it, all that is
* necessary is to dispose of it.
*/
KASSERT(backing_object->ref_count == 1, (
"backing_object %p was somehow re-referenced during collapse!",
backing_object));
vm_object_pip_wakeup(backing_object);
backing_object->type = OBJT_DEAD;
backing_object->ref_count = 0;
VM_OBJECT_WUNLOCK(backing_object);
vm_object_destroy(backing_object);
vm_object_pip_wakeup(object);
counter_u64_add(object_collapses, 1);
} else {
/*
* If we do not entirely shadow the backing object,
* there is nothing we can do so we give up.
*/
if (object->resident_page_count != object->size &&
!vm_object_scan_all_shadowed(object)) {
VM_OBJECT_WUNLOCK(backing_object);
break;
}
/*
* Make the parent shadow the next object in the
* chain. Deallocating backing_object will not remove
* it, since its reference count is at least 2.
*/
LIST_REMOVE(object, shadow_list);
backing_object->shadow_count--;
new_backing_object = backing_object->backing_object;
if ((object->backing_object = new_backing_object) != NULL) {
VM_OBJECT_WLOCK(new_backing_object);
LIST_INSERT_HEAD(
&new_backing_object->shadow_head,
object,
shadow_list
);
new_backing_object->shadow_count++;
vm_object_reference_locked(new_backing_object);
VM_OBJECT_WUNLOCK(new_backing_object);
object->backing_object_offset +=
backing_object->backing_object_offset;
}
/*
* Drop the reference count on backing_object. Since
* its ref_count was at least 2, it will not vanish.
*/
backing_object->ref_count--;
VM_OBJECT_WUNLOCK(backing_object);
counter_u64_add(object_bypasses, 1);
}
/*
* Try again with this object's new backing object.
*/
}
}
/*
* vm_object_page_remove:
*
* For the given object, either frees or invalidates each of the
* specified pages. In general, a page is freed. However, if a page is
* wired for any reason other than the existence of a managed, wired
* mapping, then it may be invalidated but not removed from the object.
* Pages are specified by the given range ["start", "end") and the option
* OBJPR_CLEANONLY. As a special case, if "end" is zero, then the range
* extends from "start" to the end of the object. If the option
* OBJPR_CLEANONLY is specified, then only the non-dirty pages within the
* specified range are affected. If the option OBJPR_NOTMAPPED is
* specified, then the pages within the specified range must have no
* mappings. Otherwise, if this option is not specified, any mappings to
* the specified pages are removed before the pages are freed or
* invalidated.
*
* In general, this operation should only be performed on objects that
* contain managed pages. There are, however, two exceptions. First, it
* is performed on the kernel and kmem objects by vm_map_entry_delete().
* Second, it is used by msync(..., MS_INVALIDATE) to invalidate device-
* backed pages. In both of these cases, the option OBJPR_CLEANONLY must
* not be specified and the option OBJPR_NOTMAPPED must be specified.
*
* The object must be locked.
*/
void
vm_object_page_remove(vm_object_t object, vm_pindex_t start, vm_pindex_t end,
int options)
{
vm_page_t p, next;
struct mtx *mtx;
VM_OBJECT_ASSERT_WLOCKED(object);
KASSERT((object->flags & OBJ_UNMANAGED) == 0 ||
(options & (OBJPR_CLEANONLY | OBJPR_NOTMAPPED)) == OBJPR_NOTMAPPED,
("vm_object_page_remove: illegal options for object %p", object));
if (object->resident_page_count == 0)
return;
vm_object_pip_add(object, 1);
again:
p = vm_page_find_least(object, start);
mtx = NULL;
/*
* Here, the variable "p" is either (1) the page with the least pindex
* greater than or equal to the parameter "start" or (2) NULL.
*/
for (; p != NULL && (p->pindex < end || end == 0); p = next) {
next = TAILQ_NEXT(p, listq);
/*
* If the page is wired for any reason besides the existence
* of managed, wired mappings, then it cannot be freed. For
* example, fictitious pages, which represent device memory,
* are inherently wired and cannot be freed. They can,
* however, be invalidated if the option OBJPR_CLEANONLY is
* not specified.
*/
vm_page_change_lock(p, &mtx);
if (vm_page_xbusied(p)) {
VM_OBJECT_WUNLOCK(object);
vm_page_busy_sleep(p, "vmopax", true);
VM_OBJECT_WLOCK(object);
goto again;
}
if (vm_page_wired(p)) {
if ((options & OBJPR_NOTMAPPED) == 0 &&
object->ref_count != 0)
pmap_remove_all(p);
if ((options & OBJPR_CLEANONLY) == 0) {
p->valid = 0;
vm_page_undirty(p);
}
continue;
}
if (vm_page_busied(p)) {
VM_OBJECT_WUNLOCK(object);
vm_page_busy_sleep(p, "vmopar", false);
VM_OBJECT_WLOCK(object);
goto again;
}
KASSERT((p->flags & PG_FICTITIOUS) == 0,
("vm_object_page_remove: page %p is fictitious", p));
if ((options & OBJPR_CLEANONLY) != 0 && p->valid != 0) {
if ((options & OBJPR_NOTMAPPED) == 0 &&
object->ref_count != 0)
pmap_remove_write(p);
if (p->dirty != 0)
continue;
}
if ((options & OBJPR_NOTMAPPED) == 0 && object->ref_count != 0)
pmap_remove_all(p);
vm_page_free(p);
}
if (mtx != NULL)
mtx_unlock(mtx);
vm_object_pip_wakeup(object);
}
/*
* vm_object_page_noreuse:
*
* For the given object, attempt to move the specified pages to
* the head of the inactive queue. This bypasses regular LRU
* operation and allows the pages to be reused quickly under memory
* pressure. If a page is wired for any reason, then it will not
* be queued. Pages are specified by the range ["start", "end").
* As a special case, if "end" is zero, then the range extends from
* "start" to the end of the object.
*
* This operation should only be performed on objects that
* contain non-fictitious, managed pages.
*
* The object must be locked.
*/
void
vm_object_page_noreuse(vm_object_t object, vm_pindex_t start, vm_pindex_t end)
{
struct mtx *mtx;
vm_page_t p, next;
VM_OBJECT_ASSERT_LOCKED(object);
KASSERT((object->flags & (OBJ_FICTITIOUS | OBJ_UNMANAGED)) == 0,
("vm_object_page_noreuse: illegal object %p", object));
if (object->resident_page_count == 0)
return;
p = vm_page_find_least(object, start);
/*
* Here, the variable "p" is either (1) the page with the least pindex
* greater than or equal to the parameter "start" or (2) NULL.
*/
mtx = NULL;
for (; p != NULL && (p->pindex < end || end == 0); p = next) {
next = TAILQ_NEXT(p, listq);
vm_page_change_lock(p, &mtx);
vm_page_deactivate_noreuse(p);
}
if (mtx != NULL)
mtx_unlock(mtx);
}
/*
* Populate the specified range of the object with valid pages. Returns
* TRUE if the range is successfully populated and FALSE otherwise.
*
* Note: This function should be optimized to pass a larger array of
* pages to vm_pager_get_pages() before it is applied to a non-
* OBJT_DEVICE object.
*
* The object must be locked.
*/
boolean_t
vm_object_populate(vm_object_t object, vm_pindex_t start, vm_pindex_t end)
{
vm_page_t m;
vm_pindex_t pindex;
int rv;
VM_OBJECT_ASSERT_WLOCKED(object);
for (pindex = start; pindex < end; pindex++) {
m = vm_page_grab(object, pindex, VM_ALLOC_NORMAL);
if (m->valid != VM_PAGE_BITS_ALL) {
rv = vm_pager_get_pages(object, &m, 1, NULL, NULL);
if (rv != VM_PAGER_OK) {
vm_page_lock(m);
vm_page_free(m);
vm_page_unlock(m);
break;
}
}
/*
* Keep "m" busy because a subsequent iteration may unlock
* the object.
*/
}
if (pindex > start) {
m = vm_page_lookup(object, start);
while (m != NULL && m->pindex < pindex) {
vm_page_xunbusy(m);
m = TAILQ_NEXT(m, listq);
}
}
return (pindex == end);
}
/*
* Routine: vm_object_coalesce
* Function: Coalesces two objects backing up adjoining
* regions of memory into a single object.
*
* returns TRUE if objects were combined.
*
* NOTE: Only works at the moment if the second object is NULL -
* if it's not, which object do we lock first?
*
* Parameters:
* prev_object First object to coalesce
* prev_offset Offset into prev_object
* prev_size Size of reference to prev_object
* next_size Size of reference to the second object
* reserved Indicator that extension region has
* swap accounted for
*
* Conditions:
* The object must *not* be locked.
*/
boolean_t
vm_object_coalesce(vm_object_t prev_object, vm_ooffset_t prev_offset,
vm_size_t prev_size, vm_size_t next_size, boolean_t reserved)
{
vm_pindex_t next_pindex;
if (prev_object == NULL)
return (TRUE);
VM_OBJECT_WLOCK(prev_object);
if ((prev_object->type != OBJT_DEFAULT &&
prev_object->type != OBJT_SWAP) ||
(prev_object->flags & OBJ_TMPFS_NODE) != 0) {
VM_OBJECT_WUNLOCK(prev_object);
return (FALSE);
}
/*
* Try to collapse the object first
*/
vm_object_collapse(prev_object);
/*
* Can't coalesce if the object:
* - has more than one reference,
* - is paged out,
* - shadows another object, or
* - has a copy elsewhere
* (any of which means that the pages not mapped to prev_entry may be
* in use anyway).
*/
if (prev_object->backing_object != NULL) {
VM_OBJECT_WUNLOCK(prev_object);
return (FALSE);
}
prev_size >>= PAGE_SHIFT;
next_size >>= PAGE_SHIFT;
next_pindex = OFF_TO_IDX(prev_offset) + prev_size;
if (prev_object->ref_count > 1 &&
prev_object->size != next_pindex &&
(prev_object->flags & OBJ_ONEMAPPING) == 0) {
VM_OBJECT_WUNLOCK(prev_object);
return (FALSE);
}
/*
* Account for the charge.
*/
if (prev_object->cred != NULL) {
/*
* If prev_object was charged, then this mapping,
* although not charged now, may become writable
* later. Non-NULL cred in the object would prevent
* swap reservation during enabling of the write
* access, so reserve swap now. A failed reservation
* causes allocation of a separate object for the map
* entry, and swap reservation for that entry is
* managed at the appropriate time.
*/
if (!reserved && !swap_reserve_by_cred(ptoa(next_size),
prev_object->cred)) {
VM_OBJECT_WUNLOCK(prev_object);
return (FALSE);
}
prev_object->charge += ptoa(next_size);
}
/*
* Remove any pages that may still be in the object from a previous
* deallocation.
*/
if (next_pindex < prev_object->size) {
vm_object_page_remove(prev_object, next_pindex, next_pindex +
next_size, 0);
if (prev_object->type == OBJT_SWAP)
swap_pager_freespace(prev_object,
next_pindex, next_size);
#if 0
if (prev_object->cred != NULL) {
KASSERT(prev_object->charge >=
ptoa(prev_object->size - next_pindex),
("object %p overcharged 1 %jx %jx", prev_object,
(uintmax_t)next_pindex, (uintmax_t)next_size));
prev_object->charge -= ptoa(prev_object->size -
next_pindex);
}
#endif
}
/*
* Extend the object if necessary.
*/
if (next_pindex + next_size > prev_object->size)
prev_object->size = next_pindex + next_size;
VM_OBJECT_WUNLOCK(prev_object);
return (TRUE);
}
void
vm_object_set_writeable_dirty(vm_object_t object)
{
VM_OBJECT_ASSERT_WLOCKED(object);
if (object->type != OBJT_VNODE) {
if ((object->flags & OBJ_TMPFS_NODE) != 0) {
KASSERT(object->type == OBJT_SWAP, ("non-swap tmpfs"));
vm_object_set_flag(object, OBJ_TMPFS_DIRTY);
}
return;
}
object->generation++;
if ((object->flags & OBJ_MIGHTBEDIRTY) != 0)
return;
vm_object_set_flag(object, OBJ_MIGHTBEDIRTY);
}
/*
* vm_object_unwire:
*
* For each page offset within the specified range of the given object,
* find the highest-level page in the shadow chain and unwire it. A page
* must exist at every page offset, and the highest-level page must be
* wired.
*/
void
vm_object_unwire(vm_object_t object, vm_ooffset_t offset, vm_size_t length,
uint8_t queue)
{
vm_object_t tobject, t1object;
vm_page_t m, tm;
vm_pindex_t end_pindex, pindex, tpindex;
int depth, locked_depth;
KASSERT((offset & PAGE_MASK) == 0,
("vm_object_unwire: offset is not page aligned"));
KASSERT((length & PAGE_MASK) == 0,
("vm_object_unwire: length is not a multiple of PAGE_SIZE"));
/* The wired count of a fictitious page never changes. */
if ((object->flags & OBJ_FICTITIOUS) != 0)
return;
pindex = OFF_TO_IDX(offset);
end_pindex = pindex + atop(length);
again:
locked_depth = 1;
VM_OBJECT_RLOCK(object);
m = vm_page_find_least(object, pindex);
while (pindex < end_pindex) {
if (m == NULL || pindex < m->pindex) {
/*
* The first object in the shadow chain doesn't
* contain a page at the current index. Therefore,
* the page must exist in a backing object.
*/
tobject = object;
tpindex = pindex;
depth = 0;
do {
tpindex +=
OFF_TO_IDX(tobject->backing_object_offset);
tobject = tobject->backing_object;
KASSERT(tobject != NULL,
("vm_object_unwire: missing page"));
if ((tobject->flags & OBJ_FICTITIOUS) != 0)
goto next_page;
depth++;
if (depth == locked_depth) {
locked_depth++;
VM_OBJECT_RLOCK(tobject);
}
} while ((tm = vm_page_lookup(tobject, tpindex)) ==
NULL);
} else {
tm = m;
m = TAILQ_NEXT(m, listq);
}
vm_page_lock(tm);
if (vm_page_xbusied(tm)) {
for (tobject = object; locked_depth >= 1;
locked_depth--) {
t1object = tobject->backing_object;
VM_OBJECT_RUNLOCK(tobject);
tobject = t1object;
}
vm_page_busy_sleep(tm, "unwbo", true);
goto again;
}
vm_page_unwire(tm, queue);
vm_page_unlock(tm);
next_page:
pindex++;
}
/* Release the accumulated object locks. */
for (tobject = object; locked_depth >= 1; locked_depth--) {
t1object = tobject->backing_object;
VM_OBJECT_RUNLOCK(tobject);
tobject = t1object;
}
}
/*
* Return the vnode for the given object, or NULL if none exists.
* For tmpfs objects, the function may return NULL if there is
* no vnode allocated at the time of the call.
*/
struct vnode *
vm_object_vnode(vm_object_t object)
{
struct vnode *vp;
VM_OBJECT_ASSERT_LOCKED(object);
if (object->type == OBJT_VNODE) {
vp = object->handle;
KASSERT(vp != NULL, ("%s: OBJT_VNODE has no vnode", __func__));
} else if (object->type == OBJT_SWAP &&
(object->flags & OBJ_TMPFS) != 0) {
vp = object->un_pager.swp.swp_tmpfs;
KASSERT(vp != NULL, ("%s: OBJT_TMPFS has no vnode", __func__));
} else {
vp = NULL;
}
return (vp);
}
/*
* Return the kvme type of the given object.
* If vpp is not NULL, set it to the object's vm_object_vnode() or NULL.
*/
int
vm_object_kvme_type(vm_object_t object, struct vnode **vpp)
{
VM_OBJECT_ASSERT_LOCKED(object);
if (vpp != NULL)
*vpp = vm_object_vnode(object);
switch (object->type) {
case OBJT_DEFAULT:
return (KVME_TYPE_DEFAULT);
case OBJT_VNODE:
return (KVME_TYPE_VNODE);
case OBJT_SWAP:
if ((object->flags & OBJ_TMPFS_NODE) != 0)
return (KVME_TYPE_VNODE);
return (KVME_TYPE_SWAP);
case OBJT_DEVICE:
return (KVME_TYPE_DEVICE);
case OBJT_PHYS:
return (KVME_TYPE_PHYS);
case OBJT_DEAD:
return (KVME_TYPE_DEAD);
case OBJT_SG:
return (KVME_TYPE_SG);
case OBJT_MGTDEVICE:
return (KVME_TYPE_MGTDEVICE);
default:
return (KVME_TYPE_UNKNOWN);
}
}
static int
sysctl_vm_object_list(SYSCTL_HANDLER_ARGS)
{
struct kinfo_vmobject *kvo;
char *fullpath, *freepath;
struct vnode *vp;
struct vattr va;
vm_object_t obj;
vm_page_t m;
int count, error;
if (req->oldptr == NULL) {
/*
* If an old buffer has not been provided, generate an
* estimate of the space needed for a subsequent call.
*/
mtx_lock(&vm_object_list_mtx);
count = 0;
TAILQ_FOREACH(obj, &vm_object_list, object_list) {
if (obj->type == OBJT_DEAD)
continue;
count++;
}
mtx_unlock(&vm_object_list_mtx);
return (SYSCTL_OUT(req, NULL, sizeof(struct kinfo_vmobject) *
count * 11 / 10));
}
kvo = malloc(sizeof(*kvo), M_TEMP, M_WAITOK);
error = 0;
/*
* VM objects are type stable and are never removed from the
* list once added. This allows us to safely read obj->object_list
* after reacquiring the VM object lock.
*/
mtx_lock(&vm_object_list_mtx);
TAILQ_FOREACH(obj, &vm_object_list, object_list) {
if (obj->type == OBJT_DEAD)
continue;
VM_OBJECT_RLOCK(obj);
if (obj->type == OBJT_DEAD) {
VM_OBJECT_RUNLOCK(obj);
continue;
}
mtx_unlock(&vm_object_list_mtx);
kvo->kvo_size = ptoa(obj->size);
kvo->kvo_resident = obj->resident_page_count;
kvo->kvo_ref_count = obj->ref_count;
kvo->kvo_shadow_count = obj->shadow_count;
kvo->kvo_memattr = obj->memattr;
kvo->kvo_active = 0;
kvo->kvo_inactive = 0;
TAILQ_FOREACH(m, &obj->memq, listq) {
/*
* A page may belong to the object but be
* dequeued and set to PQ_NONE while the
* object lock is not held. This makes the
* reads of m->queue below racy, and we do not
* count pages set to PQ_NONE. However, this
* sysctl is only meant to give an
* approximation of the system anyway.
*/
if (m->queue == PQ_ACTIVE)
kvo->kvo_active++;
else if (m->queue == PQ_INACTIVE)
kvo->kvo_inactive++;
}
kvo->kvo_vn_fileid = 0;
kvo->kvo_vn_fsid = 0;
kvo->kvo_vn_fsid_freebsd11 = 0;
freepath = NULL;
fullpath = "";
kvo->kvo_type = vm_object_kvme_type(obj, &vp);
if (vp != NULL)
vref(vp);
VM_OBJECT_RUNLOCK(obj);
if (vp != NULL) {
vn_fullpath(curthread, vp, &fullpath, &freepath);
vn_lock(vp, LK_SHARED | LK_RETRY);
if (VOP_GETATTR(vp, &va, curthread->td_ucred) == 0) {
kvo->kvo_vn_fileid = va.va_fileid;
kvo->kvo_vn_fsid = va.va_fsid;
kvo->kvo_vn_fsid_freebsd11 = va.va_fsid;
/* truncate */
}
vput(vp);
}
strlcpy(kvo->kvo_path, fullpath, sizeof(kvo->kvo_path));
if (freepath != NULL)
free(freepath, M_TEMP);
/* Pack record size down */
kvo->kvo_structsize = offsetof(struct kinfo_vmobject, kvo_path)
+ strlen(kvo->kvo_path) + 1;
kvo->kvo_structsize = roundup(kvo->kvo_structsize,
sizeof(uint64_t));
error = SYSCTL_OUT(req, kvo, kvo->kvo_structsize);
mtx_lock(&vm_object_list_mtx);
if (error)
break;
}
mtx_unlock(&vm_object_list_mtx);
free(kvo, M_TEMP);
return (error);
}
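/*
 * Illustrative sketch, not part of this revision: the "pack record size
 * down" step in sysctl_vm_object_list() emits only the fixed header plus
 * the NUL-terminated path, padded to an 8-byte boundary.  The userland
 * code below mimics that arithmetic.  struct demo_record, demo_roundup()
 * and demo_pack() are hypothetical names, and the field layout is an
 * assumption, not the layout of struct kinfo_vmobject.
 */
```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a record with a variable-length string tail. */
struct demo_record {
	int32_t		structsize;	/* packed size of this record */
	int32_t		type;
	uint64_t	size;
	char		path[1024];	/* NUL-terminated, mostly unused */
};

/* Round x up to the next multiple of align (align must be nonzero). */
static size_t
demo_roundup(size_t x, size_t align)
{
	return (((x + align - 1) / align) * align);
}

/*
 * Copy path into the record and return the number of bytes worth
 * emitting: header, string and its NUL, padded to sizeof(uint64_t).
 */
static size_t
demo_pack(struct demo_record *r, const char *path)
{
	size_t len;

	assert(r != NULL && path != NULL);
	strncpy(r->path, path, sizeof(r->path) - 1);
	r->path[sizeof(r->path) - 1] = '\0';
	len = offsetof(struct demo_record, path) + strlen(r->path) + 1;
	len = demo_roundup(len, sizeof(uint64_t));
	r->structsize = (int32_t)len;
	return (len);
}
```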
SYSCTL_PROC(_vm, OID_AUTO, objects, CTLTYPE_STRUCT | CTLFLAG_RW | CTLFLAG_SKIP |
CTLFLAG_MPSAFE, NULL, 0, sysctl_vm_object_list, "S,kinfo_vmobject",
"List of VM objects");
#include "opt_ddb.h"
#ifdef DDB
#include <sys/kernel.h>
#include <sys/cons.h>
#include <ddb/ddb.h>
static int
_vm_object_in_map(vm_map_t map, vm_object_t object, vm_map_entry_t entry)
{
vm_map_t tmpm;
vm_map_entry_t tmpe;
vm_object_t obj;
int entcount;
if (map == 0)
return 0;
if (entry == 0) {
tmpe = map->header.next;
entcount = map->nentries;
while (entcount-- && (tmpe != &map->header)) {
if (_vm_object_in_map(map, object, tmpe)) {
return 1;
}
tmpe = tmpe->next;
}
} else if (entry->eflags & MAP_ENTRY_IS_SUB_MAP) {
tmpm = entry->object.sub_map;
tmpe = tmpm->header.next;
entcount = tmpm->nentries;
while (entcount-- && tmpe != &tmpm->header) {
if (_vm_object_in_map(tmpm, object, tmpe)) {
return 1;
}
tmpe = tmpe->next;
}
} else if ((obj = entry->object.vm_object) != NULL) {
for (; obj; obj = obj->backing_object)
if (obj == object) {
return 1;
}
}
return 0;
}
static int
vm_object_in_map(vm_object_t object)
{
struct proc *p;
/* sx_slock(&allproc_lock); */
FOREACH_PROC_IN_SYSTEM(p) {
if (!p->p_vmspace /* || (p->p_flag & (P_SYSTEM|P_WEXIT)) */)
continue;
if (_vm_object_in_map(&p->p_vmspace->vm_map, object, 0)) {
/* sx_sunlock(&allproc_lock); */
return 1;
}
}
/* sx_sunlock(&allproc_lock); */
if (_vm_object_in_map(kernel_map, object, 0))
return 1;
return 0;
}
DB_SHOW_COMMAND(vmochk, vm_object_check)
{
vm_object_t object;
/*
* make sure that internal objs are in a map somewhere
* and none have zero ref counts.
*/
TAILQ_FOREACH(object, &vm_object_list, object_list) {
if (object->handle == NULL &&
(object->type == OBJT_DEFAULT || object->type == OBJT_SWAP)) {
if (object->ref_count == 0) {
db_printf("vmochk: internal obj has zero ref count: %ld\n",
(long)object->size);
}
if (!vm_object_in_map(object)) {
db_printf(
"vmochk: internal obj is not in a map: "
"ref: %d, size: %lu: 0x%lx, backing_object: %p\n",
object->ref_count, (u_long)object->size,
(u_long)object->size,
(void *)object->backing_object);
}
}
}
}
/*
* vm_object_print: [ debug ]
*/
DB_SHOW_COMMAND(object, vm_object_print_static)
{
/* XXX convert args. */
vm_object_t object = (vm_object_t)addr;
boolean_t full = have_addr;
vm_page_t p;
/* XXX count is an (unused) arg. Avoid shadowing it. */
#define count was_count
int count;
if (object == NULL)
return;
db_iprintf(
"Object %p: type=%d, size=0x%jx, res=%d, ref=%d, flags=0x%x ruid %d charge %jx\n",
object, (int)object->type, (uintmax_t)object->size,
object->resident_page_count, object->ref_count, object->flags,
object->cred ? object->cred->cr_ruid : -1, (uintmax_t)object->charge);
db_iprintf(" sref=%d, backing_object(%d)=(%p)+0x%jx\n",
object->shadow_count,
object->backing_object ? object->backing_object->ref_count : 0,
object->backing_object, (uintmax_t)object->backing_object_offset);
if (!full)
return;
db_indent += 2;
count = 0;
TAILQ_FOREACH(p, &object->memq, listq) {
if (count == 0)
db_iprintf("memory:=");
else if (count == 6) {
db_printf("\n");
db_iprintf(" ...");
count = 0;
} else
db_printf(",");
count++;
db_printf("(off=0x%jx,page=0x%jx)",
(uintmax_t)p->pindex, (uintmax_t)VM_PAGE_TO_PHYS(p));
}
if (count != 0)
db_printf("\n");
db_indent -= 2;
}
/* XXX. */
#undef count
/* XXX need this non-static entry for calling from vm_map_print. */
void
vm_object_print(
/* db_expr_t */ long addr,
boolean_t have_addr,
/* db_expr_t */ long count,
char *modif)
{
vm_object_print_static(addr, have_addr, count, modif);
}
DB_SHOW_COMMAND(vmopag, vm_object_print_pages)
{
vm_object_t object;
vm_pindex_t fidx;
vm_paddr_t pa;
vm_page_t m, prev_m;
int rcount, nl, c;
nl = 0;
TAILQ_FOREACH(object, &vm_object_list, object_list) {
db_printf("new object: %p\n", (void *)object);
if (nl > 18) {
c = cngetc();
if (c != ' ')
return;
nl = 0;
}
nl++;
rcount = 0;
fidx = 0;
pa = -1;
TAILQ_FOREACH(m, &object->memq, listq) {
if (m->pindex > 128)
break;
if ((prev_m = TAILQ_PREV(m, pglist, listq)) != NULL &&
prev_m->pindex + 1 != m->pindex) {
if (rcount) {
db_printf(" index(%ld)run(%d)pa(0x%lx)\n",
(long)fidx, rcount, (long)pa);
if (nl > 18) {
c = cngetc();
if (c != ' ')
return;
nl = 0;
}
nl++;
rcount = 0;
}
}
if (rcount &&
(VM_PAGE_TO_PHYS(m) == pa + rcount * PAGE_SIZE)) {
++rcount;
continue;
}
if (rcount) {
db_printf(" index(%ld)run(%d)pa(0x%lx)\n",
(long)fidx, rcount, (long)pa);
if (nl > 18) {
c = cngetc();
if (c != ' ')
return;
nl = 0;
}
nl++;
}
fidx = m->pindex;
pa = VM_PAGE_TO_PHYS(m);
rcount = 1;
}
if (rcount) {
db_printf(" index(%ld)run(%d)pa(0x%lx)\n",
(long)fidx, rcount, (long)pa);
if (nl > 18) {
c = cngetc();
if (c != ' ')
return;
nl = 0;
}
nl++;
}
}
}
#endif /* DDB */
Index: head/sys/vm/vm_page.c
===================================================================
--- head/sys/vm/vm_page.c (revision 349431)
+++ head/sys/vm/vm_page.c (revision 349432)
@@ -1,4528 +1,4531 @@
/*-
* SPDX-License-Identifier: (BSD-3-Clause AND MIT-CMU)
*
* Copyright (c) 1991 Regents of the University of California.
* All rights reserved.
* Copyright (c) 1998 Matthew Dillon. All Rights Reserved.
*
* This code is derived from software contributed to Berkeley by
* The Mach Operating System project at Carnegie-Mellon University.
*
* Redistribution and use in source and binary forms, with or without
* modification, are permitted provided that the following conditions
* are met:
* 1. Redistributions of source code must retain the above copyright
* notice, this list of conditions and the following disclaimer.
* 2. Redistributions in binary form must reproduce the above copyright
* notice, this list of conditions and the following disclaimer in the
* documentation and/or other materials provided with the distribution.
* 3. Neither the name of the University nor the names of its contributors
* may be used to endorse or promote products derived from this software
* without specific prior written permission.
*
* THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
* ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
* IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
* ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
* FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
* DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
* OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
* HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
* LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
* OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
* SUCH DAMAGE.
*
* from: @(#)vm_page.c 7.4 (Berkeley) 5/7/91
*/
/*-
* Copyright (c) 1987, 1990 Carnegie-Mellon University.
* All rights reserved.
*
* Authors: Avadis Tevanian, Jr., Michael Wayne Young
*
* Permission to use, copy, modify and distribute this software and
* its documentation is hereby granted, provided that both the copyright
* notice and this permission notice appear in all copies of the
* software, derivative works or modified versions, and any portions
* thereof, and that both notices appear in supporting documentation.
*
* CARNEGIE MELLON ALLOWS FREE USE OF THIS SOFTWARE IN ITS "AS IS"
* CONDITION. CARNEGIE MELLON DISCLAIMS ANY LIABILITY OF ANY KIND
* FOR ANY DAMAGES WHATSOEVER RESULTING FROM THE USE OF THIS SOFTWARE.
*
* Carnegie Mellon requests users of this software to return to
*
* Software Distribution Coordinator or Software.Distribution@CS.CMU.EDU
* School of Computer Science
* Carnegie Mellon University
* Pittsburgh PA 15213-3890
*
* any improvements or extensions that they make and grant Carnegie the
* rights to redistribute these changes.
*/
/*
* Resident memory management module.
*/
#include
__FBSDID("$FreeBSD$");
#include "opt_vm.h"
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
extern int uma_startup_count(int);
extern void uma_startup(void *, int);
extern int vmem_startup_count(void);
struct vm_domain vm_dom[MAXMEMDOM];
DPCPU_DEFINE_STATIC(struct vm_batchqueue, pqbatch[MAXMEMDOM][PQ_COUNT]);
struct mtx_padalign __exclusive_cache_line pa_lock[PA_LOCK_COUNT];
struct mtx_padalign __exclusive_cache_line vm_domainset_lock;
/* The following fields are protected by the domainset lock. */
domainset_t __exclusive_cache_line vm_min_domains;
domainset_t __exclusive_cache_line vm_severe_domains;
static int vm_min_waiters;
static int vm_severe_waiters;
static int vm_pageproc_waiters;
/*
* bogus page -- for I/O to/from partially complete buffers,
* or for paging into sparsely invalid regions.
*/
vm_page_t bogus_page;
vm_page_t vm_page_array;
long vm_page_array_size;
long first_page;
static int boot_pages;
SYSCTL_INT(_vm, OID_AUTO, boot_pages, CTLFLAG_RDTUN | CTLFLAG_NOFETCH,
&boot_pages, 0,
"number of pages allocated for bootstrapping the VM system");
static int pa_tryrelock_restart;
SYSCTL_INT(_vm, OID_AUTO, tryrelock_restart, CTLFLAG_RD,
&pa_tryrelock_restart, 0, "Number of tryrelock restarts");
static TAILQ_HEAD(, vm_page) blacklist_head;
static int sysctl_vm_page_blacklist(SYSCTL_HANDLER_ARGS);
SYSCTL_PROC(_vm, OID_AUTO, page_blacklist, CTLTYPE_STRING | CTLFLAG_RD |
CTLFLAG_MPSAFE, NULL, 0, sysctl_vm_page_blacklist, "A", "Blacklist pages");
static uma_zone_t fakepg_zone;
static void vm_page_alloc_check(vm_page_t m);
static void vm_page_clear_dirty_mask(vm_page_t m, vm_page_bits_t pagebits);
static void vm_page_dequeue_complete(vm_page_t m);
static void vm_page_enqueue(vm_page_t m, uint8_t queue);
static void vm_page_init(void *dummy);
static int vm_page_insert_after(vm_page_t m, vm_object_t object,
vm_pindex_t pindex, vm_page_t mpred);
static void vm_page_insert_radixdone(vm_page_t m, vm_object_t object,
vm_page_t mpred);
static int vm_page_reclaim_run(int req_class, int domain, u_long npages,
vm_page_t m_run, vm_paddr_t high);
static int vm_domain_alloc_fail(struct vm_domain *vmd, vm_object_t object,
int req);
static int vm_page_import(void *arg, void **store, int cnt, int domain,
int flags);
static void vm_page_release(void *arg, void **store, int cnt);
SYSINIT(vm_page, SI_SUB_VM, SI_ORDER_SECOND, vm_page_init, NULL);
static void
vm_page_init(void *dummy)
{
fakepg_zone = uma_zcreate("fakepg", sizeof(struct vm_page), NULL, NULL,
NULL, NULL, UMA_ALIGN_PTR, UMA_ZONE_NOFREE | UMA_ZONE_VM);
bogus_page = vm_page_alloc(NULL, 0, VM_ALLOC_NOOBJ |
VM_ALLOC_NORMAL | VM_ALLOC_WIRED);
}
/*
* The cache page zone is initialized later since we need to be able to allocate
* pages before UMA is fully initialized.
*/
static void
vm_page_init_cache_zones(void *dummy __unused)
{
struct vm_domain *vmd;
int i;
for (i = 0; i < vm_ndomains; i++) {
vmd = VM_DOMAIN(i);
/*
* Don't allow the page cache to take up more than .25% of
* memory.
*/
if (vmd->vmd_page_count / 400 < 256 * mp_ncpus)
continue;
vmd->vmd_pgcache = uma_zcache_create("vm pgcache",
sizeof(struct vm_page), NULL, NULL, NULL, NULL,
vm_page_import, vm_page_release, vmd,
UMA_ZONE_MAXBUCKET | UMA_ZONE_VM);
(void)uma_zone_set_maxcache(vmd->vmd_pgcache, 0);
}
}
SYSINIT(vm_page2, SI_SUB_VM_CONF, SI_ORDER_ANY, vm_page_init_cache_zones, NULL);
/* Make sure that u_long is at least 64 bits when PAGE_SIZE is 32K. */
#if PAGE_SIZE == 32768
#ifdef CTASSERT
CTASSERT(sizeof(u_long) >= 8);
#endif
#endif
/*
* Try to acquire a physical address lock while a pmap is locked. If we
* fail to trylock we unlock and lock the pmap directly and cache the
* locked pa in *locked. The caller should then restart their loop in case
* the virtual to physical mapping has changed.
*/
int
vm_page_pa_tryrelock(pmap_t pmap, vm_paddr_t pa, vm_paddr_t *locked)
{
vm_paddr_t lockpa;
lockpa = *locked;
*locked = pa;
if (lockpa) {
PA_LOCK_ASSERT(lockpa, MA_OWNED);
if (PA_LOCKPTR(pa) == PA_LOCKPTR(lockpa))
return (0);
PA_UNLOCK(lockpa);
}
if (PA_TRYLOCK(pa))
return (0);
PMAP_UNLOCK(pmap);
atomic_add_int(&pa_tryrelock_restart, 1);
PA_LOCK(pa);
PMAP_LOCK(pmap);
return (EAGAIN);
}
/*
* vm_set_page_size:
*
* Sets the page size, perhaps based upon the memory
* size. Must be called before any use of page-size
* dependent functions.
*/
void
vm_set_page_size(void)
{
if (vm_cnt.v_page_size == 0)
vm_cnt.v_page_size = PAGE_SIZE;
if (((vm_cnt.v_page_size - 1) & vm_cnt.v_page_size) != 0)
panic("vm_set_page_size: page size not a power of two");
}
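/*
 * Illustrative sketch, not part of this revision: vm_set_page_size()
 * relies on the identity that a nonzero power of two shares no set bits
 * with its predecessor, so ((x - 1) & x) == 0.  The userland helpers
 * below (hypothetical names) show the same default-then-validate flow,
 * with assert() standing in for panic().
 */
```c
#include <assert.h>
#include <stdbool.h>

/* True only for 1, 2, 4, 8, ...; zero is rejected explicitly. */
static bool
demo_is_power_of_two(unsigned int x)
{
	return (x != 0 && ((x - 1) & x) == 0);
}

/* Apply the default when unset, then insist on a power of two. */
static unsigned int
demo_set_page_size(unsigned int configured, unsigned int dflt)
{
	if (configured == 0)
		configured = dflt;
	assert(demo_is_power_of_two(configured));
	return (configured);
}
```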
/*
* vm_page_blacklist_next:
*
* Find the next entry in the provided string of blacklist
* addresses. Entries are separated by space, comma, or newline.
* If an invalid integer is encountered then the rest of the
* string is skipped. Updates the list pointer to the next
* character, or NULL if the string is exhausted or invalid.
*/
static vm_paddr_t
vm_page_blacklist_next(char **list, char *end)
{
vm_paddr_t bad;
char *cp, *pos;
if (list == NULL || *list == NULL)
return (0);
if (**list =='\0') {
*list = NULL;
return (0);
}
/*
* If there's no end pointer then the buffer is coming from
* the kenv and we know it's null-terminated.
*/
if (end == NULL)
end = *list + strlen(*list);
/* Ensure that strtoq() won't walk off the end */
if (*end != '\0') {
if (*end == '\n' || *end == ' ' || *end == ',')
*end = '\0';
else {
printf("Blacklist not terminated, skipping\n");
*list = NULL;
return (0);
}
}
for (pos = *list; *pos != '\0'; pos = cp) {
bad = strtoq(pos, &cp, 0);
if (*cp == '\0' || *cp == ' ' || *cp == ',' || *cp == '\n') {
if (bad == 0) {
if (++cp < end)
continue;
else
break;
}
} else
break;
if (*cp == '\0' || ++cp >= end)
*list = NULL;
else
*list = cp;
return (trunc_page(bad));
}
printf("Garbage in RAM blacklist, skipping\n");
*list = NULL;
return (0);
}
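/*
 * Illustrative sketch, not part of this revision: a simplified userland
 * analogue of the parsing done by vm_page_blacklist_next().  It walks a
 * NUL-terminated list of addresses separated by space, comma or newline,
 * skips empty fields, stops at the first malformed entry, and returns
 * each address truncated to a page boundary.  Unlike the kernel routine
 * it takes no end pointer, never modifies the buffer and prints nothing.
 * DEMO_PAGE_SIZE and the demo_* names are assumptions.
 */
```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define DEMO_PAGE_SIZE	4096ULL

static int
demo_is_sep(char c)
{
	return (c == ' ' || c == ',' || c == '\n');
}

/*
 * Return the next page-aligned address from *cursor, or 0 when the
 * string is exhausted or malformed.  *cursor becomes NULL once there
 * is nothing left to parse.
 */
static uint64_t
demo_blacklist_next(const char **cursor)
{
	const char *pos;
	char *endp;
	uint64_t addr;

	assert(cursor != NULL);
	pos = *cursor;
	while (pos != NULL && *pos != '\0') {
		addr = strtoull(pos, &endp, 0);
		if (*endp != '\0' && !demo_is_sep(*endp))
			break;		/* garbage after the number */
		/* Step past one separator, or stop at the terminator. */
		pos = (*endp == '\0') ? NULL : endp + 1;
		if (addr == 0)
			continue;	/* empty field or literal zero */
		*cursor = (pos != NULL && *pos != '\0') ? pos : NULL;
		return (addr & ~(DEMO_PAGE_SIZE - 1));
	}
	*cursor = NULL;
	return (0);
}
```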
bool
vm_page_blacklist_add(vm_paddr_t pa, bool verbose)
{
struct vm_domain *vmd;
vm_page_t m;
int ret;
m = vm_phys_paddr_to_vm_page(pa);
if (m == NULL)
return (true); /* page does not exist, no failure */
vmd = vm_pagequeue_domain(m);
vm_domain_free_lock(vmd);
ret = vm_phys_unfree_page(m);
vm_domain_free_unlock(vmd);
if (ret != 0) {
vm_domain_freecnt_inc(vmd, -1);
TAILQ_INSERT_TAIL(&blacklist_head, m, listq);
if (verbose)
printf("Skipping page with pa 0x%jx\n", (uintmax_t)pa);
}
return (ret);
}
/*
* vm_page_blacklist_check:
*
* Iterate through the provided string of blacklist addresses, pulling
* each entry out of the physical allocator free list and putting it
* onto a list for reporting via the vm.page_blacklist sysctl.
*/
static void
vm_page_blacklist_check(char *list, char *end)
{
vm_paddr_t pa;
char *next;
next = list;
while (next != NULL) {
if ((pa = vm_page_blacklist_next(&next, end)) == 0)
continue;
vm_page_blacklist_add(pa, bootverbose);
}
}
/*
* vm_page_blacklist_load:
*
* Search for a special module named "ram_blacklist". It'll be a
* plain text file provided by the user via the loader directive
* of the same name.
*/
static void
vm_page_blacklist_load(char **list, char **end)
{
void *mod;
u_char *ptr;
u_int len;
mod = NULL;
ptr = NULL;
mod = preload_search_by_type("ram_blacklist");
if (mod != NULL) {
ptr = preload_fetch_addr(mod);
len = preload_fetch_size(mod);
}
*list = ptr;
if (ptr != NULL)
*end = ptr + len;
else
*end = NULL;
return;
}
static int
sysctl_vm_page_blacklist(SYSCTL_HANDLER_ARGS)
{
vm_page_t m;
struct sbuf sbuf;
int error, first;
first = 1;
error = sysctl_wire_old_buffer(req, 0);
if (error != 0)
return (error);
sbuf_new_for_sysctl(&sbuf, NULL, 128, req);
TAILQ_FOREACH(m, &blacklist_head, listq) {
sbuf_printf(&sbuf, "%s%#jx", first ? "" : ",",
(uintmax_t)m->phys_addr);
first = 0;
}
error = sbuf_finish(&sbuf);
sbuf_delete(&sbuf);
return (error);
}
/*
* Initialize a dummy page for use in scans of the specified paging queue.
* In principle, this function only needs to set the flag PG_MARKER.
* Nonetheless, it write busies and initializes the hold count to one as
* safety precautions.
*/
static void
vm_page_init_marker(vm_page_t marker, int queue, uint8_t aflags)
{
bzero(marker, sizeof(*marker));
marker->flags = PG_MARKER;
marker->aflags = aflags;
marker->busy_lock = VPB_SINGLE_EXCLUSIVER;
marker->queue = queue;
marker->hold_count = 1;
}
static void
vm_page_domain_init(int domain)
{
struct vm_domain *vmd;
struct vm_pagequeue *pq;
int i;
vmd = VM_DOMAIN(domain);
bzero(vmd, sizeof(*vmd));
*__DECONST(char **, &vmd->vmd_pagequeues[PQ_INACTIVE].pq_name) =
"vm inactive pagequeue";
*__DECONST(char **, &vmd->vmd_pagequeues[PQ_ACTIVE].pq_name) =
"vm active pagequeue";
*__DECONST(char **, &vmd->vmd_pagequeues[PQ_LAUNDRY].pq_name) =
"vm laundry pagequeue";
*__DECONST(char **, &vmd->vmd_pagequeues[PQ_UNSWAPPABLE].pq_name) =
"vm unswappable pagequeue";
vmd->vmd_domain = domain;
vmd->vmd_page_count = 0;
vmd->vmd_free_count = 0;
vmd->vmd_segs = 0;
vmd->vmd_oom = FALSE;
for (i = 0; i < PQ_COUNT; i++) {
pq = &vmd->vmd_pagequeues[i];
TAILQ_INIT(&pq->pq_pl);
mtx_init(&pq->pq_mutex, pq->pq_name, "vm pagequeue",
MTX_DEF | MTX_DUPOK);
pq->pq_pdpages = 0;
vm_page_init_marker(&vmd->vmd_markers[i], i, 0);
}
mtx_init(&vmd->vmd_free_mtx, "vm page free queue", NULL, MTX_DEF);
mtx_init(&vmd->vmd_pageout_mtx, "vm pageout lock", NULL, MTX_DEF);
snprintf(vmd->vmd_name, sizeof(vmd->vmd_name), "%d", domain);
/*
* inacthead is used to provide FIFO ordering for LRU-bypassing
* insertions.
*/
vm_page_init_marker(&vmd->vmd_inacthead, PQ_INACTIVE, PGA_ENQUEUED);
TAILQ_INSERT_HEAD(&vmd->vmd_pagequeues[PQ_INACTIVE].pq_pl,
&vmd->vmd_inacthead, plinks.q);
/*
* The clock pages are used to implement active queue scanning without
* requeues. Scans start at clock[0], which is advanced after the scan
* ends. When the two clock hands meet, they are reset and scanning
* resumes from the head of the queue.
*/
vm_page_init_marker(&vmd->vmd_clock[0], PQ_ACTIVE, PGA_ENQUEUED);
vm_page_init_marker(&vmd->vmd_clock[1], PQ_ACTIVE, PGA_ENQUEUED);
TAILQ_INSERT_HEAD(&vmd->vmd_pagequeues[PQ_ACTIVE].pq_pl,
&vmd->vmd_clock[0], plinks.q);
TAILQ_INSERT_TAIL(&vmd->vmd_pagequeues[PQ_ACTIVE].pq_pl,
&vmd->vmd_clock[1], plinks.q);
}
/*
* Initialize a physical page in preparation for adding it to the free
* lists.
*/
static void
vm_page_init_page(vm_page_t m, vm_paddr_t pa, int segind)
{
m->object = NULL;
m->wire_count = 0;
m->busy_lock = VPB_UNBUSIED;
m->hold_count = 0;
m->flags = m->aflags = 0;
m->phys_addr = pa;
m->queue = PQ_NONE;
m->psind = 0;
m->segind = segind;
m->order = VM_NFREEORDER;
m->pool = VM_FREEPOOL_DEFAULT;
m->valid = m->dirty = 0;
pmap_page_init(m);
}
/*
* vm_page_startup:
*
* Initializes the resident memory module. Allocates physical memory for
* bootstrapping UMA and some data structures that are used to manage
* physical pages. Initializes these structures, and populates the free
* page queues.
*/
vm_offset_t
vm_page_startup(vm_offset_t vaddr)
{
struct vm_phys_seg *seg;
vm_page_t m;
char *list, *listend;
vm_offset_t mapped;
vm_paddr_t end, high_avail, low_avail, new_end, page_range, size;
vm_paddr_t biggestsize, last_pa, pa;
u_long pagecount;
int biggestone, i, segind;
#ifdef WITNESS
int witness_size;
#endif
#if defined(__i386__) && defined(VM_PHYSSEG_DENSE)
long ii;
#endif
biggestsize = 0;
biggestone = 0;
vaddr = round_page(vaddr);
for (i = 0; phys_avail[i + 1]; i += 2) {
phys_avail[i] = round_page(phys_avail[i]);
phys_avail[i + 1] = trunc_page(phys_avail[i + 1]);
}
for (i = 0; phys_avail[i + 1]; i += 2) {
size = phys_avail[i + 1] - phys_avail[i];
if (size > biggestsize) {
biggestone = i;
biggestsize = size;
}
}
end = phys_avail[biggestone+1];
/*
* Initialize the page and queue locks.
*/
mtx_init(&vm_domainset_lock, "vm domainset lock", NULL, MTX_DEF);
for (i = 0; i < PA_LOCK_COUNT; i++)
mtx_init(&pa_lock[i], "vm page", NULL, MTX_DEF);
for (i = 0; i < vm_ndomains; i++)
vm_page_domain_init(i);
/*
* Allocate memory for use when bootstrapping the kernel memory
* allocator. Tell UMA how many zones we are going to create
* before going fully functional. UMA will add its zones.
*
* VM startup zones: vmem, vmem_btag, VM OBJECT, RADIX NODE, MAP,
* KMAP ENTRY, MAP ENTRY, VMSPACE.
*/
boot_pages = uma_startup_count(8);
#ifndef UMA_MD_SMALL_ALLOC
/* vmem_startup() calls uma_prealloc(). */
boot_pages += vmem_startup_count();
/* vm_map_startup() calls uma_prealloc(). */
boot_pages += howmany(MAX_KMAP,
UMA_SLAB_SPACE / sizeof(struct vm_map));
/*
* Before going fully functional kmem_init() does allocation
* from "KMAP ENTRY" and vmem_create() does allocation from "vmem".
*/
boot_pages += 2;
#endif
/*
* CTLFLAG_RDTUN doesn't work during the early boot process, so we must
* manually fetch the value.
*/
TUNABLE_INT_FETCH("vm.boot_pages", &boot_pages);
new_end = end - (boot_pages * UMA_SLAB_SIZE);
new_end = trunc_page(new_end);
mapped = pmap_map(&vaddr, new_end, end,
VM_PROT_READ | VM_PROT_WRITE);
bzero((void *)mapped, end - new_end);
uma_startup((void *)mapped, boot_pages);
#ifdef WITNESS
witness_size = round_page(witness_startup_count());
new_end -= witness_size;
mapped = pmap_map(&vaddr, new_end, new_end + witness_size,
VM_PROT_READ | VM_PROT_WRITE);
bzero((void *)mapped, witness_size);
witness_startup((void *)mapped);
#endif
#if defined(__aarch64__) || defined(__amd64__) || defined(__arm__) || \
defined(__i386__) || defined(__mips__) || defined(__riscv)
/*
* Allocate a bitmap to indicate that a random physical page
* needs to be included in a minidump.
*
* The amd64 port needs this to indicate which direct map pages
* need to be dumped, via calls to dump_add_page()/dump_drop_page().
*
* However, i386 still needs this workspace internally within the
* minidump code. In theory, they are not needed on i386, but are
* included should the sf_buf code decide to use them.
*/
last_pa = 0;
for (i = 0; dump_avail[i + 1] != 0; i += 2)
if (dump_avail[i + 1] > last_pa)
last_pa = dump_avail[i + 1];
page_range = last_pa / PAGE_SIZE;
vm_page_dump_size = round_page(roundup2(page_range, NBBY) / NBBY);
new_end -= vm_page_dump_size;
vm_page_dump = (void *)(uintptr_t)pmap_map(&vaddr, new_end,
new_end + vm_page_dump_size, VM_PROT_READ | VM_PROT_WRITE);
bzero((void *)vm_page_dump, vm_page_dump_size);
#else
(void)last_pa;
#endif
#if defined(__aarch64__) || defined(__amd64__) || defined(__mips__) || \
defined(__riscv)
/*
* Include the UMA bootstrap pages, witness pages and vm_page_dump
* in a crash dump. When pmap_map() uses the direct map, they are
* not automatically included.
*/
for (pa = new_end; pa < end; pa += PAGE_SIZE)
dump_add_page(pa);
#endif
phys_avail[biggestone + 1] = new_end;
#ifdef __amd64__
/*
* Request that the physical pages underlying the message buffer be
* included in a crash dump. Since the message buffer is accessed
* through the direct map, they are not automatically included.
*/
pa = DMAP_TO_PHYS((vm_offset_t)msgbufp->msg_ptr);
last_pa = pa + round_page(msgbufsize);
while (pa < last_pa) {
dump_add_page(pa);
pa += PAGE_SIZE;
}
#endif
/*
* Compute the number of pages of memory that will be available for
* use, taking into account the overhead of a page structure per page.
* In other words, solve
* "available physical memory" - round_page(page_range *
* sizeof(struct vm_page)) = page_range * PAGE_SIZE
* for page_range.
*/
low_avail = phys_avail[0];
high_avail = phys_avail[1];
for (i = 0; i < vm_phys_nsegs; i++) {
if (vm_phys_segs[i].start < low_avail)
low_avail = vm_phys_segs[i].start;
if (vm_phys_segs[i].end > high_avail)
high_avail = vm_phys_segs[i].end;
}
/* Skip the first chunk. It is already accounted for. */
for (i = 2; phys_avail[i + 1] != 0; i += 2) {
if (phys_avail[i] < low_avail)
low_avail = phys_avail[i];
if (phys_avail[i + 1] > high_avail)
high_avail = phys_avail[i + 1];
}
first_page = low_avail / PAGE_SIZE;
#ifdef VM_PHYSSEG_SPARSE
size = 0;
for (i = 0; i < vm_phys_nsegs; i++)
size += vm_phys_segs[i].end - vm_phys_segs[i].start;
for (i = 0; phys_avail[i + 1] != 0; i += 2)
size += phys_avail[i + 1] - phys_avail[i];
#elif defined(VM_PHYSSEG_DENSE)
size = high_avail - low_avail;
#else
#error "Either VM_PHYSSEG_DENSE or VM_PHYSSEG_SPARSE must be defined."
#endif
#ifdef VM_PHYSSEG_DENSE
/*
* In the VM_PHYSSEG_DENSE case, the number of pages can account for
* the overhead of a page structure per page only if vm_page_array is
* allocated from the last physical memory chunk. Otherwise, we must
* allocate page structures representing the physical memory
* underlying vm_page_array, even though they will not be used.
*/
if (new_end != high_avail)
page_range = size / PAGE_SIZE;
else
#endif
{
page_range = size / (PAGE_SIZE + sizeof(struct vm_page));
/*
* If the partial bytes remaining are large enough for
* a page (PAGE_SIZE) without a corresponding
* 'struct vm_page', then new_end will contain an
* extra page after subtracting the length of the VM
* page array. Compensate by subtracting an extra
* page from new_end.
*/
if (size % (PAGE_SIZE + sizeof(struct vm_page)) >= PAGE_SIZE) {
if (new_end == high_avail)
high_avail -= PAGE_SIZE;
new_end -= PAGE_SIZE;
}
}
end = new_end;
/*
* Reserve an unmapped guard page to trap access to vm_page_array[-1].
* However, because this page is allocated from KVM, out-of-bounds
* accesses using the direct map will not be trapped.
*/
vaddr += PAGE_SIZE;
/*
* Allocate physical memory for the page structures, and map it.
*/
new_end = trunc_page(end - page_range * sizeof(struct vm_page));
mapped = pmap_map(&vaddr, new_end, end,
VM_PROT_READ | VM_PROT_WRITE);
vm_page_array = (vm_page_t)mapped;
vm_page_array_size = page_range;
#if VM_NRESERVLEVEL > 0
/*
* Allocate physical memory for the reservation management system's
* data structures, and map it.
*/
if (high_avail == end)
high_avail = new_end;
new_end = vm_reserv_startup(&vaddr, new_end, high_avail);
#endif
#if defined(__aarch64__) || defined(__amd64__) || defined(__mips__) || \
defined(__riscv)
/*
* Include vm_page_array and vm_reserv_array in a crash dump.
*/
for (pa = new_end; pa < end; pa += PAGE_SIZE)
dump_add_page(pa);
#endif
phys_avail[biggestone + 1] = new_end;
/*
* Add physical memory segments corresponding to the available
* physical pages.
*/
for (i = 0; phys_avail[i + 1] != 0; i += 2)
vm_phys_add_seg(phys_avail[i], phys_avail[i + 1]);
/*
* Initialize the physical memory allocator.
*/
vm_phys_init();
/*
* Initialize the page structures and add every available page to the
* physical memory allocator's free lists.
*/
#if defined(__i386__) && defined(VM_PHYSSEG_DENSE)
for (ii = 0; ii < vm_page_array_size; ii++) {
m = &vm_page_array[ii];
vm_page_init_page(m, (first_page + ii) << PAGE_SHIFT, 0);
m->flags = PG_FICTITIOUS;
}
#endif
vm_cnt.v_page_count = 0;
for (segind = 0; segind < vm_phys_nsegs; segind++) {
seg = &vm_phys_segs[segind];
for (m = seg->first_page, pa = seg->start; pa < seg->end;
m++, pa += PAGE_SIZE)
vm_page_init_page(m, pa, segind);
/*
* Add the segment to the free lists only if it is covered by
* one of the ranges in phys_avail. Because we've added the
* ranges to the vm_phys_segs array, we can assume that each
* segment is either entirely contained in one of the ranges,
* or doesn't overlap any of them.
*/
for (i = 0; phys_avail[i + 1] != 0; i += 2) {
struct vm_domain *vmd;
if (seg->start < phys_avail[i] ||
seg->end > phys_avail[i + 1])
continue;
m = seg->first_page;
pagecount = (u_long)atop(seg->end - seg->start);
vmd = VM_DOMAIN(seg->domain);
vm_domain_free_lock(vmd);
vm_phys_enqueue_contig(m, pagecount);
vm_domain_free_unlock(vmd);
vm_domain_freecnt_inc(vmd, pagecount);
vm_cnt.v_page_count += (u_int)pagecount;
vmd = VM_DOMAIN(seg->domain);
vmd->vmd_page_count += (u_int)pagecount;
vmd->vmd_segs |= 1UL << m->segind;
break;
}
}
/*
* Remove blacklisted pages from the physical memory allocator.
*/
TAILQ_INIT(&blacklist_head);
vm_page_blacklist_load(&list, &listend);
vm_page_blacklist_check(list, listend);
list = kern_getenv("vm.blacklist");
vm_page_blacklist_check(list, NULL);
freeenv(list);
#if VM_NRESERVLEVEL > 0
/*
* Initialize the reservation management system.
*/
vm_reserv_init();
#endif
return (vaddr);
}
void
vm_page_reference(vm_page_t m)
{
vm_page_aflag_set(m, PGA_REFERENCED);
}
/*
* vm_page_busy_downgrade:
*
* Downgrade an exclusive busy page into a single shared busy page.
*/
void
vm_page_busy_downgrade(vm_page_t m)
{
u_int x;
bool locked;
vm_page_assert_xbusied(m);
locked = mtx_owned(vm_page_lockptr(m));
for (;;) {
x = m->busy_lock;
x &= VPB_BIT_WAITERS;
if (x != 0 && !locked)
vm_page_lock(m);
if (atomic_cmpset_rel_int(&m->busy_lock,
VPB_SINGLE_EXCLUSIVER | x, VPB_SHARERS_WORD(1)))
break;
if (x != 0 && !locked)
vm_page_unlock(m);
}
if (x != 0) {
wakeup(m);
if (!locked)
vm_page_unlock(m);
}
}
/*
* vm_page_sbusied:
*
* Return a positive value if the page is shared busied, 0 otherwise.
*/
int
vm_page_sbusied(vm_page_t m)
{
u_int x;
x = m->busy_lock;
return ((x & VPB_BIT_SHARED) != 0 && x != VPB_UNBUSIED);
}
/*
* vm_page_sunbusy:
*
* Shared unbusy a page.
*/
void
vm_page_sunbusy(vm_page_t m)
{
u_int x;
vm_page_lock_assert(m, MA_NOTOWNED);
vm_page_assert_sbusied(m);
for (;;) {
x = m->busy_lock;
if (VPB_SHARERS(x) > 1) {
if (atomic_cmpset_int(&m->busy_lock, x,
x - VPB_ONE_SHARER))
break;
continue;
}
if ((x & VPB_BIT_WAITERS) == 0) {
KASSERT(x == VPB_SHARERS_WORD(1),
("vm_page_sunbusy: invalid lock state"));
if (atomic_cmpset_int(&m->busy_lock,
VPB_SHARERS_WORD(1), VPB_UNBUSIED))
break;
continue;
}
KASSERT(x == (VPB_SHARERS_WORD(1) | VPB_BIT_WAITERS),
("vm_page_sunbusy: invalid lock state for waiters"));
vm_page_lock(m);
if (!atomic_cmpset_int(&m->busy_lock, x, VPB_UNBUSIED)) {
vm_page_unlock(m);
continue;
}
wakeup(m);
vm_page_unlock(m);
break;
}
}
/*
* vm_page_busy_sleep:
*
* Sleep and release the page lock, using the page pointer as wchan.
* This is used to implement the hard path of the busying mechanism.
*
* The given page must be locked.
*
* If nonshared is true, sleep only if the page is xbusy.
*/
void
vm_page_busy_sleep(vm_page_t m, const char *wmesg, bool nonshared)
{
u_int x;
vm_page_assert_locked(m);
x = m->busy_lock;
if (x == VPB_UNBUSIED || (nonshared && (x & VPB_BIT_SHARED) != 0) ||
((x & VPB_BIT_WAITERS) == 0 &&
!atomic_cmpset_int(&m->busy_lock, x, x | VPB_BIT_WAITERS))) {
vm_page_unlock(m);
return;
}
msleep(m, vm_page_lockptr(m), PVM | PDROP, wmesg, 0);
}
/*
* vm_page_trysbusy:
*
* Try to shared busy a page.
* If the operation succeeds 1 is returned otherwise 0.
* The operation never sleeps.
*/
int
vm_page_trysbusy(vm_page_t m)
{
u_int x;
for (;;) {
x = m->busy_lock;
if ((x & VPB_BIT_SHARED) == 0)
return (0);
if (atomic_cmpset_acq_int(&m->busy_lock, x, x + VPB_ONE_SHARER))
return (1);
}
}
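/*
 * Illustrative sketch, not part of this change: a user-space model of the
 * shared-busy fast path above, written with C11 atomics.  The BUSY_* bit
 * layout, struct demo_page, demo_trysbusy() and demo_sunbusy() are
 * hypothetical names modeled on the VPB_* macros; the real definitions
 * may differ.  The release side deliberately omits the waiters and
 * wakeup handling that vm_page_sunbusy() performs.
 */

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Assumed bit layout, modeled on (not copied from) the VPB_* macros. */
#define	BUSY_BIT_SHARED		0x01u
#define	BUSY_BIT_EXCLUSIVE	0x02u
#define	BUSY_SHARERS_SHIFT	3
#define	BUSY_ONE_SHARER		(1u << BUSY_SHARERS_SHIFT)
#define	BUSY_SHARERS(x)		((x) >> BUSY_SHARERS_SHIFT)
#define	BUSY_SHARERS_WORD(n)	(((n) << BUSY_SHARERS_SHIFT) | BUSY_BIT_SHARED)
#define	BUSY_UNBUSIED		BUSY_SHARERS_WORD(0u)

/* Hypothetical stand-in for the busy_lock field of struct vm_page. */
struct demo_page {
	atomic_uint busy_lock;
};

/*
 * Try to add one sharer.  Fails without sleeping if the shared bit is
 * clear, which in this model means the page is exclusively busied.
 */
bool
demo_trysbusy(struct demo_page *p)
{
	unsigned int x;

	x = atomic_load_explicit(&p->busy_lock, memory_order_relaxed);
	for (;;) {
		if ((x & BUSY_BIT_SHARED) == 0)
			return (false);
		/* On failure, x is reloaded with the current value. */
		if (atomic_compare_exchange_weak_explicit(&p->busy_lock, &x,
		    x + BUSY_ONE_SHARER, memory_order_acquire,
		    memory_order_relaxed))
			return (true);
	}
}

/* Drop one sharer; assumes no waiters are recorded in the lock word. */
void
demo_sunbusy(struct demo_page *p)
{
	unsigned int x;

	x = atomic_load_explicit(&p->busy_lock, memory_order_relaxed);
	assert((x & BUSY_BIT_SHARED) != 0 && BUSY_SHARERS(x) >= 1);
	atomic_fetch_sub_explicit(&p->busy_lock, BUSY_ONE_SHARER,
	    memory_order_release);
}
```

/*
 * In this model, two successful demo_trysbusy() calls followed by two
 * demo_sunbusy() calls return the lock word to BUSY_UNBUSIED, and
 * demo_trysbusy() fails while the word holds only BUSY_BIT_EXCLUSIVE.
 */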
static void
vm_page_xunbusy_locked(vm_page_t m)
{
vm_page_assert_xbusied(m);
vm_page_assert_locked(m);
atomic_store_rel_int(&m->busy_lock, VPB_UNBUSIED);
/* There is a waiter, do wakeup() instead of vm_page_flash(). */
wakeup(m);
}
void
vm_page_xunbusy_maybelocked(vm_page_t m)
{
bool lockacq;
vm_page_assert_xbusied(m);
/*
* Fast path for unbusy. If it succeeds, we know that there
* are no waiters, so we do not need a wakeup.
*/
if (atomic_cmpset_rel_int(&m->busy_lock, VPB_SINGLE_EXCLUSIVER,
VPB_UNBUSIED))
return;
lockacq = !mtx_owned(vm_page_lockptr(m));
if (lockacq)
vm_page_lock(m);
vm_page_xunbusy_locked(m);
if (lockacq)
vm_page_unlock(m);
}
/*
* vm_page_xunbusy_hard:
*
* Called after the first attempt to exclusively unbusy a page has failed.
* It is assumed that the waiters bit is on.
*/
void
vm_page_xunbusy_hard(vm_page_t m)
{
vm_page_assert_xbusied(m);
vm_page_lock(m);
vm_page_xunbusy_locked(m);
vm_page_unlock(m);
}
/*
* vm_page_flash:
*
* Wakeup anyone waiting for the page.
* The ownership bits do not change.
*
* The given page must be locked.
*/
void
vm_page_flash(vm_page_t m)
{
u_int x;
vm_page_lock_assert(m, MA_OWNED);
for (;;) {
x = m->busy_lock;
if ((x & VPB_BIT_WAITERS) == 0)
return;
if (atomic_cmpset_int(&m->busy_lock, x,
x & (~VPB_BIT_WAITERS)))
break;
}
wakeup(m);
}
/*
* Avoid releasing and reacquiring the same page lock.
*/
void
vm_page_change_lock(vm_page_t m, struct mtx **mtx)
{
struct mtx *mtx1;
mtx1 = vm_page_lockptr(m);
if (*mtx == mtx1)
return;
if (*mtx != NULL)
mtx_unlock(*mtx);
*mtx = mtx1;
mtx_lock(mtx1);
}
/*
* Keep the page from being freed by the page daemon. This has much the
* same effect as wiring, but with much lower overhead, and should be
* used only for *very* temporary holding ("wiring").
*/
void
vm_page_hold(vm_page_t mem)
{
vm_page_lock_assert(mem, MA_OWNED);
mem->hold_count++;
}
void
vm_page_unhold(vm_page_t mem)
{
vm_page_lock_assert(mem, MA_OWNED);
KASSERT(mem->hold_count >= 1, ("vm_page_unhold: hold count < 0!!!"));
--mem->hold_count;
if (mem->hold_count == 0 && (mem->flags & PG_UNHOLDFREE) != 0)
vm_page_free_toq(mem);
}
/*
* vm_page_unhold_pages:
*
* Unhold each of the pages that is referenced by the given array.
*/
void
vm_page_unhold_pages(vm_page_t *ma, int count)
{
struct mtx *mtx;
mtx = NULL;
for (; count != 0; count--) {
vm_page_change_lock(*ma, &mtx);
vm_page_unhold(*ma);
ma++;
}
if (mtx != NULL)
mtx_unlock(mtx);
}
vm_page_t
PHYS_TO_VM_PAGE(vm_paddr_t pa)
{
vm_page_t m;
#ifdef VM_PHYSSEG_SPARSE
m = vm_phys_paddr_to_vm_page(pa);
if (m == NULL)
m = vm_phys_fictitious_to_vm_page(pa);
return (m);
#elif defined(VM_PHYSSEG_DENSE)
long pi;
pi = atop(pa);
if (pi >= first_page && (pi - first_page) < vm_page_array_size) {
m = &vm_page_array[pi - first_page];
return (m);
}
return (vm_phys_fictitious_to_vm_page(pa));
#else
#error "Either VM_PHYSSEG_DENSE or VM_PHYSSEG_SPARSE must be defined."
#endif
}
/*
* vm_page_getfake:
*
* Create a fictitious page with the specified physical address and
* memory attribute. The memory attribute is the only
* machine-dependent aspect of a fictitious page that must be initialized.
*/
vm_page_t
vm_page_getfake(vm_paddr_t paddr, vm_memattr_t memattr)
{
vm_page_t m;
m = uma_zalloc(fakepg_zone, M_WAITOK | M_ZERO);
vm_page_initfake(m, paddr, memattr);
return (m);
}
void
vm_page_initfake(vm_page_t m, vm_paddr_t paddr, vm_memattr_t memattr)
{
if ((m->flags & PG_FICTITIOUS) != 0) {
/*
* The page's memattr might have changed since the
* previous initialization. Update the pmap to the
* new memattr.
*/
goto memattr;
}
m->phys_addr = paddr;
m->queue = PQ_NONE;
/* Fictitious pages don't use "segind". */
m->flags = PG_FICTITIOUS;
/* Fictitious pages don't use "order" or "pool". */
m->oflags = VPO_UNMANAGED;
m->busy_lock = VPB_SINGLE_EXCLUSIVER;
m->wire_count = 1;
pmap_page_init(m);
memattr:
pmap_page_set_memattr(m, memattr);
}
/*
* vm_page_putfake:
*
* Release a fictitious page.
*/
void
vm_page_putfake(vm_page_t m)
{
KASSERT((m->oflags & VPO_UNMANAGED) != 0, ("managed %p", m));
KASSERT((m->flags & PG_FICTITIOUS) != 0,
("vm_page_putfake: bad page %p", m));
uma_zfree(fakepg_zone, m);
}
/*
* vm_page_updatefake:
*
* Update the given fictitious page to the specified physical address and
* memory attribute.
*/
void
vm_page_updatefake(vm_page_t m, vm_paddr_t paddr, vm_memattr_t memattr)
{
KASSERT((m->flags & PG_FICTITIOUS) != 0,
("vm_page_updatefake: bad page %p", m));
m->phys_addr = paddr;
pmap_page_set_memattr(m, memattr);
}
/*
* vm_page_free:
*
* Free a page.
*/
void
vm_page_free(vm_page_t m)
{
m->flags &= ~PG_ZERO;
vm_page_free_toq(m);
}
/*
* vm_page_free_zero:
*
* Free a page to the zeroed-pages queue.
*/
void
vm_page_free_zero(vm_page_t m)
{
m->flags |= PG_ZERO;
vm_page_free_toq(m);
}
/*
* Unbusy and handle the page queueing for a page from a getpages request that
* was optionally read ahead or behind.
*/
void
vm_page_readahead_finish(vm_page_t m)
{
/* We shouldn't put invalid pages on queues. */
KASSERT(m->valid != 0, ("%s: %p is invalid", __func__, m));
/*
* Since the page is not the actually needed one, whether it should
* be activated or deactivated is not obvious. Empirical results
* have shown that deactivating the page is usually the best choice,
* unless the page is wanted by another thread.
*/
vm_page_lock(m);
if ((m->busy_lock & VPB_BIT_WAITERS) != 0)
vm_page_activate(m);
else
vm_page_deactivate(m);
vm_page_unlock(m);
vm_page_xunbusy(m);
}
/*
* vm_page_sleep_if_busy:
*
* Sleep and release the page queues lock if the page is busied.
* Returns TRUE if the thread slept.
*
* The given page must be unlocked and the object containing it must
* be locked.
*/
int
vm_page_sleep_if_busy(vm_page_t m, const char *msg)
{
vm_object_t obj;
vm_page_lock_assert(m, MA_NOTOWNED);
VM_OBJECT_ASSERT_WLOCKED(m->object);
if (vm_page_busied(m)) {
/*
* The page-specific object must be cached because page
* identity can change during the sleep, causing the
* re-lock of a different object.
* It is assumed that a reference to the object is already
* held by the callers.
*/
obj = m->object;
vm_page_lock(m);
VM_OBJECT_WUNLOCK(obj);
vm_page_busy_sleep(m, msg, false);
VM_OBJECT_WLOCK(obj);
return (TRUE);
}
return (FALSE);
}
/*
* vm_page_dirty_KBI: [ internal use only ]
*
* Set all bits in the page's dirty field.
*
* The object containing the specified page must be locked if the
* call is made from the machine-independent layer.
*
* See vm_page_clear_dirty_mask().
*
* This function should only be called by vm_page_dirty().
*/
void
vm_page_dirty_KBI(vm_page_t m)
{
/* Refer to this operation by its public name. */
KASSERT(m->valid == VM_PAGE_BITS_ALL,
("vm_page_dirty: page is invalid!"));
m->dirty = VM_PAGE_BITS_ALL;
}
/*
* vm_page_insert: [ internal use only ]
*
* Inserts the given mem entry into the object and object list.
*
* The object must be locked.
*/
int
vm_page_insert(vm_page_t m, vm_object_t object, vm_pindex_t pindex)
{
vm_page_t mpred;
VM_OBJECT_ASSERT_WLOCKED(object);
mpred = vm_radix_lookup_le(&object->rtree, pindex);
return (vm_page_insert_after(m, object, pindex, mpred));
}
/*
* vm_page_insert_after:
*
* Inserts the page "m" into the specified object at offset "pindex".
*
* The page "mpred" must immediately precede the offset "pindex" within
* the specified object.
*
* The object must be locked.
*/
static int
vm_page_insert_after(vm_page_t m, vm_object_t object, vm_pindex_t pindex,
vm_page_t mpred)
{
vm_page_t msucc;
VM_OBJECT_ASSERT_WLOCKED(object);
KASSERT(m->object == NULL,
("vm_page_insert_after: page already inserted"));
if (mpred != NULL) {
KASSERT(mpred->object == object,
("vm_page_insert_after: object doesn't contain mpred"));
KASSERT(mpred->pindex < pindex,
("vm_page_insert_after: mpred doesn't precede pindex"));
msucc = TAILQ_NEXT(mpred, listq);
} else
msucc = TAILQ_FIRST(&object->memq);
if (msucc != NULL)
KASSERT(msucc->pindex > pindex,
("vm_page_insert_after: msucc doesn't succeed pindex"));
/*
* Record the object/offset pair in this page
*/
m->object = object;
m->pindex = pindex;
/*
* Now link into the object's ordered list of backed pages.
*/
if (vm_radix_insert(&object->rtree, m)) {
m->object = NULL;
m->pindex = 0;
return (1);
}
vm_page_insert_radixdone(m, object, mpred);
return (0);
}
/*
* vm_page_insert_radixdone:
*
* Complete page "m" insertion into the specified object after the
* radix trie hooking.
*
* The page "mpred" must precede the offset "m->pindex" within the
* specified object.
*
* The object must be locked.
*/
static void
vm_page_insert_radixdone(vm_page_t m, vm_object_t object, vm_page_t mpred)
{
VM_OBJECT_ASSERT_WLOCKED(object);
KASSERT(object != NULL && m->object == object,
("vm_page_insert_radixdone: page %p has inconsistent object", m));
if (mpred != NULL) {
KASSERT(mpred->object == object,
("vm_page_insert_after: object doesn't contain mpred"));
KASSERT(mpred->pindex < m->pindex,
("vm_page_insert_after: mpred doesn't precede pindex"));
}
if (mpred != NULL)
TAILQ_INSERT_AFTER(&object->memq, mpred, m, listq);
else
TAILQ_INSERT_HEAD(&object->memq, m, listq);
/*
* Show that the object has one more resident page.
*/
object->resident_page_count++;
/*
* Hold the vnode until the last page is released.
*/
if (object->resident_page_count == 1 && object->type == OBJT_VNODE)
vhold(object->handle);
/*
* Since we are inserting a new and possibly dirty page,
* update the object's OBJ_MIGHTBEDIRTY flag.
*/
if (pmap_page_is_write_mapped(m))
vm_object_set_writeable_dirty(object);
}
/*
* vm_page_remove:
*
* Removes the specified page from its containing object, but does not
- * invalidate any backing storage.
+ * invalidate any backing storage. Return true if the page may be safely
+ * freed and false otherwise.
*
* The object must be locked. The page must be locked if it is managed.
*/
-void
+bool
vm_page_remove(vm_page_t m)
{
vm_object_t object;
vm_page_t mrem;
+ object = m->object;
+
if ((m->oflags & VPO_UNMANAGED) == 0)
vm_page_assert_locked(m);
- if ((object = m->object) == NULL)
- return;
VM_OBJECT_ASSERT_WLOCKED(object);
if (vm_page_xbusied(m))
vm_page_xunbusy_maybelocked(m);
mrem = vm_radix_remove(&object->rtree, m->pindex);
KASSERT(mrem == m, ("removed page %p, expected page %p", mrem, m));
/*
* Now remove from the object's list of backed pages.
*/
TAILQ_REMOVE(&object->memq, m, listq);
/*
* And show that the object has one fewer resident page.
*/
object->resident_page_count--;
/*
* The vnode may now be recycled.
*/
if (object->resident_page_count == 0 && object->type == OBJT_VNODE)
vdrop(object->handle);
m->object = NULL;
+ return (!vm_page_wired(m));
}
/*
* vm_page_lookup:
*
* Returns the page associated with the object/offset
* pair specified; if none is found, NULL is returned.
*
* The object must be locked.
*/
vm_page_t
vm_page_lookup(vm_object_t object, vm_pindex_t pindex)
{
VM_OBJECT_ASSERT_LOCKED(object);
return (vm_radix_lookup(&object->rtree, pindex));
}
/*
* vm_page_find_least:
*
* Returns the page associated with the object with least pindex
* greater than or equal to the parameter pindex, or NULL.
*
* The object must be locked.
*/
vm_page_t
vm_page_find_least(vm_object_t object, vm_pindex_t pindex)
{
vm_page_t m;
VM_OBJECT_ASSERT_LOCKED(object);
if ((m = TAILQ_FIRST(&object->memq)) != NULL && m->pindex < pindex)
m = vm_radix_lookup_ge(&object->rtree, pindex);
return (m);
}
/*
* Returns the given page's successor (by pindex) within the object if it is
* resident; if none is found, NULL is returned.
*
* The object must be locked.
*/
vm_page_t
vm_page_next(vm_page_t m)
{
vm_page_t next;
VM_OBJECT_ASSERT_LOCKED(m->object);
if ((next = TAILQ_NEXT(m, listq)) != NULL) {
MPASS(next->object == m->object);
if (next->pindex != m->pindex + 1)
next = NULL;
}
return (next);
}
/*
* Returns the given page's predecessor (by pindex) within the object if it is
* resident; if none is found, NULL is returned.
*
* The object must be locked.
*/
vm_page_t
vm_page_prev(vm_page_t m)
{
vm_page_t prev;
VM_OBJECT_ASSERT_LOCKED(m->object);
if ((prev = TAILQ_PREV(m, pglist, listq)) != NULL) {
MPASS(prev->object == m->object);
if (prev->pindex != m->pindex - 1)
prev = NULL;
}
return (prev);
}
/*
* Uses the page mnew as a replacement for an existing page at index
* pindex which must be already present in the object.
*
* The existing page must not be on a paging queue.
*/
vm_page_t
vm_page_replace(vm_page_t mnew, vm_object_t object, vm_pindex_t pindex)
{
vm_page_t mold;
VM_OBJECT_ASSERT_WLOCKED(object);
KASSERT(mnew->object == NULL,
("vm_page_replace: page %p already in object", mnew));
KASSERT(mnew->queue == PQ_NONE,
("vm_page_replace: new page %p is on a paging queue", mnew));
/*
* This function mostly follows vm_page_insert() and
* vm_page_remove() without the radix, object count and vnode
* dance. Double check such functions for more comments.
*/
mnew->object = object;
mnew->pindex = pindex;
mold = vm_radix_replace(&object->rtree, mnew);
KASSERT(mold->queue == PQ_NONE,
("vm_page_replace: old page %p is on a paging queue", mold));
/* Keep the resident page list in sorted order. */
TAILQ_INSERT_AFTER(&object->memq, mold, mnew, listq);
TAILQ_REMOVE(&object->memq, mold, listq);
mold->object = NULL;
vm_page_xunbusy_maybelocked(mold);
/*
* The object's resident_page_count does not change because we have
* swapped one page for another, but OBJ_MIGHTBEDIRTY may need to be set.
*/
if (pmap_page_is_write_mapped(mnew))
vm_object_set_writeable_dirty(object);
return (mold);
}
/*
* vm_page_rename:
*
* Move the given memory entry from its
* current object to the specified target object/offset.
*
* Note: swap associated with the page must be invalidated by the move. We
* have to do this for several reasons: (1) we aren't freeing the
* page, (2) we are dirtying the page, (3) the VM system is probably
* moving the page from object A to B, and will then later move
* the backing store from A to B and we can't have a conflict.
*
* Note: we *always* dirty the page. It is necessary both for the
* fact that we moved it, and because we may be invalidating
* swap.
*
* The objects must be locked.
*/
int
vm_page_rename(vm_page_t m, vm_object_t new_object, vm_pindex_t new_pindex)
{
vm_page_t mpred;
vm_pindex_t opidx;
VM_OBJECT_ASSERT_WLOCKED(new_object);
mpred = vm_radix_lookup_le(&new_object->rtree, new_pindex);
KASSERT(mpred == NULL || mpred->pindex != new_pindex,
("vm_page_rename: pindex already renamed"));
/*
* Create a custom version of vm_page_insert() which does not depend
* on m_prev and can cheat on the implementation aspects of the
* function.
*/
opidx = m->pindex;
m->pindex = new_pindex;
if (vm_radix_insert(&new_object->rtree, m)) {
m->pindex = opidx;
return (1);
}
/*
* The operation cannot fail anymore. The removal must happen before
* the listq iterator is tainted.
*/
m->pindex = opidx;
vm_page_lock(m);
- vm_page_remove(m);
+ (void)vm_page_remove(m);
/* Return back to the new pindex to complete vm_page_insert(). */
m->pindex = new_pindex;
m->object = new_object;
vm_page_unlock(m);
vm_page_insert_radixdone(m, new_object, mpred);
vm_page_dirty(m);
return (0);
}
/*
* vm_page_alloc:
*
* Allocate and return a page that is associated with the specified
* object and offset pair. By default, this page is exclusive busied.
*
* The caller must always specify an allocation class.
*
* allocation classes:
* VM_ALLOC_NORMAL normal process request
* VM_ALLOC_SYSTEM system *really* needs a page
* VM_ALLOC_INTERRUPT interrupt time request
*
* optional allocation flags:
* VM_ALLOC_COUNT(number) the number of additional pages that the caller
* intends to allocate
* VM_ALLOC_NOBUSY do not exclusive busy the page
* VM_ALLOC_NODUMP do not include the page in a kernel core dump
* VM_ALLOC_NOOBJ page is not associated with an object and
* should not be exclusive busy
* VM_ALLOC_SBUSY shared busy the allocated page
* VM_ALLOC_WIRED wire the allocated page
* VM_ALLOC_ZERO prefer a zeroed page
*/
vm_page_t
vm_page_alloc(vm_object_t object, vm_pindex_t pindex, int req)
{
return (vm_page_alloc_after(object, pindex, req, object != NULL ?
vm_radix_lookup_le(&object->rtree, pindex) : NULL));
}
vm_page_t
vm_page_alloc_domain(vm_object_t object, vm_pindex_t pindex, int domain,
int req)
{
return (vm_page_alloc_domain_after(object, pindex, domain, req,
object != NULL ? vm_radix_lookup_le(&object->rtree, pindex) :
NULL));
}
/*
* Allocate a page in the specified object with the given page index. To
* optimize insertion of the page into the object, the caller must also specify
* the resident page in the object with the largest index smaller than the given
* page index, or NULL if no such page exists.
*/
vm_page_t
vm_page_alloc_after(vm_object_t object, vm_pindex_t pindex,
int req, vm_page_t mpred)
{
struct vm_domainset_iter di;
vm_page_t m;
int domain;
vm_domainset_iter_page_init(&di, object, pindex, &domain, &req);
do {
m = vm_page_alloc_domain_after(object, pindex, domain, req,
mpred);
if (m != NULL)
break;
} while (vm_domainset_iter_page(&di, object, &domain) == 0);
return (m);
}
/*
* Returns true if the number of free pages exceeds the minimum
* for the request class and false otherwise.
*/
int
vm_domain_allocate(struct vm_domain *vmd, int req, int npages)
{
u_int limit, old, new;
req = req & VM_ALLOC_CLASS_MASK;
/*
* The page daemon is allowed to dig deeper into the free page list.
*/
if (curproc == pageproc && req != VM_ALLOC_INTERRUPT)
req = VM_ALLOC_SYSTEM;
if (req == VM_ALLOC_INTERRUPT)
limit = 0;
else if (req == VM_ALLOC_SYSTEM)
limit = vmd->vmd_interrupt_free_min;
else
limit = vmd->vmd_free_reserved;
/*
* Attempt to reserve the pages. Fail if we're below the limit.
*/
limit += npages;
old = vmd->vmd_free_count;
do {
if (old < limit)
return (0);
new = old - npages;
} while (atomic_fcmpset_int(&vmd->vmd_free_count, &old, new) == 0);
/* Wake the page daemon if we've crossed the threshold. */
if (vm_paging_needed(vmd, new) && !vm_paging_needed(vmd, old))
pagedaemon_wakeup(vmd->vmd_domain);
/* Only update bitsets on transitions. */
if ((old >= vmd->vmd_free_min && new < vmd->vmd_free_min) ||
(old >= vmd->vmd_free_severe && new < vmd->vmd_free_severe))
vm_domain_set(vmd);
return (1);
}
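/*
 * Illustrative sketch, not part of this change: a user-space model of the
 * reserve-or-fail loop in vm_domain_allocate(), written with C11 atomics.
 * struct demo_domain, enum demo_class and demo_domain_allocate() are
 * hypothetical names.  The sketch omits the page daemon's class promotion,
 * the page daemon wakeup, and the bitset updates.
 */

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the fields of struct vm_domain used above. */
struct demo_domain {
	atomic_uint	free_count;	/* pages currently free */
	unsigned int	reserved;	/* floor for normal requests */
	unsigned int	interrupt_min;	/* floor for system requests */
};

enum demo_class { DEMO_NORMAL, DEMO_SYSTEM, DEMO_INTERRUPT };

/*
 * Atomically subtract npages from the free count, but only if the
 * count stays at or above the floor for the request class.
 */
bool
demo_domain_allocate(struct demo_domain *d, enum demo_class cls,
    unsigned int npages)
{
	unsigned int limit, old;

	switch (cls) {
	case DEMO_INTERRUPT:
		limit = 0;
		break;
	case DEMO_SYSTEM:
		limit = d->interrupt_min;
		break;
	default:
		limit = d->reserved;
		break;
	}
	limit += npages;

	old = atomic_load(&d->free_count);
	do {
		if (old < limit)
			return (false);
		/* On failure, old is refreshed and the check is repeated. */
	} while (!atomic_compare_exchange_weak(&d->free_count, &old,
	    old - npages));
	return (true);
}
```

/*
 * Because the limit check and the subtraction are retried together, two
 * racing callers cannot both succeed when only one reservation fits
 * above the floor.
 */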
vm_page_t
vm_page_alloc_domain_after(vm_object_t object, vm_pindex_t pindex, int domain,
int req, vm_page_t mpred)
{
struct vm_domain *vmd;
vm_page_t m;
int flags;
KASSERT((object != NULL) == ((req & VM_ALLOC_NOOBJ) == 0) &&
(object != NULL || (req & VM_ALLOC_SBUSY) == 0) &&
((req & (VM_ALLOC_NOBUSY | VM_ALLOC_SBUSY)) !=
(VM_ALLOC_NOBUSY | VM_ALLOC_SBUSY)),
("inconsistent object(%p)/req(%x)", object, req));
KASSERT(object == NULL || (req & VM_ALLOC_WAITOK) == 0,
("Can't sleep and retry object insertion."));
KASSERT(mpred == NULL || mpred->pindex < pindex,
("mpred %p doesn't precede pindex 0x%jx", mpred,
(uintmax_t)pindex));
if (object != NULL)
VM_OBJECT_ASSERT_WLOCKED(object);
again:
m = NULL;
#if VM_NRESERVLEVEL > 0
/*
* Can we allocate the page from a reservation?
*/
if (vm_object_reserv(object) &&
(m = vm_reserv_alloc_page(object, pindex, domain, req, mpred)) !=
NULL) {
domain = vm_phys_domain(m);
vmd = VM_DOMAIN(domain);
goto found;
}
#endif
vmd = VM_DOMAIN(domain);
if (object != NULL && vmd->vmd_pgcache != NULL) {
m = uma_zalloc(vmd->vmd_pgcache, M_NOWAIT);
if (m != NULL)
goto found;
}
if (vm_domain_allocate(vmd, req, 1)) {
/*
* If not, allocate it from the free page queues.
*/
vm_domain_free_lock(vmd);
m = vm_phys_alloc_pages(domain, object != NULL ?
VM_FREEPOOL_DEFAULT : VM_FREEPOOL_DIRECT, 0);
vm_domain_free_unlock(vmd);
if (m == NULL) {
vm_domain_freecnt_inc(vmd, 1);
#if VM_NRESERVLEVEL > 0
if (vm_reserv_reclaim_inactive(domain))
goto again;
#endif
}
}
if (m == NULL) {
/*
* Not allocatable, give up.
*/
if (vm_domain_alloc_fail(vmd, object, req))
goto again;
return (NULL);
}
/*
* At this point we had better have found a good page.
*/
KASSERT(m != NULL, ("missing page"));
found:
vm_page_dequeue(m);
vm_page_alloc_check(m);
/*
* Initialize the page. Only the PG_ZERO flag is inherited.
*/
flags = 0;
if ((req & VM_ALLOC_ZERO) != 0)
flags = PG_ZERO;
flags &= m->flags;
if ((req & VM_ALLOC_NODUMP) != 0)
flags |= PG_NODUMP;
m->flags = flags;
m->aflags = 0;
m->oflags = object == NULL || (object->flags & OBJ_UNMANAGED) != 0 ?
VPO_UNMANAGED : 0;
m->busy_lock = VPB_UNBUSIED;
if ((req & (VM_ALLOC_NOBUSY | VM_ALLOC_NOOBJ | VM_ALLOC_SBUSY)) == 0)
m->busy_lock = VPB_SINGLE_EXCLUSIVER;
if ((req & VM_ALLOC_SBUSY) != 0)
m->busy_lock = VPB_SHARERS_WORD(1);
if (req & VM_ALLOC_WIRED) {
/*
* The page lock is not required for wiring a page until that
* page is inserted into the object.
*/
vm_wire_add(1);
m->wire_count = 1;
}
m->act_count = 0;
if (object != NULL) {
if (vm_page_insert_after(m, object, pindex, mpred)) {
if (req & VM_ALLOC_WIRED) {
vm_wire_sub(1);
m->wire_count = 0;
}
KASSERT(m->object == NULL, ("page %p has object", m));
m->oflags = VPO_UNMANAGED;
m->busy_lock = VPB_UNBUSIED;
/* Don't change PG_ZERO. */
vm_page_free_toq(m);
if (req & VM_ALLOC_WAITFAIL) {
VM_OBJECT_WUNLOCK(object);
vm_radix_wait();
VM_OBJECT_WLOCK(object);
}
return (NULL);
}
/* Ignore device objects; the pager sets "memattr" for them. */
if (object->memattr != VM_MEMATTR_DEFAULT &&
(object->flags & OBJ_FICTITIOUS) == 0)
pmap_page_set_memattr(m, object->memattr);
} else
m->pindex = pindex;
return (m);
}
/*
* vm_page_alloc_contig:
*
* Allocate a contiguous set of physical pages of the given size "npages"
* from the free lists. All of the physical pages must be at or above
* the given physical address "low" and below the given physical address
* "high". The given value "alignment" determines the alignment of the
* first physical page in the set. If the given value "boundary" is
* non-zero, then the set of physical pages cannot cross any physical
* address boundary that is a multiple of that value. Both "alignment"
* and "boundary" must be a power of two.
*
* If the specified memory attribute, "memattr", is VM_MEMATTR_DEFAULT,
* then the memory attribute setting for the physical pages is configured
* to the object's memory attribute setting. Otherwise, the memory
* attribute setting for the physical pages is configured to "memattr",
* overriding the object's memory attribute setting. However, if the
* object's memory attribute setting is not VM_MEMATTR_DEFAULT, then the
* memory attribute setting for the physical pages cannot be configured
* to VM_MEMATTR_DEFAULT.
*
* The specified object may not contain fictitious pages.
*
* The caller must always specify an allocation class.
*
* allocation classes:
* VM_ALLOC_NORMAL normal process request
* VM_ALLOC_SYSTEM system *really* needs a page
* VM_ALLOC_INTERRUPT interrupt time request
*
* optional allocation flags:
* VM_ALLOC_NOBUSY do not exclusive busy the page
* VM_ALLOC_NODUMP do not include the page in a kernel core dump
* VM_ALLOC_NOOBJ page is not associated with an object and
* should not be exclusive busy
* VM_ALLOC_SBUSY shared busy the allocated page
* VM_ALLOC_WIRED wire the allocated page
* VM_ALLOC_ZERO prefer a zeroed page
*/
vm_page_t
vm_page_alloc_contig(vm_object_t object, vm_pindex_t pindex, int req,
u_long npages, vm_paddr_t low, vm_paddr_t high, u_long alignment,
vm_paddr_t boundary, vm_memattr_t memattr)
{
struct vm_domainset_iter di;
vm_page_t m;
int domain;
vm_domainset_iter_page_init(&di, object, pindex, &domain, &req);
do {
m = vm_page_alloc_contig_domain(object, pindex, domain, req,
npages, low, high, alignment, boundary, memattr);
if (m != NULL)
break;
} while (vm_domainset_iter_page(&di, object, &domain) == 0);
return (m);
}
vm_page_t
vm_page_alloc_contig_domain(vm_object_t object, vm_pindex_t pindex, int domain,
int req, u_long npages, vm_paddr_t low, vm_paddr_t high, u_long alignment,
vm_paddr_t boundary, vm_memattr_t memattr)
{
struct vm_domain *vmd;
vm_page_t m, m_ret, mpred;
u_int busy_lock, flags, oflags;
mpred = NULL; /* XXX: pacify gcc */
KASSERT((object != NULL) == ((req & VM_ALLOC_NOOBJ) == 0) &&
(object != NULL || (req & VM_ALLOC_SBUSY) == 0) &&
((req & (VM_ALLOC_NOBUSY | VM_ALLOC_SBUSY)) !=
(VM_ALLOC_NOBUSY | VM_ALLOC_SBUSY)),
("vm_page_alloc_contig: inconsistent object(%p)/req(%x)", object,
req));
KASSERT(object == NULL || (req & VM_ALLOC_WAITOK) == 0,
("Can't sleep and retry object insertion."));
if (object != NULL) {
VM_OBJECT_ASSERT_WLOCKED(object);
KASSERT((object->flags & OBJ_FICTITIOUS) == 0,
("vm_page_alloc_contig: object %p has fictitious pages",
object));
}
KASSERT(npages > 0, ("vm_page_alloc_contig: npages is zero"));
if (object != NULL) {
mpred = vm_radix_lookup_le(&object->rtree, pindex);
KASSERT(mpred == NULL || mpred->pindex != pindex,
("vm_page_alloc_contig: pindex already allocated"));
}
/*
* Can we allocate the pages without the number of free pages falling
* below the lower bound for the allocation class?
*/
again:
#if VM_NRESERVLEVEL > 0
/*
* Can we allocate the pages from a reservation?
*/
if (vm_object_reserv(object) &&
(m_ret = vm_reserv_alloc_contig(object, pindex, domain, req,
mpred, npages, low, high, alignment, boundary)) != NULL) {
domain = vm_phys_domain(m_ret);
vmd = VM_DOMAIN(domain);
goto found;
}
#endif
m_ret = NULL;
vmd = VM_DOMAIN(domain);
if (vm_domain_allocate(vmd, req, npages)) {
/*
* Allocate them from the free page queues.
*/
vm_domain_free_lock(vmd);
m_ret = vm_phys_alloc_contig(domain, npages, low, high,
alignment, boundary);
vm_domain_free_unlock(vmd);
if (m_ret == NULL) {
vm_domain_freecnt_inc(vmd, npages);
#if VM_NRESERVLEVEL > 0
if (vm_reserv_reclaim_contig(domain, npages, low,
high, alignment, boundary))
goto again;
#endif
}
}
if (m_ret == NULL) {
if (vm_domain_alloc_fail(vmd, object, req))
goto again;
return (NULL);
}
#if VM_NRESERVLEVEL > 0
found:
#endif
for (m = m_ret; m < &m_ret[npages]; m++) {
vm_page_dequeue(m);
vm_page_alloc_check(m);
}
/*
* Initialize the pages. Only the PG_ZERO flag is inherited.
*/
flags = 0;
if ((req & VM_ALLOC_ZERO) != 0)
flags = PG_ZERO;
if ((req & VM_ALLOC_NODUMP) != 0)
flags |= PG_NODUMP;
oflags = object == NULL || (object->flags & OBJ_UNMANAGED) != 0 ?
VPO_UNMANAGED : 0;
busy_lock = VPB_UNBUSIED;
if ((req & (VM_ALLOC_NOBUSY | VM_ALLOC_NOOBJ | VM_ALLOC_SBUSY)) == 0)
busy_lock = VPB_SINGLE_EXCLUSIVER;
if ((req & VM_ALLOC_SBUSY) != 0)
busy_lock = VPB_SHARERS_WORD(1);
if ((req & VM_ALLOC_WIRED) != 0)
vm_wire_add(npages);
if (object != NULL) {
if (object->memattr != VM_MEMATTR_DEFAULT &&
memattr == VM_MEMATTR_DEFAULT)
memattr = object->memattr;
}
for (m = m_ret; m < &m_ret[npages]; m++) {
m->aflags = 0;
m->flags = (m->flags | PG_NODUMP) & flags;
m->busy_lock = busy_lock;
if ((req & VM_ALLOC_WIRED) != 0)
m->wire_count = 1;
m->act_count = 0;
m->oflags = oflags;
if (object != NULL) {
if (vm_page_insert_after(m, object, pindex, mpred)) {
if ((req & VM_ALLOC_WIRED) != 0)
vm_wire_sub(npages);
KASSERT(m->object == NULL,
("page %p has object", m));
mpred = m;
for (m = m_ret; m < &m_ret[npages]; m++) {
if (m <= mpred &&
(req & VM_ALLOC_WIRED) != 0)
m->wire_count = 0;
m->oflags = VPO_UNMANAGED;
m->busy_lock = VPB_UNBUSIED;
/* Don't change PG_ZERO. */
vm_page_free_toq(m);
}
if (req & VM_ALLOC_WAITFAIL) {
VM_OBJECT_WUNLOCK(object);
vm_radix_wait();
VM_OBJECT_WLOCK(object);
}
return (NULL);
}
mpred = m;
} else
m->pindex = pindex;
if (memattr != VM_MEMATTR_DEFAULT)
pmap_page_set_memattr(m, memattr);
pindex++;
}
return (m_ret);
}
/*
* Check a page that has been freshly dequeued from a freelist.
*/
static void
vm_page_alloc_check(vm_page_t m)
{
KASSERT(m->object == NULL, ("page %p has object", m));
KASSERT(m->queue == PQ_NONE && (m->aflags & PGA_QUEUE_STATE_MASK) == 0,
("page %p has unexpected queue %d, flags %#x",
m, m->queue, (m->aflags & PGA_QUEUE_STATE_MASK)));
KASSERT(!vm_page_held(m), ("page %p is held", m));
KASSERT(!vm_page_busied(m), ("page %p is busy", m));
KASSERT(m->dirty == 0, ("page %p is dirty", m));
KASSERT(pmap_page_get_memattr(m) == VM_MEMATTR_DEFAULT,
("page %p has unexpected memattr %d",
m, pmap_page_get_memattr(m)));
KASSERT(m->valid == 0, ("free page %p is valid", m));
}
/*
* vm_page_alloc_freelist:
*
* Allocate a physical page from the specified free page list.
*
* The caller must always specify an allocation class.
*
* allocation classes:
* VM_ALLOC_NORMAL normal process request
* VM_ALLOC_SYSTEM system *really* needs a page
* VM_ALLOC_INTERRUPT interrupt time request
*
* optional allocation flags:
* VM_ALLOC_COUNT(number) the number of additional pages that the caller
* intends to allocate
* VM_ALLOC_WIRED wire the allocated page
* VM_ALLOC_ZERO prefer a zeroed page
*/
vm_page_t
vm_page_alloc_freelist(int freelist, int req)
{
struct vm_domainset_iter di;
vm_page_t m;
int domain;
vm_domainset_iter_page_init(&di, NULL, 0, &domain, &req);
do {
m = vm_page_alloc_freelist_domain(domain, freelist, req);
if (m != NULL)
break;
} while (vm_domainset_iter_page(&di, NULL, &domain) == 0);
return (m);
}
vm_page_t
vm_page_alloc_freelist_domain(int domain, int freelist, int req)
{
struct vm_domain *vmd;
vm_page_t m;
u_int flags;
m = NULL;
vmd = VM_DOMAIN(domain);
again:
if (vm_domain_allocate(vmd, req, 1)) {
vm_domain_free_lock(vmd);
m = vm_phys_alloc_freelist_pages(domain, freelist,
VM_FREEPOOL_DIRECT, 0);
vm_domain_free_unlock(vmd);
if (m == NULL)
vm_domain_freecnt_inc(vmd, 1);
}
if (m == NULL) {
if (vm_domain_alloc_fail(vmd, NULL, req))
goto again;
return (NULL);
}
vm_page_dequeue(m);
vm_page_alloc_check(m);
/*
* Initialize the page. Only the PG_ZERO flag is inherited.
*/
m->aflags = 0;
flags = 0;
if ((req & VM_ALLOC_ZERO) != 0)
flags = PG_ZERO;
m->flags &= flags;
if ((req & VM_ALLOC_WIRED) != 0) {
/*
* The page lock is not required for wiring a page that does
* not belong to an object.
*/
vm_wire_add(1);
m->wire_count = 1;
}
/* Unmanaged pages don't use "act_count". */
m->oflags = VPO_UNMANAGED;
return (m);
}
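/*
 * Illustrative sketch (not part of this change): the "m->flags &= flags"
 * step in vm_page_alloc_freelist_domain() keeps PG_ZERO only when the
 * caller asked for a zeroed page and clears every other flag. The SK_*
 * names and bit values below are hypothetical stand-ins, not the kernel's
 * definitions.
 */

```c
#include <assert.h>

#define SK_PG_ZERO	0x0008u	/* hypothetical value for PG_ZERO */
#define SK_ALLOC_ZERO	0x0040	/* hypothetical value for VM_ALLOC_ZERO */

/*
 * Return the flags a freshly allocated page would start with: PG_ZERO is
 * inherited only if it was already set and the request prefers zeroed
 * pages; all other bits are dropped.
 */
static unsigned int
sk_initial_page_flags(unsigned int page_flags, int req)
{
	unsigned int keep;

	keep = 0;
	if ((req & SK_ALLOC_ZERO) != 0)
		keep = SK_PG_ZERO;
	return (page_flags & keep);
}
```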
static int
vm_page_import(void *arg, void **store, int cnt, int domain, int flags)
{
struct vm_domain *vmd;
int i;
vmd = arg;
/* Only import if we can bring in a full bucket. */
if (cnt == 1 || !vm_domain_allocate(vmd, VM_ALLOC_NORMAL, cnt))
return (0);
domain = vmd->vmd_domain;
vm_domain_free_lock(vmd);
i = vm_phys_alloc_npages(domain, VM_FREEPOOL_DEFAULT, cnt,
(vm_page_t *)store);
vm_domain_free_unlock(vmd);
if (cnt != i)
vm_domain_freecnt_inc(vmd, cnt - i);
return (i);
}
static void
vm_page_release(void *arg, void **store, int cnt)
{
struct vm_domain *vmd;
vm_page_t m;
int i;
vmd = arg;
vm_domain_free_lock(vmd);
for (i = 0; i < cnt; i++) {
m = (vm_page_t)store[i];
vm_phys_free_pages(m, 0);
}
vm_domain_free_unlock(vmd);
vm_domain_freecnt_inc(vmd, cnt);
}
#define VPSC_ANY 0 /* No restrictions. */
#define VPSC_NORESERV 1 /* Skip reservations; implies VPSC_NOSUPER. */
#define VPSC_NOSUPER 2 /* Skip superpages. */
/*
* vm_page_scan_contig:
*
* Scan vm_page_array[] between the specified entries "m_start" and
* "m_end" for a run of contiguous physical pages that satisfy the
* specified conditions, and return the lowest page in the run. The
* specified "alignment" determines the alignment of the lowest physical
* page in the run. If the specified "boundary" is non-zero, then the
* run of physical pages cannot span a physical address that is a
* multiple of "boundary".
*
* "m_end" is never dereferenced, so it need not point to a vm_page
* structure within vm_page_array[].
*
* "npages" must be greater than zero. "m_start" and "m_end" must not
* span a hole (or discontiguity) in the physical address space. Both
* "alignment" and "boundary" must be a power of two.
*/
vm_page_t
vm_page_scan_contig(u_long npages, vm_page_t m_start, vm_page_t m_end,
u_long alignment, vm_paddr_t boundary, int options)
{
struct mtx *m_mtx;
vm_object_t object;
vm_paddr_t pa;
vm_page_t m, m_run;
#if VM_NRESERVLEVEL > 0
int level;
#endif
int m_inc, order, run_ext, run_len;
KASSERT(npages > 0, ("npages is 0"));
KASSERT(powerof2(alignment), ("alignment is not a power of 2"));
KASSERT(powerof2(boundary), ("boundary is not a power of 2"));
m_run = NULL;
run_len = 0;
m_mtx = NULL;
for (m = m_start; m < m_end && run_len < npages; m += m_inc) {
KASSERT((m->flags & PG_MARKER) == 0,
("page %p is PG_MARKER", m));
KASSERT((m->flags & PG_FICTITIOUS) == 0 || m->wire_count == 1,
("fictitious page %p has invalid wire count", m));
/*
* If the current page would be the start of a run, check its
* physical address against the end, alignment, and boundary
* conditions. If it doesn't satisfy these conditions, either
* terminate the scan or advance to the next page that
* satisfies the failed condition.
*/
if (run_len == 0) {
KASSERT(m_run == NULL, ("m_run != NULL"));
if (m + npages > m_end)
break;
pa = VM_PAGE_TO_PHYS(m);
if ((pa & (alignment - 1)) != 0) {
m_inc = atop(roundup2(pa, alignment) - pa);
continue;
}
if (rounddown2(pa ^ (pa + ptoa(npages) - 1),
boundary) != 0) {
m_inc = atop(roundup2(pa, boundary) - pa);
continue;
}
} else
KASSERT(m_run != NULL, ("m_run == NULL"));
vm_page_change_lock(m, &m_mtx);
m_inc = 1;
retry:
if (vm_page_held(m))
run_ext = 0;
#if VM_NRESERVLEVEL > 0
else if ((level = vm_reserv_level(m)) >= 0 &&
(options & VPSC_NORESERV) != 0) {
run_ext = 0;
/* Advance to the end of the reservation. */
pa = VM_PAGE_TO_PHYS(m);
m_inc = atop(roundup2(pa + 1, vm_reserv_size(level)) -
pa);
}
#endif
else if ((object = m->object) != NULL) {
/*
* The page is considered eligible for relocation if
* and only if it could be laundered or reclaimed by
* the page daemon.
*/
if (!VM_OBJECT_TRYRLOCK(object)) {
mtx_unlock(m_mtx);
VM_OBJECT_RLOCK(object);
mtx_lock(m_mtx);
if (m->object != object) {
/*
* The page may have been freed.
*/
VM_OBJECT_RUNLOCK(object);
goto retry;
} else if (vm_page_held(m)) {
run_ext = 0;
goto unlock;
}
}
KASSERT((m->flags & PG_UNHOLDFREE) == 0,
("page %p is PG_UNHOLDFREE", m));
/* Don't care: PG_NODUMP, PG_ZERO. */
if (object->type != OBJT_DEFAULT &&
object->type != OBJT_SWAP &&
object->type != OBJT_VNODE) {
run_ext = 0;
#if VM_NRESERVLEVEL > 0
} else if ((options & VPSC_NOSUPER) != 0 &&
(level = vm_reserv_level_iffullpop(m)) >= 0) {
run_ext = 0;
/* Advance to the end of the superpage. */
pa = VM_PAGE_TO_PHYS(m);
m_inc = atop(roundup2(pa + 1,
vm_reserv_size(level)) - pa);
#endif
} else if (object->memattr == VM_MEMATTR_DEFAULT &&
vm_page_queue(m) != PQ_NONE && !vm_page_busied(m)) {
/*
* The page is allocated but eligible for
* relocation. Extend the current run by one
* page.
*/
KASSERT(pmap_page_get_memattr(m) ==
VM_MEMATTR_DEFAULT,
("page %p has an unexpected memattr", m));
KASSERT((m->oflags & (VPO_SWAPINPROG |
VPO_SWAPSLEEP | VPO_UNMANAGED)) == 0,
("page %p has unexpected oflags", m));
/* Don't care: VPO_NOSYNC. */
run_ext = 1;
} else
run_ext = 0;
unlock:
VM_OBJECT_RUNLOCK(object);
#if VM_NRESERVLEVEL > 0
} else if (level >= 0) {
/*
* The page is reserved but not yet allocated. In
* other words, it is still free. Extend the current
* run by one page.
*/
run_ext = 1;
#endif
} else if ((order = m->order) < VM_NFREEORDER) {
/*
* The page is enqueued in the physical memory
* allocator's free page queues. Moreover, it is the
* first page in a power-of-two-sized run of
* contiguous free pages. Add these pages to the end
* of the current run, and jump ahead.
*/
run_ext = 1 << order;
m_inc = 1 << order;
} else {
/*
* Skip the page for one of the following reasons: (1)
* It is enqueued in the physical memory allocator's
* free page queues. However, it is not the first
* page in a run of contiguous free pages. (This case
* rarely occurs because the scan is performed in
* ascending order.) (2) It is not reserved, and it is
* transitioning from free to allocated. (Conversely,
* the transition from allocated to free for managed
* pages is blocked by the page lock.) (3) It is
* allocated but not contained by an object and not
* wired, e.g., allocated by Xen's balloon driver.
*/
run_ext = 0;
}
/*
* Extend or reset the current run of pages.
*/
if (run_ext > 0) {
if (run_len == 0)
m_run = m;
run_len += run_ext;
} else {
if (run_len > 0) {
m_run = NULL;
run_len = 0;
}
}
}
if (m_mtx != NULL)
mtx_unlock(m_mtx);
if (run_len >= npages)
return (m_run);
return (NULL);
}
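/*
 * Illustrative sketch (not part of this change): the start-of-run test in
 * vm_page_scan_contig() rejects a candidate address that is misaligned or
 * whose run would cross a multiple of "boundary". The userland version
 * below assumes 4 KiB pages and uses hypothetical SK_* names; it mirrors
 * the arithmetic only, not the locking or the skip-ahead logic.
 */

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SK_PAGE_SHIFT	12	/* assumption: 4 KiB pages */

/*
 * Return true if a run of "npages" pages starting at physical address "pa"
 * is aligned to "alignment" and does not cross a multiple of "boundary".
 * A boundary of zero means "no boundary restriction".
 */
static bool
sk_run_start_ok(uint64_t pa, uint64_t npages, uint64_t alignment,
    uint64_t boundary)
{
	uint64_t last;

	assert(npages > 0);
	assert((alignment & (alignment - 1)) == 0);
	assert((boundary & (boundary - 1)) == 0);

	if ((pa & (alignment - 1)) != 0)
		return (false);
	last = pa + (npages << SK_PAGE_SHIFT) - 1;
	/*
	 * The first and last byte differ in a bit at or above the boundary
	 * bit exactly when the run crosses a multiple of "boundary".
	 */
	if (((pa ^ last) & ~(boundary - 1)) != 0)
		return (false);
	return (true);
}
```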
/*
* vm_page_reclaim_run:
*
* Try to relocate each of the allocated virtual pages within the
* specified run of physical pages to a new physical address. Free the
* physical pages underlying the relocated virtual pages. A virtual page
* is relocatable if and only if it could be laundered or reclaimed by
* the page daemon. Whenever possible, a virtual page is relocated to a
* physical address above "high".
*
* Returns 0 if every physical page within the run was already free or
* just freed by a successful relocation. Otherwise, returns a non-zero
* value indicating why the last attempt to relocate a virtual page was
* unsuccessful.
*
* "req_class" must be an allocation class.
*/
static int
vm_page_reclaim_run(int req_class, int domain, u_long npages, vm_page_t m_run,
vm_paddr_t high)
{
struct vm_domain *vmd;
struct mtx *m_mtx;
struct spglist free;
vm_object_t object;
vm_paddr_t pa;
vm_page_t m, m_end, m_new;
int error, order, req;
KASSERT((req_class & VM_ALLOC_CLASS_MASK) == req_class,
("req_class is not an allocation class"));
SLIST_INIT(&free);
error = 0;
m = m_run;
m_end = m_run + npages;
m_mtx = NULL;
for (; error == 0 && m < m_end; m++) {
KASSERT((m->flags & (PG_FICTITIOUS | PG_MARKER)) == 0,
("page %p is PG_FICTITIOUS or PG_MARKER", m));
/*
* Avoid releasing and reacquiring the same page lock.
*/
vm_page_change_lock(m, &m_mtx);
retry:
if (vm_page_held(m))
error = EBUSY;
else if ((object = m->object) != NULL) {
/*
* The page is relocated if and only if it could be
* laundered or reclaimed by the page daemon.
*/
if (!VM_OBJECT_TRYWLOCK(object)) {
mtx_unlock(m_mtx);
VM_OBJECT_WLOCK(object);
mtx_lock(m_mtx);
if (m->object != object) {
/*
* The page may have been freed.
*/
VM_OBJECT_WUNLOCK(object);
goto retry;
} else if (vm_page_held(m)) {
error = EBUSY;
goto unlock;
}
}
KASSERT((m->flags & PG_UNHOLDFREE) == 0,
("page %p is PG_UNHOLDFREE", m));
/* Don't care: PG_NODUMP, PG_ZERO. */
if (object->type != OBJT_DEFAULT &&
object->type != OBJT_SWAP &&
object->type != OBJT_VNODE)
error = EINVAL;
else if (object->memattr != VM_MEMATTR_DEFAULT)
error = EINVAL;
else if (vm_page_queue(m) != PQ_NONE &&
!vm_page_busied(m)) {
KASSERT(pmap_page_get_memattr(m) ==
VM_MEMATTR_DEFAULT,
("page %p has an unexpected memattr", m));
KASSERT((m->oflags & (VPO_SWAPINPROG |
VPO_SWAPSLEEP | VPO_UNMANAGED)) == 0,
("page %p has unexpected oflags", m));
/* Don't care: VPO_NOSYNC. */
if (m->valid != 0) {
/*
* First, try to allocate a new page
* that is above "high". Failing
* that, try to allocate a new page
* that is below "m_run". Allocate
* the new page between the end of
* "m_run" and "high" only as a last
* resort.
*/
req = req_class | VM_ALLOC_NOOBJ;
if ((m->flags & PG_NODUMP) != 0)
req |= VM_ALLOC_NODUMP;
if (trunc_page(high) !=
~(vm_paddr_t)PAGE_MASK) {
m_new = vm_page_alloc_contig(
NULL, 0, req, 1,
round_page(high),
~(vm_paddr_t)0,
PAGE_SIZE, 0,
VM_MEMATTR_DEFAULT);
} else
m_new = NULL;
if (m_new == NULL) {
pa = VM_PAGE_TO_PHYS(m_run);
m_new = vm_page_alloc_contig(
NULL, 0, req, 1,
0, pa - 1, PAGE_SIZE, 0,
VM_MEMATTR_DEFAULT);
}
if (m_new == NULL) {
pa += ptoa(npages);
m_new = vm_page_alloc_contig(
NULL, 0, req, 1,
pa, high, PAGE_SIZE, 0,
VM_MEMATTR_DEFAULT);
}
if (m_new == NULL) {
error = ENOMEM;
goto unlock;
}
KASSERT(!vm_page_wired(m_new),
("page %p is wired", m_new));
/*
* Replace "m" with the new page. For
* vm_page_replace(), "m" must be busy
* and dequeued. Finally, change "m"
* as if vm_page_free() was called.
*/
if (object->ref_count != 0)
pmap_remove_all(m);
m_new->aflags = m->aflags &
~PGA_QUEUE_STATE_MASK;
KASSERT(m_new->oflags == VPO_UNMANAGED,
("page %p is managed", m_new));
m_new->oflags = m->oflags & VPO_NOSYNC;
pmap_copy_page(m, m_new);
m_new->valid = m->valid;
m_new->dirty = m->dirty;
m->flags &= ~PG_ZERO;
vm_page_xbusy(m);
vm_page_dequeue(m);
vm_page_replace_checked(m_new, object,
m->pindex, m);
if (vm_page_free_prep(m))
SLIST_INSERT_HEAD(&free, m,
plinks.s.ss);
/*
* The new page must be deactivated
* before the object is unlocked.
*/
vm_page_change_lock(m_new, &m_mtx);
vm_page_deactivate(m_new);
} else {
m->flags &= ~PG_ZERO;
vm_page_dequeue(m);
if (vm_page_free_prep(m))
SLIST_INSERT_HEAD(&free, m,
plinks.s.ss);
KASSERT(m->dirty == 0,
("page %p is dirty", m));
}
} else
error = EBUSY;
unlock:
VM_OBJECT_WUNLOCK(object);
} else {
MPASS(vm_phys_domain(m) == domain);
vmd = VM_DOMAIN(domain);
vm_domain_free_lock(vmd);
order = m->order;
if (order < VM_NFREEORDER) {
/*
* The page is enqueued in the physical memory
* allocator's free page queues. Moreover, it
* is the first page in a power-of-two-sized
* run of contiguous free pages. Jump ahead
* to the last page within that run, and
* continue from there.
*/
m += (1 << order) - 1;
}
#if VM_NRESERVLEVEL > 0
else if (vm_reserv_is_page_free(m))
order = 0;
#endif
vm_domain_free_unlock(vmd);
if (order == VM_NFREEORDER)
error = EINVAL;
}
}
if (m_mtx != NULL)
mtx_unlock(m_mtx);
if ((m = SLIST_FIRST(&free)) != NULL) {
int cnt;
vmd = VM_DOMAIN(domain);
cnt = 0;
vm_domain_free_lock(vmd);
do {
MPASS(vm_phys_domain(m) == domain);
SLIST_REMOVE_HEAD(&free, plinks.s.ss);
vm_phys_free_pages(m, 0);
cnt++;
} while ((m = SLIST_FIRST(&free)) != NULL);
vm_domain_free_unlock(vmd);
vm_domain_freecnt_inc(vmd, cnt);
}
return (error);
}
#define NRUNS 16
CTASSERT(powerof2(NRUNS));
#define RUN_INDEX(count) ((count) & (NRUNS - 1))
#define MIN_RECLAIM 8
/*
* vm_page_reclaim_contig:
*
* Reclaim allocated, contiguous physical memory satisfying the specified
* conditions by relocating the virtual pages using that physical memory.
* Returns true if reclamation is successful and false otherwise. Since
* relocation requires the allocation of physical pages, reclamation may
* fail due to a shortage of free pages. When reclamation fails, callers
* are expected to perform vm_wait() before retrying a failed allocation
* operation, e.g., vm_page_alloc_contig().
*
* The caller must always specify an allocation class through "req".
*
* allocation classes:
* VM_ALLOC_NORMAL normal process request
* VM_ALLOC_SYSTEM system *really* needs a page
* VM_ALLOC_INTERRUPT interrupt time request
*
* The optional allocation flags are ignored.
*
* "npages" must be greater than zero. Both "alignment" and "boundary"
* must be a power of two.
*/
bool
vm_page_reclaim_contig_domain(int domain, int req, u_long npages,
vm_paddr_t low, vm_paddr_t high, u_long alignment, vm_paddr_t boundary)
{
struct vm_domain *vmd;
vm_paddr_t curr_low;
vm_page_t m_run, m_runs[NRUNS];
u_long count, reclaimed;
int error, i, options, req_class;
KASSERT(npages > 0, ("npages is 0"));
KASSERT(powerof2(alignment), ("alignment is not a power of 2"));
KASSERT(powerof2(boundary), ("boundary is not a power of 2"));
req_class = req & VM_ALLOC_CLASS_MASK;
/*
* The page daemon is allowed to dig deeper into the free page list.
*/
if (curproc == pageproc && req_class != VM_ALLOC_INTERRUPT)
req_class = VM_ALLOC_SYSTEM;
/*
* Return if the number of free pages cannot satisfy the requested
* allocation.
*/
vmd = VM_DOMAIN(domain);
count = vmd->vmd_free_count;
if (count < npages + vmd->vmd_free_reserved || (count < npages +
vmd->vmd_interrupt_free_min && req_class == VM_ALLOC_SYSTEM) ||
(count < npages && req_class == VM_ALLOC_INTERRUPT))
return (false);
/*
* Scan up to three times, relaxing the restrictions ("options") on
* the reclamation of reservations and superpages each time.
*/
for (options = VPSC_NORESERV;;) {
/*
* Find the highest runs that satisfy the given constraints
* and restrictions, and record them in "m_runs".
*/
curr_low = low;
count = 0;
for (;;) {
m_run = vm_phys_scan_contig(domain, npages, curr_low,
high, alignment, boundary, options);
if (m_run == NULL)
break;
curr_low = VM_PAGE_TO_PHYS(m_run) + ptoa(npages);
m_runs[RUN_INDEX(count)] = m_run;
count++;
}
/*
* Reclaim the highest runs in LIFO (descending) order until
* the number of reclaimed pages, "reclaimed", is at least
* MIN_RECLAIM. Reset "reclaimed" each time because each
* reclamation is idempotent, and runs will (likely) recur
* from one scan to the next as restrictions are relaxed.
*/
reclaimed = 0;
for (i = 0; count > 0 && i < NRUNS; i++) {
count--;
m_run = m_runs[RUN_INDEX(count)];
error = vm_page_reclaim_run(req_class, domain, npages,
m_run, high);
if (error == 0) {
reclaimed += npages;
if (reclaimed >= MIN_RECLAIM)
return (true);
}
}
/*
* Either relax the restrictions on the next scan or return if
* the last scan had no restrictions.
*/
if (options == VPSC_NORESERV)
options = VPSC_NOSUPER;
else if (options == VPSC_NOSUPER)
options = VPSC_ANY;
else if (options == VPSC_ANY)
return (reclaimed != 0);
}
}
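/*
 * Illustrative sketch (not part of this change): "m_runs" is used as a
 * ring indexed by RUN_INDEX(count), so only the most recently found runs
 * survive, and they are then consumed newest first. The userland version
 * below stores plain integers instead of vm_page_t and uses hypothetical
 * SK_* names.
 */

```c
#include <assert.h>
#include <stddef.h>

#define SK_NRUNS		16
#define SK_RUN_INDEX(count)	((count) & (SK_NRUNS - 1))

/*
 * Record "total" candidates (numbered 0, 1, 2, ...) in a ring of SK_NRUNS
 * slots, then copy the surviving ones into "out" in LIFO order. Returns
 * the number of entries written to "out".
 */
static size_t
sk_latest_runs_lifo(unsigned long total, unsigned long out[SK_NRUNS])
{
	unsigned long ring[SK_NRUNS];
	unsigned long count;
	size_t i;

	/* Later candidates overwrite the oldest slots. */
	for (count = 0; count < total; count++)
		ring[SK_RUN_INDEX(count)] = count;
	/* Walk backwards from the newest entry, at most SK_NRUNS times. */
	for (i = 0; count > 0 && i < SK_NRUNS; i++) {
		count--;
		out[i] = ring[SK_RUN_INDEX(count)];
	}
	return (i);
}
```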
bool
vm_page_reclaim_contig(int req, u_long npages, vm_paddr_t low, vm_paddr_t high,
u_long alignment, vm_paddr_t boundary)
{
struct vm_domainset_iter di;
int domain;
bool ret;
vm_domainset_iter_page_init(&di, NULL, 0, &domain, &req);
do {
ret = vm_page_reclaim_contig_domain(domain, req, npages, low,
high, alignment, boundary);
if (ret)
break;
} while (vm_domainset_iter_page(&di, NULL, &domain) == 0);
return (ret);
}
/*
* Set the domain in the appropriate page level domainset.
*/
void
vm_domain_set(struct vm_domain *vmd)
{
mtx_lock(&vm_domainset_lock);
if (!vmd->vmd_minset && vm_paging_min(vmd)) {
vmd->vmd_minset = 1;
DOMAINSET_SET(vmd->vmd_domain, &vm_min_domains);
}
if (!vmd->vmd_severeset && vm_paging_severe(vmd)) {
vmd->vmd_severeset = 1;
DOMAINSET_SET(vmd->vmd_domain, &vm_severe_domains);
}
mtx_unlock(&vm_domainset_lock);
}
/*
* Clear the domain from the appropriate page level domainset.
*/
void
vm_domain_clear(struct vm_domain *vmd)
{
mtx_lock(&vm_domainset_lock);
if (vmd->vmd_minset && !vm_paging_min(vmd)) {
vmd->vmd_minset = 0;
DOMAINSET_CLR(vmd->vmd_domain, &vm_min_domains);
if (vm_min_waiters != 0) {
vm_min_waiters = 0;
wakeup(&vm_min_domains);
}
}
if (vmd->vmd_severeset && !vm_paging_severe(vmd)) {
vmd->vmd_severeset = 0;
DOMAINSET_CLR(vmd->vmd_domain, &vm_severe_domains);
if (vm_severe_waiters != 0) {
vm_severe_waiters = 0;
wakeup(&vm_severe_domains);
}
}
/*
* If the pageout daemon needs pages, then tell it that there are
* some free.
*/
if (vmd->vmd_pageout_pages_needed &&
vmd->vmd_free_count >= vmd->vmd_pageout_free_min) {
wakeup(&vmd->vmd_pageout_pages_needed);
vmd->vmd_pageout_pages_needed = 0;
}
/* See comments in vm_wait_doms(). */
if (vm_pageproc_waiters) {
vm_pageproc_waiters = 0;
wakeup(&vm_pageproc_waiters);
}
mtx_unlock(&vm_domainset_lock);
}
/*
* Wait for free pages to exceed the min threshold globally.
*/
void
vm_wait_min(void)
{
mtx_lock(&vm_domainset_lock);
while (vm_page_count_min()) {
vm_min_waiters++;
msleep(&vm_min_domains, &vm_domainset_lock, PVM, "vmwait", 0);
}
mtx_unlock(&vm_domainset_lock);
}
/*
* Wait for free pages to exceed the severe threshold globally.
*/
void
vm_wait_severe(void)
{
mtx_lock(&vm_domainset_lock);
while (vm_page_count_severe()) {
vm_severe_waiters++;
msleep(&vm_severe_domains, &vm_domainset_lock, PVM,
"vmwait", 0);
}
mtx_unlock(&vm_domainset_lock);
}
u_int
vm_wait_count(void)
{
return (vm_severe_waiters + vm_min_waiters + vm_pageproc_waiters);
}
void
vm_wait_doms(const domainset_t *wdoms)
{
/*
* We use racy wakeup synchronization to avoid expensive global
* locking for the pageproc when sleeping with a non-specific vm_wait.
* To handle this, we only sleep for one tick in this instance. It
* is expected that most allocations for the pageproc will come from
* kmem or vm_page_grab* which will use the more specific and
* race-free vm_wait_domain().
*/
if (curproc == pageproc) {
mtx_lock(&vm_domainset_lock);
vm_pageproc_waiters++;
msleep(&vm_pageproc_waiters, &vm_domainset_lock, PVM | PDROP,
"pageprocwait", 1);
} else {
/*
* XXX Ideally we would wait only until the allocation could
* be satisfied. This condition can cause new allocators to
* consume all freed pages while old allocators wait.
*/
mtx_lock(&vm_domainset_lock);
if (vm_page_count_min_set(wdoms)) {
vm_min_waiters++;
msleep(&vm_min_domains, &vm_domainset_lock,
PVM | PDROP, "vmwait", 0);
} else
mtx_unlock(&vm_domainset_lock);
}
}
/*
* vm_wait_domain:
*
* Sleep until free pages are available for allocation.
* - Called in various places after failed memory allocations.
*/
void
vm_wait_domain(int domain)
{
struct vm_domain *vmd;
domainset_t wdom;
vmd = VM_DOMAIN(domain);
vm_domain_free_assert_unlocked(vmd);
if (curproc == pageproc) {
mtx_lock(&vm_domainset_lock);
if (vmd->vmd_free_count < vmd->vmd_pageout_free_min) {
vmd->vmd_pageout_pages_needed = 1;
msleep(&vmd->vmd_pageout_pages_needed,
&vm_domainset_lock, PDROP | PSWP, "VMWait", 0);
} else
mtx_unlock(&vm_domainset_lock);
} else {
if (pageproc == NULL)
panic("vm_wait in early boot");
DOMAINSET_ZERO(&wdom);
DOMAINSET_SET(vmd->vmd_domain, &wdom);
vm_wait_doms(&wdom);
}
}
/*
* vm_wait:
*
* Sleep until free pages are available for allocation in the
* affinity domains of the obj. If obj is NULL, the domain set
* for the calling thread is used.
* Called in various places after failed memory allocations.
*/
void
vm_wait(vm_object_t obj)
{
struct domainset *d;
d = NULL;
/*
* Carefully fetch pointers only once: the struct domainset
* itself is immutable but the pointer might change.
*/
if (obj != NULL)
d = obj->domain.dr_policy;
if (d == NULL)
d = curthread->td_domain.dr_policy;
vm_wait_doms(&d->ds_mask);
}
/*
* vm_domain_alloc_fail:
*
* Called when a page allocation function fails. Informs the
* pagedaemon and performs the requested wait. The domain free
* lock must not be held on entry; the object lock, if an object
* is given, must be held. Returns with the object lock held.
* Returns an error when retry is necessary.
*
*/
static int
vm_domain_alloc_fail(struct vm_domain *vmd, vm_object_t object, int req)
{
vm_domain_free_assert_unlocked(vmd);
atomic_add_int(&vmd->vmd_pageout_deficit,
max((u_int)req >> VM_ALLOC_COUNT_SHIFT, 1));
if (req & (VM_ALLOC_WAITOK | VM_ALLOC_WAITFAIL)) {
if (object != NULL)
VM_OBJECT_WUNLOCK(object);
vm_wait_domain(vmd->vmd_domain);
if (object != NULL)
VM_OBJECT_WLOCK(object);
if (req & VM_ALLOC_WAITOK)
return (EAGAIN);
}
return (0);
}
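/*
 * Illustrative sketch (not part of this change): vm_domain_alloc_fail()
 * charges the pageout deficit with the VM_ALLOC_COUNT() hint carried in
 * the upper bits of "req", but never less than one page. The shift width
 * and the SK_* names below are assumptions made for the example.
 */

```c
#include <assert.h>

#define SK_ALLOC_COUNT_SHIFT	16	/* assumed position of the count hint */
#define SK_ALLOC_COUNT(n)	((int)((unsigned int)(n) << SK_ALLOC_COUNT_SHIFT))

/*
 * Return the number of pages to add to the pageout deficit for a failed
 * request: the caller's count hint, or 1 if no hint was given.
 */
static unsigned int
sk_pageout_deficit(int req)
{
	unsigned int hint;

	hint = (unsigned int)req >> SK_ALLOC_COUNT_SHIFT;
	return (hint > 1 ? hint : 1);
}
```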
/*
* vm_waitpfault:
*
* Sleep until free pages are available for allocation.
* - Called only in vm_fault so that processes page faulting
* can be easily tracked.
* - Sleeps at a lower priority than vm_wait() so that vm_wait()ing
* processes will be able to grab memory first. Do not change
* this balance without careful testing first.
*/
void
vm_waitpfault(struct domainset *dset)
{
/*
* XXX Ideally we would wait only until the allocation could
* be satisfied. This condition can cause new allocators to
* consume all freed pages while old allocators wait.
*/
mtx_lock(&vm_domainset_lock);
if (vm_page_count_min_set(&dset->ds_mask)) {
vm_min_waiters++;
msleep(&vm_min_domains, &vm_domainset_lock, PUSER | PDROP,
"pfault", 0);
} else
mtx_unlock(&vm_domainset_lock);
}
struct vm_pagequeue *
vm_page_pagequeue(vm_page_t m)
{
return (&vm_pagequeue_domain(m)->vmd_pagequeues[m->queue]);
}
static struct mtx *
vm_page_pagequeue_lockptr(vm_page_t m)
{
uint8_t queue;
if ((queue = atomic_load_8(&m->queue)) == PQ_NONE)
return (NULL);
return (&vm_pagequeue_domain(m)->vmd_pagequeues[queue].pq_mutex);
}
static inline void
vm_pqbatch_process_page(struct vm_pagequeue *pq, vm_page_t m)
{
struct vm_domain *vmd;
uint8_t qflags;
CRITICAL_ASSERT(curthread);
vm_pagequeue_assert_locked(pq);
/*
* The page daemon is allowed to set m->queue = PQ_NONE without
* the page queue lock held. In this case it is about to free the page,
* which must not have any queue state.
*/
qflags = atomic_load_8(&m->aflags) & PGA_QUEUE_STATE_MASK;
KASSERT(pq == vm_page_pagequeue(m) || qflags == 0,
("page %p doesn't belong to queue %p but has queue state %#x",
m, pq, qflags));
if ((qflags & PGA_DEQUEUE) != 0) {
if (__predict_true((qflags & PGA_ENQUEUED) != 0)) {
TAILQ_REMOVE(&pq->pq_pl, m, plinks.q);
vm_pagequeue_cnt_dec(pq);
}
vm_page_dequeue_complete(m);
} else if ((qflags & (PGA_REQUEUE | PGA_REQUEUE_HEAD)) != 0) {
if ((qflags & PGA_ENQUEUED) != 0)
TAILQ_REMOVE(&pq->pq_pl, m, plinks.q);
else {
vm_pagequeue_cnt_inc(pq);
vm_page_aflag_set(m, PGA_ENQUEUED);
}
if ((qflags & PGA_REQUEUE_HEAD) != 0) {
KASSERT(m->queue == PQ_INACTIVE,
("head enqueue not supported for page %p", m));
vmd = vm_pagequeue_domain(m);
TAILQ_INSERT_BEFORE(&vmd->vmd_inacthead, m, plinks.q);
} else
TAILQ_INSERT_TAIL(&pq->pq_pl, m, plinks.q);
/*
* PGA_REQUEUE and PGA_REQUEUE_HEAD must be cleared after
* setting PGA_ENQUEUED in order to synchronize with the
* page daemon.
*/
vm_page_aflag_clear(m, PGA_REQUEUE | PGA_REQUEUE_HEAD);
}
}
static void
vm_pqbatch_process(struct vm_pagequeue *pq, struct vm_batchqueue *bq,
uint8_t queue)
{
vm_page_t m;
int i;
for (i = 0; i < bq->bq_cnt; i++) {
m = bq->bq_pa[i];
if (__predict_false(m->queue != queue))
continue;
vm_pqbatch_process_page(pq, m);
}
vm_batchqueue_init(bq);
}
static void
vm_pqbatch_submit_page(vm_page_t m, uint8_t queue)
{
struct vm_batchqueue *bq;
struct vm_pagequeue *pq;
int domain;
KASSERT((m->oflags & VPO_UNMANAGED) == 0,
("page %p is unmanaged", m));
KASSERT(mtx_owned(vm_page_lockptr(m)) ||
(m->object == NULL && (m->aflags & PGA_DEQUEUE) != 0),
("missing synchronization for page %p", m));
KASSERT(queue < PQ_COUNT, ("invalid queue %d", queue));
domain = vm_phys_domain(m);
pq = &vm_pagequeue_domain(m)->vmd_pagequeues[queue];
critical_enter();
bq = DPCPU_PTR(pqbatch[domain][queue]);
if (vm_batchqueue_insert(bq, m)) {
critical_exit();
return;
}
if (!vm_pagequeue_trylock(pq)) {
critical_exit();
vm_pagequeue_lock(pq);
critical_enter();
bq = DPCPU_PTR(pqbatch[domain][queue]);
}
vm_pqbatch_process(pq, bq, queue);
/*
* The page may have been logically dequeued before we acquired the
* page queue lock. In this case, since we either hold the page lock
* or the page is being freed, a different thread cannot be concurrently
* enqueuing the page.
*/
if (__predict_true(m->queue == queue))
vm_pqbatch_process_page(pq, m);
else {
KASSERT(m->queue == PQ_NONE,
("invalid queue transition for page %p", m));
KASSERT((m->aflags & PGA_ENQUEUED) == 0,
("page %p is enqueued with invalid queue index", m));
vm_page_aflag_clear(m, PGA_QUEUE_STATE_MASK);
}
vm_pagequeue_unlock(pq);
critical_exit();
}
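/*
 * Illustrative sketch (not part of this change): vm_pqbatch_submit_page()
 * first tries to park the page in a per-CPU batch and only takes the page
 * queue lock when the batch is full, at which point the whole batch plus
 * the new page are processed. The single-threaded model below uses
 * integers for pages, a running sum for "processing", and hypothetical
 * SK_* names; it omits the lock and the critical section.
 */

```c
#include <assert.h>
#include <stdbool.h>

#define SK_BATCH_SIZE	4	/* hypothetical batch capacity */

struct sk_batch {
	int	cnt;
	int	items[SK_BATCH_SIZE];
};

/* Append "item" if there is room; return false when the batch is full. */
static bool
sk_batch_insert(struct sk_batch *bq, int item)
{
	if (bq->cnt < SK_BATCH_SIZE) {
		bq->items[bq->cnt++] = item;
		return (true);
	}
	return (false);
}

/*
 * Submit one item. Returns 0 if it was merely batched, otherwise the
 * number of items processed (the drained batch plus the new item).
 */
static int
sk_submit(struct sk_batch *bq, int item, long *sum)
{
	int i, n;

	if (sk_batch_insert(bq, item))
		return (0);
	/* Slow path: a real implementation would take the queue lock here. */
	for (i = 0; i < bq->cnt; i++)
		*sum += bq->items[i];
	n = bq->cnt;
	bq->cnt = 0;
	*sum += item;
	return (n + 1);
}
```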
/*
* vm_page_drain_pqbatch: [ internal use only ]
*
* Force all per-CPU page queue batch queues to be drained. This is
* intended for use in severe memory shortages, to ensure that pages
* do not remain stuck in the batch queues.
*/
void
vm_page_drain_pqbatch(void)
{
struct thread *td;
struct vm_domain *vmd;
struct vm_pagequeue *pq;
int cpu, domain, queue;
td = curthread;
CPU_FOREACH(cpu) {
thread_lock(td);
sched_bind(td, cpu);
thread_unlock(td);
for (domain = 0; domain < vm_ndomains; domain++) {
vmd = VM_DOMAIN(domain);
for (queue = 0; queue < PQ_COUNT; queue++) {
pq = &vmd->vmd_pagequeues[queue];
vm_pagequeue_lock(pq);
critical_enter();
vm_pqbatch_process(pq,
DPCPU_PTR(pqbatch[domain][queue]), queue);
critical_exit();
vm_pagequeue_unlock(pq);
}
}
}
thread_lock(td);
sched_unbind(td);
thread_unlock(td);
}
/*
* Complete the logical removal of a page from a page queue. We must be
* careful to synchronize with the page daemon, which may be concurrently
* examining the page with only the page lock held. The page must not be
* in a state where it appears to be logically enqueued.
*/
static void
vm_page_dequeue_complete(vm_page_t m)
{
m->queue = PQ_NONE;
atomic_thread_fence_rel();
vm_page_aflag_clear(m, PGA_QUEUE_STATE_MASK);
}
/*
* vm_page_dequeue_deferred: [ internal use only ]
*
* Request removal of the given page from its current page
* queue. Physical removal from the queue may be deferred
* indefinitely.
*
* The page must be locked.
*/
void
vm_page_dequeue_deferred(vm_page_t m)
{
uint8_t queue;
vm_page_assert_locked(m);
if ((queue = vm_page_queue(m)) == PQ_NONE)
return;
vm_page_aflag_set(m, PGA_DEQUEUE);
vm_pqbatch_submit_page(m, queue);
}
/*
* A variant of vm_page_dequeue_deferred() that does not assert the page
* lock and is only to be called from vm_page_free_prep(). It is just an
* open-coded implementation of vm_page_dequeue_deferred(). Because the
* page is being freed, we can assume that nothing else is scheduling queue
* operations on this page, so we get for free the mutual exclusion that
* is otherwise provided by the page lock.
*/
static void
vm_page_dequeue_deferred_free(vm_page_t m)
{
uint8_t queue;
KASSERT(m->object == NULL, ("page %p has an object reference", m));
if ((m->aflags & PGA_DEQUEUE) != 0)
return;
atomic_thread_fence_acq();
if ((queue = m->queue) == PQ_NONE)
return;
vm_page_aflag_set(m, PGA_DEQUEUE);
vm_pqbatch_submit_page(m, queue);
}
/*
* vm_page_dequeue:
*
* Remove the page from whichever page queue it's in, if any.
* The page must either be locked or unallocated. This constraint
* ensures that the queue state of the page will remain consistent
* after this function returns.
*/
void
vm_page_dequeue(vm_page_t m)
{
struct mtx *lock, *lock1;
struct vm_pagequeue *pq;
uint8_t aflags;
KASSERT(mtx_owned(vm_page_lockptr(m)) || m->order == VM_NFREEORDER,
("page %p is allocated and unlocked", m));
for (;;) {
lock = vm_page_pagequeue_lockptr(m);
if (lock == NULL) {
/*
* A thread may be concurrently executing
* vm_page_dequeue_complete(). Ensure that all queue
* state is cleared before we return.
*/
aflags = atomic_load_8(&m->aflags);
if ((aflags & PGA_QUEUE_STATE_MASK) == 0)
return;
KASSERT((aflags & PGA_DEQUEUE) != 0,
("page %p has unexpected queue state flags %#x",
m, aflags));
/*
* Busy wait until the thread updating queue state is
* finished. Such a thread must be executing in a
* critical section.
*/
cpu_spinwait();
continue;
}
mtx_lock(lock);
if ((lock1 = vm_page_pagequeue_lockptr(m)) == lock)
break;
mtx_unlock(lock);
lock = lock1;
}
KASSERT(lock == vm_page_pagequeue_lockptr(m),
("%s: page %p migrated directly between queues", __func__, m));
KASSERT((m->aflags & PGA_DEQUEUE) != 0 ||
mtx_owned(vm_page_lockptr(m)),
("%s: queued unlocked page %p", __func__, m));
if ((m->aflags & PGA_ENQUEUED) != 0) {
pq = vm_page_pagequeue(m);
TAILQ_REMOVE(&pq->pq_pl, m, plinks.q);
vm_pagequeue_cnt_dec(pq);
}
vm_page_dequeue_complete(m);
mtx_unlock(lock);
}
/*
* Schedule the given page for insertion into the specified page queue.
* Physical insertion of the page may be deferred indefinitely.
*/
static void
vm_page_enqueue(vm_page_t m, uint8_t queue)
{
vm_page_assert_locked(m);
KASSERT(m->queue == PQ_NONE && (m->aflags & PGA_QUEUE_STATE_MASK) == 0,
("%s: page %p is already enqueued", __func__, m));
m->queue = queue;
if ((m->aflags & PGA_REQUEUE) == 0)
vm_page_aflag_set(m, PGA_REQUEUE);
vm_pqbatch_submit_page(m, queue);
}
/*
* vm_page_requeue: [ internal use only ]
*
* Schedule a requeue of the given page.
*
* The page must be locked.
*/
void
vm_page_requeue(vm_page_t m)
{
vm_page_assert_locked(m);
KASSERT(vm_page_queue(m) != PQ_NONE,
("%s: page %p is not logically enqueued", __func__, m));
if ((m->aflags & PGA_REQUEUE) == 0)
vm_page_aflag_set(m, PGA_REQUEUE);
vm_pqbatch_submit_page(m, atomic_load_8(&m->queue));
}
/*
* vm_page_free_prep:
*
* Prepares the given page to be put on the free list,
* disassociating it from any VM object. The caller may return
* the page to the free list only if this function returns true.
*
* The object must be locked. The page must be locked if it is
* managed.
*/
bool
vm_page_free_prep(vm_page_t m)
{
#if defined(DIAGNOSTIC) && defined(PHYS_TO_DMAP)
if (PMAP_HAS_DMAP && (m->flags & PG_ZERO) != 0) {
uint64_t *p;
int i;
p = (uint64_t *)PHYS_TO_DMAP(VM_PAGE_TO_PHYS(m));
for (i = 0; i < PAGE_SIZE / sizeof(uint64_t); i++, p++)
KASSERT(*p == 0, ("vm_page_free_prep %p PG_ZERO %d %jx",
m, i, (uintmax_t)*p));
}
#endif
if ((m->oflags & VPO_UNMANAGED) == 0) {
vm_page_lock_assert(m, MA_OWNED);
KASSERT(!pmap_page_is_mapped(m),
("vm_page_free_prep: freeing mapped page %p", m));
} else
KASSERT(m->queue == PQ_NONE,
("vm_page_free_prep: unmanaged page %p is queued", m));
VM_CNT_INC(v_tfree);
if (vm_page_sbusied(m))
panic("vm_page_free_prep: freeing busy page %p", m);
- vm_page_remove(m);
+ if (m->object != NULL)
+ (void)vm_page_remove(m);
/*
* If the page is fictitious, remove the object association and
* return.
*/
if ((m->flags & PG_FICTITIOUS) != 0) {
KASSERT(m->wire_count == 1,
("fictitious page %p is not wired", m));
KASSERT(m->queue == PQ_NONE,
("fictitious page %p is queued", m));
return (false);
}
/*
* Pages need not be dequeued before they are returned to the physical
* memory allocator, but they must at least be marked for a deferred
* dequeue.
*/
if ((m->oflags & VPO_UNMANAGED) == 0)
vm_page_dequeue_deferred_free(m);
m->valid = 0;
vm_page_undirty(m);
if (vm_page_wired(m) != 0)
panic("vm_page_free_prep: freeing wired page %p", m);
if (m->hold_count != 0) {
m->flags &= ~PG_ZERO;
KASSERT((m->flags & PG_UNHOLDFREE) == 0,
("vm_page_free_prep: freeing PG_UNHOLDFREE page %p", m));
m->flags |= PG_UNHOLDFREE;
return (false);
}
/*
* Restore the default memory attribute to the page.
*/
if (pmap_page_get_memattr(m) != VM_MEMATTR_DEFAULT)
pmap_page_set_memattr(m, VM_MEMATTR_DEFAULT);
#if VM_NRESERVLEVEL > 0
if (vm_reserv_free_page(m))
return (false);
#endif
return (true);
}
/*
* vm_page_free_toq:
*
* Returns the given page to the free list, disassociating it
* from any VM object.
*
* The object must be locked. The page must be locked if it is
* managed.
*/
void
vm_page_free_toq(vm_page_t m)
{
struct vm_domain *vmd;
if (!vm_page_free_prep(m))
return;
vmd = vm_pagequeue_domain(m);
if (m->pool == VM_FREEPOOL_DEFAULT && vmd->vmd_pgcache != NULL) {
uma_zfree(vmd->vmd_pgcache, m);
return;
}
vm_domain_free_lock(vmd);
vm_phys_free_pages(m, 0);
vm_domain_free_unlock(vmd);
vm_domain_freecnt_inc(vmd, 1);
}
/*
* vm_page_free_pages_toq:
*
* Returns a list of pages to the free list, disassociating them
* from any VM object. In other words, this is equivalent to
* calling vm_page_free_toq() for each page in the list.
*
* The objects must be locked. The pages must be locked if they are
* managed.
*/
void
vm_page_free_pages_toq(struct spglist *free, bool update_wire_count)
{
vm_page_t m;
int count;
if (SLIST_EMPTY(free))
return;
count = 0;
while ((m = SLIST_FIRST(free)) != NULL) {
count++;
SLIST_REMOVE_HEAD(free, plinks.s.ss);
vm_page_free_toq(m);
}
if (update_wire_count)
vm_wire_sub(count);
}
/*
* vm_page_wire:
*
* Mark this page as wired down. If the page is fictitious, then
* its wire count must remain one.
*
* The page must be locked.
*/
void
vm_page_wire(vm_page_t m)
{
vm_page_assert_locked(m);
if ((m->flags & PG_FICTITIOUS) != 0) {
KASSERT(m->wire_count == 1,
("vm_page_wire: fictitious page %p's wire count isn't one",
m));
return;
}
if (!vm_page_wired(m)) {
KASSERT((m->oflags & VPO_UNMANAGED) == 0 ||
m->queue == PQ_NONE,
("vm_page_wire: unmanaged page %p is queued", m));
vm_wire_add(1);
}
m->wire_count++;
KASSERT(m->wire_count != 0, ("vm_page_wire: wire_count overflow m=%p", m));
}
/*
* vm_page_unwire:
*
* Release one wiring of the specified page, potentially allowing it to be
* paged out. Returns TRUE if the number of wirings transitions to zero and
* FALSE otherwise.
*
* Only managed pages belonging to an object can be paged out. If the number
* of wirings transitions to zero and the page is eligible for page out, then
* the page is added to the specified paging queue (unless PQ_NONE is
* specified, in which case the page is dequeued if it belongs to a paging
* queue).
*
* If a page is fictitious, then its wire count must always be one.
*
* A managed page must be locked.
*/
bool
vm_page_unwire(vm_page_t m, uint8_t queue)
{
bool unwired;
KASSERT(queue < PQ_COUNT || queue == PQ_NONE,
("vm_page_unwire: invalid queue %u request for page %p",
queue, m));
if ((m->oflags & VPO_UNMANAGED) == 0)
vm_page_assert_locked(m);
unwired = vm_page_unwire_noq(m);
if (!unwired || (m->oflags & VPO_UNMANAGED) != 0 || m->object == NULL)
return (unwired);
if (vm_page_queue(m) == queue) {
if (queue == PQ_ACTIVE)
vm_page_reference(m);
else if (queue != PQ_NONE)
vm_page_requeue(m);
} else {
vm_page_dequeue(m);
if (queue != PQ_NONE) {
vm_page_enqueue(m, queue);
if (queue == PQ_ACTIVE)
/* Initialize act_count. */
vm_page_activate(m);
}
}
return (unwired);
}
/*
*
* vm_page_unwire_noq:
*
* Unwire a page without (re-)inserting it into a page queue. It is up
* to the caller to enqueue, requeue, or free the page as appropriate.
* In most cases, vm_page_unwire() should be used instead.
*/
bool
vm_page_unwire_noq(vm_page_t m)
{
if ((m->oflags & VPO_UNMANAGED) == 0)
vm_page_assert_locked(m);
if ((m->flags & PG_FICTITIOUS) != 0) {
KASSERT(m->wire_count == 1,
("vm_page_unwire: fictitious page %p's wire count isn't one", m));
return (false);
}
if (!vm_page_wired(m))
panic("vm_page_unwire: page %p's wire count is zero", m);
m->wire_count--;
if (m->wire_count == 0) {
vm_wire_sub(1);
return (true);
} else
return (false);
}
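/*
* Illustration only, not part of this change: a minimal userspace sketch of
* the wire-count bookkeeping performed by vm_page_wire() and
* vm_page_unwire_noq() above. It assumes a simplified page that carries only
* a wire count and ignores locking, fictitious pages, and page queues. The
* names ex_page, ex_wired_pages, ex_page_wire, and ex_page_unwire_noq are
* hypothetical stand-ins; ex_wired_pages models the global count adjusted by
* vm_wire_add() and vm_wire_sub().
*/

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified page: only the wire count is modelled. */
struct ex_page {
	unsigned int wire_count;
};

/* Models the system-wide count of wired pages. */
static unsigned int ex_wired_pages;

/* Add one wiring; the global count changes only on the 0 -> 1 transition. */
static void
ex_page_wire(struct ex_page *m)
{
	if (m->wire_count == 0)
		ex_wired_pages++;
	m->wire_count++;
	assert(m->wire_count != 0);	/* overflow check, as in the KASSERT */
}

/*
 * Drop one wiring. Returns true only when the count reaches zero, which is
 * the point at which the caller must decide how to requeue or free the page.
 */
static bool
ex_page_unwire_noq(struct ex_page *m)
{
	assert(m->wire_count != 0);	/* the kernel panics in this case */
	m->wire_count--;
	if (m->wire_count == 0) {
		ex_wired_pages--;
		return (true);
	}
	return (false);
}
```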
/*
* vm_page_activate:
*
* Put the specified page on the active list (if appropriate).
* Ensure that act_count is at least ACT_INIT but do not otherwise
* mess with it.
*
* The page must be locked.
*/
void
vm_page_activate(vm_page_t m)
{
vm_page_assert_locked(m);
if (vm_page_wired(m) || (m->oflags & VPO_UNMANAGED) != 0)
return;
if (vm_page_queue(m) == PQ_ACTIVE) {
if (m->act_count < ACT_INIT)
m->act_count = ACT_INIT;
return;
}
vm_page_dequeue(m);
if (m->act_count < ACT_INIT)
m->act_count = ACT_INIT;
vm_page_enqueue(m, PQ_ACTIVE);
}
/*
* Move the specified page to the tail of the inactive queue, or requeue
* the page if it is already in the inactive queue.
*
* The page must be locked.
*/
void
vm_page_deactivate(vm_page_t m)
{
vm_page_assert_locked(m);
if (vm_page_wired(m) || (m->oflags & VPO_UNMANAGED) != 0)
return;
if (!vm_page_inactive(m)) {
vm_page_dequeue(m);
vm_page_enqueue(m, PQ_INACTIVE);
} else
vm_page_requeue(m);
}
/*
* Move the specified page close to the head of the inactive queue,
* bypassing LRU. A marker page is used to maintain FIFO ordering.
* As with regular enqueues, we use a per-CPU batch queue to reduce
* contention on the page queue lock.
*
* The page must be locked.
*/
void
vm_page_deactivate_noreuse(vm_page_t m)
{
vm_page_assert_locked(m);
if (vm_page_wired(m) || (m->oflags & VPO_UNMANAGED) != 0)
return;
if (!vm_page_inactive(m)) {
vm_page_dequeue(m);
m->queue = PQ_INACTIVE;
}
if ((m->aflags & PGA_REQUEUE_HEAD) == 0)
vm_page_aflag_set(m, PGA_REQUEUE_HEAD);
vm_pqbatch_submit_page(m, PQ_INACTIVE);
}
/*
* vm_page_launder
*
* Put a page in the laundry, or requeue it if it is already there.
*/
void
vm_page_launder(vm_page_t m)
{
vm_page_assert_locked(m);
if (vm_page_wired(m) || (m->oflags & VPO_UNMANAGED) != 0)
return;
if (vm_page_in_laundry(m))
vm_page_requeue(m);
else {
vm_page_dequeue(m);
vm_page_enqueue(m, PQ_LAUNDRY);
}
}
/*
* vm_page_unswappable
*
* Put a page in the PQ_UNSWAPPABLE holding queue.
*/
void
vm_page_unswappable(vm_page_t m)
{
vm_page_assert_locked(m);
KASSERT(!vm_page_wired(m) && (m->oflags & VPO_UNMANAGED) == 0,
("page %p already unswappable", m));
vm_page_dequeue(m);
vm_page_enqueue(m, PQ_UNSWAPPABLE);
}
/*
* Attempt to free the page. If it cannot be freed, do nothing. Returns true
* if the page is freed and false otherwise.
*
* The page must be managed. The page and its containing object must be
* locked.
*/
bool
vm_page_try_to_free(vm_page_t m)
{
vm_page_assert_locked(m);
VM_OBJECT_ASSERT_WLOCKED(m->object);
KASSERT((m->oflags & VPO_UNMANAGED) == 0, ("page %p is unmanaged", m));
if (m->dirty != 0 || vm_page_held(m) || vm_page_busied(m))
return (false);
if (m->object->ref_count != 0) {
pmap_remove_all(m);
if (m->dirty != 0)
return (false);
}
vm_page_free(m);
return (true);
}
/*
* vm_page_advise
*
* Apply the specified advice to the given page.
*
* The object and page must be locked.
*/
void
vm_page_advise(vm_page_t m, int advice)
{
vm_page_assert_locked(m);
VM_OBJECT_ASSERT_WLOCKED(m->object);
if (advice == MADV_FREE)
/*
* Mark the page clean. This will allow the page to be freed
* without first paging it out. MADV_FREE pages are often
* quickly reused by malloc(3), so we do not do anything that
* would result in a page fault on a later access.
*/
vm_page_undirty(m);
else if (advice != MADV_DONTNEED) {
if (advice == MADV_WILLNEED)
vm_page_activate(m);
return;
}
/*
* Clear any references to the page. Otherwise, the page daemon will
* immediately reactivate the page.
*/
vm_page_aflag_clear(m, PGA_REFERENCED);
if (advice != MADV_FREE && m->dirty == 0 && pmap_is_modified(m))
vm_page_dirty(m);
/*
* Place clean pages near the head of the inactive queue rather than
* the tail, thus defeating the queue's LRU operation and ensuring that
* the page will be reused quickly. Dirty pages not already in the
* laundry are moved there.
*/
if (m->dirty == 0)
vm_page_deactivate_noreuse(m);
else if (!vm_page_in_laundry(m))
vm_page_launder(m);
}
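/*
* Illustration only, not part of this change: a userspace sketch of the
* decision made by vm_page_advise() above, reduced to a pure function. It
* assumes that vm_page_undirty() leaves the page clean and that
* vm_page_activate() is not short-circuited by a wired or unmanaged page.
* The enums and ex_page_advise() are hypothetical names used only here.
*/

```c
#include <assert.h>
#include <stdbool.h>

enum ex_advice { EX_MADV_WILLNEED, EX_MADV_DONTNEED, EX_MADV_FREE,
    EX_MADV_OTHER };
enum ex_action { EX_NONE, EX_ACTIVATE, EX_DEACTIVATE_NOREUSE, EX_LAUNDER };

/*
 * Returns the queue operation that the advice leads to, given whether the
 * page's dirty field is set, whether the pmap reports a modification, and
 * whether the page already sits in the laundry queue.
 */
static enum ex_action
ex_page_advise(enum ex_advice advice, bool dirty, bool pmap_modified,
    bool in_laundry)
{
	if (advice == EX_MADV_FREE)
		dirty = false;		/* models vm_page_undirty() */
	else if (advice != EX_MADV_DONTNEED)
		return (advice == EX_MADV_WILLNEED ? EX_ACTIVATE : EX_NONE);

	/* Only MADV_DONTNEED picks up a pending pmap modification. */
	if (advice != EX_MADV_FREE && !dirty && pmap_modified)
		dirty = true;
	assert(advice != EX_MADV_FREE || !dirty);

	/* Clean pages go near the head of the inactive queue. */
	if (!dirty)
		return (EX_DEACTIVATE_NOREUSE);
	/* Dirty pages are laundered unless they are already in the laundry. */
	return (in_laundry ? EX_NONE : EX_LAUNDER);
}
```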
/*
* Grab a page, waiting until we are woken up due to the page
* changing state. We keep on waiting if the page continues
* to be in the object. If the page doesn't exist, first allocate it
* and then conditionally zero it.
*
* This routine may sleep.
*
* The object must be locked on entry. The lock will, however, be released
* and reacquired if the routine sleeps.
*/
vm_page_t
vm_page_grab(vm_object_t object, vm_pindex_t pindex, int allocflags)
{
vm_page_t m;
int sleep;
int pflags;
VM_OBJECT_ASSERT_WLOCKED(object);
KASSERT((allocflags & VM_ALLOC_SBUSY) == 0 ||
(allocflags & VM_ALLOC_IGN_SBUSY) != 0,
("vm_page_grab: VM_ALLOC_SBUSY/VM_ALLOC_IGN_SBUSY mismatch"));
pflags = allocflags &
~(VM_ALLOC_NOWAIT | VM_ALLOC_WAITOK | VM_ALLOC_WAITFAIL);
if ((allocflags & VM_ALLOC_NOWAIT) == 0)
pflags |= VM_ALLOC_WAITFAIL;
retrylookup:
if ((m = vm_page_lookup(object, pindex)) != NULL) {
sleep = (allocflags & VM_ALLOC_IGN_SBUSY) != 0 ?
vm_page_xbusied(m) : vm_page_busied(m);
if (sleep) {
if ((allocflags & VM_ALLOC_NOWAIT) != 0)
return (NULL);
/*
* Reference the page before unlocking and
* sleeping so that the page daemon is less
* likely to reclaim it.
*/
vm_page_aflag_set(m, PGA_REFERENCED);
vm_page_lock(m);
VM_OBJECT_WUNLOCK(object);
vm_page_busy_sleep(m, "pgrbwt", (allocflags &
VM_ALLOC_IGN_SBUSY) != 0);
VM_OBJECT_WLOCK(object);
goto retrylookup;
} else {
if ((allocflags & VM_ALLOC_WIRED) != 0) {
vm_page_lock(m);
vm_page_wire(m);
vm_page_unlock(m);
}
if ((allocflags &
(VM_ALLOC_NOBUSY | VM_ALLOC_SBUSY)) == 0)
vm_page_xbusy(m);
if ((allocflags & VM_ALLOC_SBUSY) != 0)
vm_page_sbusy(m);
return (m);
}
}
m = vm_page_alloc(object, pindex, pflags);
if (m == NULL) {
if ((allocflags & VM_ALLOC_NOWAIT) != 0)
return (NULL);
goto retrylookup;
}
if (allocflags & VM_ALLOC_ZERO && (m->flags & PG_ZERO) == 0)
pmap_zero_page(m);
return (m);
}
/*
* Return the specified range of pages from the given object. For each
* page offset within the range, if a page already exists within the object
* at that offset and it is busy, then wait for it to change state. If,
* instead, the page doesn't exist, then allocate it.
*
* The caller must always specify an allocation class.
*
* allocation classes:
* VM_ALLOC_NORMAL normal process request
* VM_ALLOC_SYSTEM system *really* needs the pages
*
* The caller must always specify that the pages are to be busied and/or
* wired.
*
* optional allocation flags:
* VM_ALLOC_IGN_SBUSY do not sleep on soft busy pages
* VM_ALLOC_NOBUSY do not exclusive busy the page
* VM_ALLOC_NOWAIT do not sleep
* VM_ALLOC_SBUSY set page to sbusy state
* VM_ALLOC_WIRED wire the pages
* VM_ALLOC_ZERO zero and validate any invalid pages
*
* If VM_ALLOC_NOWAIT is not specified, this routine may sleep. Otherwise, it
* may return a partial prefix of the requested range.
*/
int
vm_page_grab_pages(vm_object_t object, vm_pindex_t pindex, int allocflags,
vm_page_t *ma, int count)
{
vm_page_t m, mpred;
int pflags;
int i;
bool sleep;
VM_OBJECT_ASSERT_WLOCKED(object);
KASSERT(((u_int)allocflags >> VM_ALLOC_COUNT_SHIFT) == 0,
("vm_page_grab_pages: VM_ALLOC_COUNT() is not allowed"));
KASSERT((allocflags & VM_ALLOC_NOBUSY) == 0 ||
(allocflags & VM_ALLOC_WIRED) != 0,
("vm_page_grab_pages: the pages must be busied or wired"));
KASSERT((allocflags & VM_ALLOC_SBUSY) == 0 ||
(allocflags & VM_ALLOC_IGN_SBUSY) != 0,
("vm_page_grab_pages: VM_ALLOC_SBUSY/IGN_SBUSY mismatch"));
if (count == 0)
return (0);
pflags = allocflags & ~(VM_ALLOC_NOWAIT | VM_ALLOC_WAITOK |
VM_ALLOC_WAITFAIL | VM_ALLOC_IGN_SBUSY);
if ((allocflags & VM_ALLOC_NOWAIT) == 0)
pflags |= VM_ALLOC_WAITFAIL;
i = 0;
retrylookup:
m = vm_radix_lookup_le(&object->rtree, pindex + i);
if (m == NULL || m->pindex != pindex + i) {
mpred = m;
m = NULL;
} else
mpred = TAILQ_PREV(m, pglist, listq);
for (; i < count; i++) {
if (m != NULL) {
sleep = (allocflags & VM_ALLOC_IGN_SBUSY) != 0 ?
vm_page_xbusied(m) : vm_page_busied(m);
if (sleep) {
if ((allocflags & VM_ALLOC_NOWAIT) != 0)
break;
/*
* Reference the page before unlocking and
* sleeping so that the page daemon is less
* likely to reclaim it.
*/
vm_page_aflag_set(m, PGA_REFERENCED);
vm_page_lock(m);
VM_OBJECT_WUNLOCK(object);
vm_page_busy_sleep(m, "grbmaw", (allocflags &
VM_ALLOC_IGN_SBUSY) != 0);
VM_OBJECT_WLOCK(object);
goto retrylookup;
}
if ((allocflags & VM_ALLOC_WIRED) != 0) {
vm_page_lock(m);
vm_page_wire(m);
vm_page_unlock(m);
}
if ((allocflags & (VM_ALLOC_NOBUSY |
VM_ALLOC_SBUSY)) == 0)
vm_page_xbusy(m);
if ((allocflags & VM_ALLOC_SBUSY) != 0)
vm_page_sbusy(m);
} else {
m = vm_page_alloc_after(object, pindex + i,
pflags | VM_ALLOC_COUNT(count - i), mpred);
if (m == NULL) {
if ((allocflags & VM_ALLOC_NOWAIT) != 0)
break;
goto retrylookup;
}
}
if (m->valid == 0 && (allocflags & VM_ALLOC_ZERO) != 0) {
if ((m->flags & PG_ZERO) == 0)
pmap_zero_page(m);
m->valid = VM_PAGE_BITS_ALL;
}
ma[i] = mpred = m;
m = vm_page_next(m);
}
return (i);
}
/*
* Mapping function for valid or dirty bits in a page.
*
* Inputs are required to range within a page.
*/
vm_page_bits_t
vm_page_bits(int base, int size)
{
int first_bit;
int last_bit;
KASSERT(
base + size <= PAGE_SIZE,
("vm_page_bits: illegal base/size %d/%d", base, size)
);
if (size == 0) /* handle degenerate case */
return (0);
first_bit = base >> DEV_BSHIFT;
last_bit = (base + size - 1) >> DEV_BSHIFT;
return (((vm_page_bits_t)2 << last_bit) -
((vm_page_bits_t)1 << first_bit));
}
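/*
* Illustration only, not part of this change: a userspace sketch of the
* mapping computed by vm_page_bits() above. It assumes 4096-byte pages and
* 512-byte device blocks (DEV_BSHIFT of 9), so the mask has eight bits, one
* per block. The EX_ constants, ex_page_bits_t, and ex_page_bits() are
* hypothetical names; the kernel's actual values depend on the platform.
*/

```c
#include <assert.h>
#include <stdint.h>

#define	EX_PAGE_SIZE	4096	/* assumed page size */
#define	EX_DEV_BSHIFT	9	/* assumed log2 of the device block size */

typedef uint8_t ex_page_bits_t;	/* 4096 / 512 = 8 blocks, one bit each */

/*
 * Returns a mask with one bit set for every block touched by the byte range
 * [base, base + size). Partially covered blocks are included.
 */
static ex_page_bits_t
ex_page_bits(int base, int size)
{
	int first_bit, last_bit;

	assert(base >= 0 && size >= 0 && base + size <= EX_PAGE_SIZE);
	if (size == 0)		/* empty range touches no block */
		return (0);
	first_bit = base >> EX_DEV_BSHIFT;
	last_bit = (base + size - 1) >> EX_DEV_BSHIFT;
	/* (2 << last) - (1 << first) sets bits first..last inclusive. */
	return ((ex_page_bits_t)((2u << last_bit) - (1u << first_bit)));
}
```

/*
* For example, under these assumptions the range base 512, size 1024 covers
* blocks 1 and 2, while base 500, size 24 straddles the boundary between
* blocks 0 and 1 and therefore selects both.
*/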
/*
* vm_page_set_valid_range:
*
* Sets portions of a page valid. The arguments are expected
* to be DEV_BSIZE aligned but if they aren't the bitmap is inclusive
* of any partial chunks touched by the range. The invalid portion of
* such chunks will be zeroed.
*
* (base + size) must be less than or equal to PAGE_SIZE.
*/
void
vm_page_set_valid_range(vm_page_t m, int base, int size)
{
int endoff, frag;
VM_OBJECT_ASSERT_WLOCKED(m->object);
if (size == 0) /* handle degenerate case */
return;
/*
* If the base is not DEV_BSIZE aligned and the valid
* bit is clear, we have to zero out a portion of the
* first block.
*/
if ((frag = rounddown2(base, DEV_BSIZE)) != base &&
(m->valid & (1 << (base >> DEV_BSHIFT))) == 0)
pmap_zero_page_area(m, frag, base - frag);
/*
* If the ending offset is not DEV_BSIZE aligned and the
* valid bit is clear, we have to zero out a portion of
* the last block.
*/
endoff = base + size;
if ((frag = rounddown2(endoff, DEV_BSIZE)) != endoff &&
(m->valid & (1 << (endoff >> DEV_BSHIFT))) == 0)
pmap_zero_page_area(m, endoff,
DEV_BSIZE - (endoff & (DEV_BSIZE - 1)));
/*
* Assert that no previously invalid block that is now being validated
* is already dirty.
*/
KASSERT((~m->valid & vm_page_bits(base, size) & m->dirty) == 0,
("vm_page_set_valid_range: page %p is dirty", m));
/*
* Set valid bits inclusive of any overlap.
*/
m->valid |= vm_page_bits(base, size);
}
/*
* Clear the given bits from the specified page's dirty field.
*/
static __inline void
vm_page_clear_dirty_mask(vm_page_t m, vm_page_bits_t pagebits)
{
uintptr_t addr;
#if PAGE_SIZE < 16384
int shift;
#endif
/*
* If the object is locked and the page is neither exclusive busy nor
* write mapped, then the page's dirty field cannot possibly be
* set by a concurrent pmap operation.
*/
VM_OBJECT_ASSERT_WLOCKED(m->object);
if (!vm_page_xbusied(m) && !pmap_page_is_write_mapped(m))
m->dirty &= ~pagebits;
else {
/*
* The pmap layer can call vm_page_dirty() without
* holding a distinguished lock. The combination of
* the object's lock and an atomic operation suffice
* to guarantee consistency of the page dirty field.
*
* For PAGE_SIZE == 32768 case, compiler already
* properly aligns the dirty field, so no forcible
* alignment is needed. Only require existence of
* atomic_clear_64 when page size is 32768.
*/
addr = (uintptr_t)&m->dirty;
#if PAGE_SIZE == 32768
atomic_clear_64((uint64_t *)addr, pagebits);
#elif PAGE_SIZE == 16384
atomic_clear_32((uint32_t *)addr, pagebits);
#else /* PAGE_SIZE <= 8192 */
/*
* Use a trick to perform a 32-bit atomic on the
* containing aligned word, to not depend on the existence
* of atomic_clear_{8, 16}.
*/
shift = addr & (sizeof(uint32_t) - 1);
#if BYTE_ORDER == BIG_ENDIAN
shift = (sizeof(uint32_t) - sizeof(m->dirty) - shift) * NBBY;
#else
shift *= NBBY;
#endif
addr &= ~(sizeof(uint32_t) - 1);
atomic_clear_32((uint32_t *)addr, pagebits << shift);
#endif /* PAGE_SIZE */
}
}
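/*
* Illustration only, not part of this change: a userspace sketch of the
* shift computation used above when PAGE_SIZE <= 8192, where bits of a
* narrow field are cleared through the aligned 32-bit word that contains
* it. It assumes a one-byte field and uses a plain, non-atomic AND where
* the kernel uses atomic_clear_32(). The union ex_word and both helper
* functions are hypothetical names.
*/

```c
#include <assert.h>
#include <stdint.h>

/* An aligned 32-bit word viewed either whole or as its four bytes. */
union ex_word {
	uint32_t word;
	uint8_t bytes[4];
};

/* Runtime stand-in for the kernel's compile-time BYTE_ORDER test. */
static int
ex_is_big_endian(void)
{
	const uint32_t one = 1;

	return (*(const uint8_t *)&one == 0);
}

/*
 * Clear "bits" in the byte at index "idx" of the word by operating on the
 * whole word. "idx" plays the role of addr & (sizeof(uint32_t) - 1).
 */
static void
ex_clear_byte_bits(union ex_word *w, int idx, uint8_t bits)
{
	int shift;

	assert(idx >= 0 && idx < (int)sizeof(uint32_t));
	shift = idx;
	if (ex_is_big_endian())
		shift = (int)(sizeof(uint32_t) - sizeof(uint8_t)) - shift;
	shift *= 8;		/* bytes to bits, NBBY in the kernel */
	w->word &= ~((uint32_t)bits << shift);
}
```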
/*
* vm_page_set_validclean:
*
* Sets portions of a page valid and clean. The arguments are expected
* to be DEV_BSIZE aligned but if they aren't the bitmap is inclusive
* of any partial chunks touched by the range. The invalid portion of
* such chunks will be zero'd.
*
* (base + size) must be less than or equal to PAGE_SIZE.
*/
void
vm_page_set_validclean(vm_page_t m, int base, int size)
{
vm_page_bits_t oldvalid, pagebits;
int endoff, frag;
VM_OBJECT_ASSERT_WLOCKED(m->object);
if (size == 0) /* handle degenerate case */
return;
/*
* If the base is not DEV_BSIZE aligned and the valid
* bit is clear, we have to zero out a portion of the
* first block.
*/
if ((frag = rounddown2(base, DEV_BSIZE)) != base &&
(m->valid & ((vm_page_bits_t)1 << (base >> DEV_BSHIFT))) == 0)
pmap_zero_page_area(m, frag, base - frag);
/*
* If the ending offset is not DEV_BSIZE aligned and the
* valid bit is clear, we have to zero out a portion of
* the last block.
*/
endoff = base + size;
if ((frag = rounddown2(endoff, DEV_BSIZE)) != endoff &&
(m->valid & ((vm_page_bits_t)1 << (endoff >> DEV_BSHIFT))) == 0)
pmap_zero_page_area(m, endoff,
DEV_BSIZE - (endoff & (DEV_BSIZE - 1)));
/*
* Set valid, clear dirty bits. If validating the entire
* page we can safely clear the pmap modify bit. We also
* use this opportunity to clear the VPO_NOSYNC flag. If a process
* takes a write fault on a MAP_NOSYNC memory area the flag will
* be set again.
*
* We set valid bits inclusive of any overlap, but we can only
* clear dirty bits for DEV_BSIZE chunks that are fully within
* the range.
*/
oldvalid = m->valid;
pagebits = vm_page_bits(base, size);
m->valid |= pagebits;
#if 0 /* NOT YET */
if ((frag = base & (DEV_BSIZE - 1)) != 0) {
frag = DEV_BSIZE - frag;
base += frag;
size -= frag;
if (size < 0)
size = 0;
}
pagebits = vm_page_bits(base, size & (DEV_BSIZE - 1));
#endif
if (base == 0 && size == PAGE_SIZE) {
/*
* The page can only be modified within the pmap if it is
* mapped, and it can only be mapped if it was previously
* fully valid.
*/
if (oldvalid == VM_PAGE_BITS_ALL)
/*
* Perform the pmap_clear_modify() first. Otherwise,
* a concurrent pmap operation, such as
* pmap_protect(), could clear a modification in the
* pmap and set the dirty field on the page before
* pmap_clear_modify() had begun and after the dirty
* field was cleared here.
*/
pmap_clear_modify(m);
m->dirty = 0;
m->oflags &= ~VPO_NOSYNC;
} else if (oldvalid != VM_PAGE_BITS_ALL)
m->dirty &= ~pagebits;
else
vm_page_clear_dirty_mask(m, pagebits);
}
void
vm_page_clear_dirty(vm_page_t m, int base, int size)
{
vm_page_clear_dirty_mask(m, vm_page_bits(base, size));
}
/*
* vm_page_set_invalid:
*
* Invalidates DEV_BSIZE'd chunks within a page. Both the
* valid and dirty bits for the affected areas are cleared.
*/
void
vm_page_set_invalid(vm_page_t m, int base, int size)
{
vm_page_bits_t bits;
vm_object_t object;
object = m->object;
VM_OBJECT_ASSERT_WLOCKED(object);
if (object->type == OBJT_VNODE && base == 0 && IDX_TO_OFF(m->pindex) +
size >= object->un_pager.vnp.vnp_size)
bits = VM_PAGE_BITS_ALL;
else
bits = vm_page_bits(base, size);
if (object->ref_count != 0 && m->valid == VM_PAGE_BITS_ALL &&
bits != 0)
pmap_remove_all(m);
KASSERT((bits == 0 && m->valid == VM_PAGE_BITS_ALL) ||
!pmap_page_is_mapped(m),
("vm_page_set_invalid: page %p is mapped", m));
m->valid &= ~bits;
m->dirty &= ~bits;
}
/*
* vm_page_zero_invalid()
*
* The kernel assumes that the invalid portions of a page contain
* garbage, but such pages can be mapped into memory by user code.
* When this occurs, we must zero out the non-valid portions of the
* page so user code sees what it expects.
*
* Pages are most often semi-valid when the end of a file is mapped
* into memory and the file's size is not page aligned.
*/
void
vm_page_zero_invalid(vm_page_t m, boolean_t setvalid)
{
int b;
int i;
VM_OBJECT_ASSERT_WLOCKED(m->object);
/*
* Scan the valid bits looking for invalid sections that
* must be zeroed. Invalid sub-DEV_BSIZE'd areas ( where the
* valid bit may be set ) have already been zeroed by
* vm_page_set_validclean().
*/
for (b = i = 0; i <= PAGE_SIZE / DEV_BSIZE; ++i) {
if (i == (PAGE_SIZE / DEV_BSIZE) ||
(m->valid & ((vm_page_bits_t)1 << i))) {
if (i > b) {
pmap_zero_page_area(m,
b << DEV_BSHIFT, (i - b) << DEV_BSHIFT);
}
b = i + 1;
}
}
/*
* setvalid is TRUE when we can safely set the zeroed areas
* as being valid. We can do this if there are no cache consistency
* issues, e.g., it is ok to do with UFS, but not ok to do with NFS.
*/
if (setvalid)
m->valid = VM_PAGE_BITS_ALL;
}
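/*
* Illustration only, not part of this change: a userspace sketch of the
* scan in vm_page_zero_invalid() above. It assumes 4096-byte pages and
* 512-byte blocks, so the valid mask has eight bits, and it uses memset()
* where the kernel calls pmap_zero_page_area(). The EX_ constants and
* ex_zero_invalid() are hypothetical names.
*/

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define	EX_PAGE_SIZE	4096			/* assumed page size */
#define	EX_DEV_BSHIFT	9			/* assumed block shift */
#define	EX_DEV_BSIZE	(1 << EX_DEV_BSHIFT)

/*
 * Zero every run of blocks whose valid bit is clear and return the number
 * of bytes zeroed. "b" tracks the start of the current invalid run; a run
 * ends at the next valid block or at the end of the page.
 */
static int
ex_zero_invalid(uint8_t *page, uint8_t valid)
{
	int b, i, nbytes;

	assert(page != NULL);
	nbytes = 0;
	for (b = i = 0; i <= EX_PAGE_SIZE / EX_DEV_BSIZE; ++i) {
		if (i == EX_PAGE_SIZE / EX_DEV_BSIZE ||
		    (valid & (1u << i)) != 0) {
			if (i > b) {
				memset(page + (b << EX_DEV_BSHIFT), 0,
				    (size_t)(i - b) << EX_DEV_BSHIFT);
				nbytes += (i - b) << EX_DEV_BSHIFT;
			}
			b = i + 1;
		}
	}
	return (nbytes);
}
```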
/*
* vm_page_is_valid:
*
* Is (partial) page valid? Note that the case where size == 0
* will return FALSE in the degenerate case where the page is
* entirely invalid, and TRUE otherwise.
*/
int
vm_page_is_valid(vm_page_t m, int base, int size)
{
vm_page_bits_t bits;
VM_OBJECT_ASSERT_LOCKED(m->object);
bits = vm_page_bits(base, size);
return (m->valid != 0 && (m->valid & bits) == bits);
}
/*
* Returns true if all of the specified predicates are true for the entire
* (super)page and false otherwise.
*/
bool
vm_page_ps_test(vm_page_t m, int flags, vm_page_t skip_m)
{
vm_object_t object;
int i, npages;
object = m->object;
if (skip_m != NULL && skip_m->object != object)
return (false);
VM_OBJECT_ASSERT_LOCKED(object);
npages = atop(pagesizes[m->psind]);
/*
* The physically contiguous pages that make up a superpage, i.e., a
* page with a page size index ("psind") greater than zero, will
* occupy adjacent entries in vm_page_array[].
*/
for (i = 0; i < npages; i++) {
/* Always test object consistency, including "skip_m". */
if (m[i].object != object)
return (false);
if (&m[i] == skip_m)
continue;
if ((flags & PS_NONE_BUSY) != 0 && vm_page_busied(&m[i]))
return (false);
if ((flags & PS_ALL_DIRTY) != 0) {
/*
* Calling vm_page_test_dirty() or pmap_is_modified()
* might stop this case from spuriously returning
* "false". However, that would require a write lock
* on the object containing "m[i]".
*/
if (m[i].dirty != VM_PAGE_BITS_ALL)
return (false);
}
if ((flags & PS_ALL_VALID) != 0 &&
m[i].valid != VM_PAGE_BITS_ALL)
return (false);
}
return (true);
}
/*
* Set the page's dirty bits if the page is modified.
*/
void
vm_page_test_dirty(vm_page_t m)
{
VM_OBJECT_ASSERT_WLOCKED(m->object);
if (m->dirty != VM_PAGE_BITS_ALL && pmap_is_modified(m))
vm_page_dirty(m);
}
void
vm_page_lock_KBI(vm_page_t m, const char *file, int line)
{
mtx_lock_flags_(vm_page_lockptr(m), 0, file, line);
}
void
vm_page_unlock_KBI(vm_page_t m, const char *file, int line)
{
mtx_unlock_flags_(vm_page_lockptr(m), 0, file, line);
}
int
vm_page_trylock_KBI(vm_page_t m, const char *file, int line)
{
return (mtx_trylock_flags_(vm_page_lockptr(m), 0, file, line));
}
#if defined(INVARIANTS) || defined(INVARIANT_SUPPORT)
void
vm_page_assert_locked_KBI(vm_page_t m, const char *file, int line)
{
vm_page_lock_assert_KBI(m, MA_OWNED, file, line);
}
void
vm_page_lock_assert_KBI(vm_page_t m, int a, const char *file, int line)
{
mtx_assert_(vm_page_lockptr(m), a, file, line);
}
#endif
#ifdef INVARIANTS
void
vm_page_object_lock_assert(vm_page_t m)
{
/*
* Certain of the page's fields may only be modified by the
* holder of the containing object's lock or the exclusive busy
* holder. Unfortunately, the holder of the write busy is
* not recorded, and thus cannot be checked here.
*/
if (m->object != NULL && !vm_page_xbusied(m))
VM_OBJECT_ASSERT_WLOCKED(m->object);
}
void
vm_page_assert_pga_writeable(vm_page_t m, uint8_t bits)
{
if ((bits & PGA_WRITEABLE) == 0)
return;
/*
* The PGA_WRITEABLE flag can only be set if the page is
* managed and is either exclusively busied or has its object locked.
* Currently, this flag is only set by pmap_enter().
*/
KASSERT((m->oflags & VPO_UNMANAGED) == 0,
("PGA_WRITEABLE on unmanaged page"));
if (!vm_page_xbusied(m))
VM_OBJECT_ASSERT_LOCKED(m->object);
}
#endif
#include "opt_ddb.h"
#ifdef DDB
#include
#include
DB_SHOW_COMMAND(page, vm_page_print_page_info)
{
db_printf("vm_cnt.v_free_count: %d\n", vm_free_count());
db_printf("vm_cnt.v_inactive_count: %d\n", vm_inactive_count());
db_printf("vm_cnt.v_active_count: %d\n", vm_active_count());
db_printf("vm_cnt.v_laundry_count: %d\n", vm_laundry_count());
db_printf("vm_cnt.v_wire_count: %d\n", vm_wire_count());
db_printf("vm_cnt.v_free_reserved: %d\n", vm_cnt.v_free_reserved);
db_printf("vm_cnt.v_free_min: %d\n", vm_cnt.v_free_min);
db_printf("vm_cnt.v_free_target: %d\n", vm_cnt.v_free_target);
db_printf("vm_cnt.v_inactive_target: %d\n", vm_cnt.v_inactive_target);
}
DB_SHOW_COMMAND(pageq, vm_page_print_pageq_info)
{
int dom;
db_printf("pq_free %d\n", vm_free_count());
for (dom = 0; dom < vm_ndomains; dom++) {
db_printf(
"dom %d page_cnt %d free %d pq_act %d pq_inact %d pq_laund %d pq_unsw %d\n",
dom,
vm_dom[dom].vmd_page_count,
vm_dom[dom].vmd_free_count,
vm_dom[dom].vmd_pagequeues[PQ_ACTIVE].pq_cnt,
vm_dom[dom].vmd_pagequeues[PQ_INACTIVE].pq_cnt,
vm_dom[dom].vmd_pagequeues[PQ_LAUNDRY].pq_cnt,
vm_dom[dom].vmd_pagequeues[PQ_UNSWAPPABLE].pq_cnt);
}
}
DB_SHOW_COMMAND(pginfo, vm_page_print_pginfo)
{
vm_page_t m;
boolean_t phys, virt;
if (!have_addr) {
db_printf("show pginfo addr\n");
return;
}
phys = strchr(modif, 'p') != NULL;
virt = strchr(modif, 'v') != NULL;
if (virt)
m = PHYS_TO_VM_PAGE(pmap_kextract(addr));
else if (phys)
m = PHYS_TO_VM_PAGE(addr);
else
m = (vm_page_t)addr;
db_printf(
"page %p obj %p pidx 0x%jx phys 0x%jx q %d hold %d wire %d\n"
" af 0x%x of 0x%x f 0x%x act %d busy %x valid 0x%x dirty 0x%x\n",
m, m->object, (uintmax_t)m->pindex, (uintmax_t)m->phys_addr,
m->queue, m->hold_count, m->wire_count, m->aflags, m->oflags,
m->flags, m->act_count, m->busy_lock, m->valid, m->dirty);
}
#endif /* DDB */
Index: head/sys/vm/vm_page.h
===================================================================
--- head/sys/vm/vm_page.h (revision 349431)
+++ head/sys/vm/vm_page.h (revision 349432)
@@ -1,833 +1,833 @@
/*-
* SPDX-License-Identifier: (BSD-3-Clause AND MIT-CMU)
*
* Copyright (c) 1991, 1993
* The Regents of the University of California. All rights reserved.
*
* This code is derived from software contributed to Berkeley by
* The Mach Operating System project at Carnegie-Mellon University.
*
* Redistribution and use in source and binary forms, with or without
* modification, are permitted provided that the following conditions
* are met:
* 1. Redistributions of source code must retain the above copyright
* notice, this list of conditions and the following disclaimer.
* 2. Redistributions in binary form must reproduce the above copyright
* notice, this list of conditions and the following disclaimer in the
* documentation and/or other materials provided with the distribution.
* 3. Neither the name of the University nor the names of its contributors
* may be used to endorse or promote products derived from this software
* without specific prior written permission.
*
* THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
* ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
* IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
* ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
* FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
* DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
* OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
* HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
* LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
* OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
* SUCH DAMAGE.
*
* from: @(#)vm_page.h 8.2 (Berkeley) 12/13/93
*
*
* Copyright (c) 1987, 1990 Carnegie-Mellon University.
* All rights reserved.
*
* Authors: Avadis Tevanian, Jr., Michael Wayne Young
*
* Permission to use, copy, modify and distribute this software and
* its documentation is hereby granted, provided that both the copyright
* notice and this permission notice appear in all copies of the
* software, derivative works or modified versions, and any portions
* thereof, and that both notices appear in supporting documentation.
*
* CARNEGIE MELLON ALLOWS FREE USE OF THIS SOFTWARE IN ITS "AS IS"
* CONDITION. CARNEGIE MELLON DISCLAIMS ANY LIABILITY OF ANY KIND
* FOR ANY DAMAGES WHATSOEVER RESULTING FROM THE USE OF THIS SOFTWARE.
*
* Carnegie Mellon requests users of this software to return to
*
* Software Distribution Coordinator or Software.Distribution@CS.CMU.EDU
* School of Computer Science
* Carnegie Mellon University
* Pittsburgh PA 15213-3890
*
* any improvements or extensions that they make and grant Carnegie the
* rights to redistribute these changes.
*
* $FreeBSD$
*/
/*
* Resident memory system definitions.
*/
#ifndef _VM_PAGE_
#define _VM_PAGE_
#include
/*
* Management of resident (logical) pages.
*
* A small structure is kept for each resident
* page, indexed by page number. Each structure
* is an element of several collections:
*
* A radix tree used to quickly
* perform object/offset lookups
*
* A list of all pages for a given object,
* so they can be quickly deactivated at
* time of deallocation.
*
* An ordered list of pages due for pageout.
*
* In addition, the structure contains the object
* and offset to which this page belongs (for pageout),
* and sundry status bits.
*
* In general, operations on this structure's mutable fields are
* synchronized using either one of or a combination of the lock on the
* object that the page belongs to (O), the page lock (P),
* the per-domain lock for the free queues (F), or the page's queue
* lock (Q). The physical address of a page is used to select its page
* lock from a pool. The queue lock for a page depends on the value of
* its queue field and described in detail below. If a field is
* annotated below with two of these locks, then holding either lock is
* sufficient for read access, but both locks are required for write
* access. An annotation of (C) indicates that the field is immutable.
*
* In contrast, the synchronization of accesses to the page's
* dirty field is machine dependent (M). In the
* machine-independent layer, the lock on the object that the
* page belongs to must be held in order to operate on the field.
* However, the pmap layer is permitted to set all bits within
* the field without holding that lock. If the underlying
* architecture does not support atomic read-modify-write
* operations on the field's type, then the machine-independent
* layer uses a 32-bit atomic on the aligned 32-bit word that
* contains the dirty field. In the machine-independent layer,
* the implementation of read-modify-write operations on the
* field is encapsulated in vm_page_clear_dirty_mask().
*
* The page structure contains two counters which prevent page reuse.
* Both counters are protected by the page lock (P). The hold
* counter counts transient references obtained via a pmap lookup, and
* is also used to prevent page reclamation in situations where it is
* undesirable to block other accesses to the page. The wire counter
* is used to implement mlock(2) and is non-zero for pages containing
* kernel memory. Pages that are wired or held will not be reclaimed
* or laundered by the page daemon, but are treated differently during
* a page queue scan: held pages remain at their position in the queue,
* while wired pages are removed from the queue and must later be
* re-enqueued appropriately by the unwiring thread. It is legal to
* call vm_page_free() on a held page; doing so causes it to be removed
* from its object and page queue, and the page is released to the
* allocator once the last hold reference is dropped. In contrast,
* wired pages may not be freed.
*
* In some pmap implementations, the wire count of a page table page is
* used to track the number of populated entries.
*
* The busy lock is an embedded reader-writer lock which protects the
* page's contents and identity (i.e., its