Right now, vm_page_dequeue() may be called with or without the page lock
held. If the lock is not held, nothing prevents a different thread from
enqueuing the page before vm_page_dequeue() returns. In practice, this
does not happen because vm_page_dequeue() is called either with the lock
held or from the page allocation functions (so other threads cannot be
manipulating the page).
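
For context, the relevant shape of the current code is roughly the
following (a simplified sketch, not the actual source; the queue lookup
and locking helpers are used loosely here and the removal details are
elided):

	void
	vm_page_dequeue(vm_page_t m)
	{
		struct vm_pagequeue *pq;

		if (m->queue == PQ_NONE)
			return;	/* may race with a pending dequeue */
		pq = vm_page_pagequeue(m);
		vm_pagequeue_lock(pq);
		/* ... remove m from pq and clear its queue state ... */
		vm_pagequeue_unlock(pq);
	}
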
Suppose m is wired and on a page queue. Suppose the page daemon visits
m and schedules a deferred dequeue. This sets PGA_DEQUEUE and creates a
batch queue entry for the page. A bit later, the batch queue is
processed with only the page queue lock held. This processing will set
m->queue = PQ_NONE, issue a release fence, and clear PGA_DEQUEUE.
Suppose that a thread concurrently unwires the page. It will call
vm_page_dequeue(), which returns if it observes m->queue == PQ_NONE.
Thus, vm_page_dequeue() may return while PGA_DEQUEUE is still set, and
in particular, the thread processing the batch queue will clear all
queue state flags at some point after vm_page_dequeue() returns.
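
To illustrate, the interleaving looks roughly like this (a sketch that
mirrors the description above, not the actual source):

	/* Thread A: processes the batch queue, page queue lock held. */
	m->queue = PQ_NONE;
	atomic_thread_fence_rel();
	vm_page_aflag_clear(m, PGA_DEQUEUE);

	/*
	 * Thread B: unwires the page and may run between the first and
	 * last steps above.  It observes m->queue == PQ_NONE and returns
	 * from vm_page_dequeue() while PGA_DEQUEUE is still set.
	 */
	vm_page_dequeue(m);
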
To fix this, make vm_page_dequeue() return only once the concurrent
dequeue has completed. The concurrent dequeue will happen in a critical
section, so we should end up looping for only a short period.
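
One way to express the wait is sketched below, assuming PGA_DEQUEUE is
the only indicator of a pending deferred dequeue (the actual change may
structure this differently):

	void
	vm_page_dequeue(vm_page_t m)
	{
		struct vm_pagequeue *pq;

		for (;;) {
			if (m->queue == PQ_NONE) {
				/*
				 * A deferred dequeue may still be in
				 * flight; it runs in a critical section,
				 * so this loop should be short.
				 */
				if ((m->aflags & PGA_DEQUEUE) == 0)
					return;
				cpu_spinwait();
				continue;
			}
			pq = vm_page_pagequeue(m);
			vm_pagequeue_lock(pq);
			/*
			 * A complete version would also recheck that pq
			 * still matches m->queue.
			 */
			if (m->queue != PQ_NONE) {
				/* ... remove m from its page queue ... */
				vm_pagequeue_unlock(pq);
				return;
			}
			vm_pagequeue_unlock(pq);
		}
	}
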
While here, do some related cleanup: inline vm_page_dequeue_locked()
into its only caller and delete a prototype for the unimplemented
vm_page_requeue_locked(). I added a volatile qualifier to "queue" in
r333703; drop it and instead use atomic_load_*() only where it is
specifically needed.
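
As an illustration of the last point, an unlocked reader that wants a
snapshot of the queue index can make the load explicit (hypothetical
example, not taken from the patch):

	uint8_t queue;

	/* Explicit atomic load instead of a volatile-qualified field. */
	queue = atomic_load_8(&m->queue);
	if (queue == PQ_NONE)
		return;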