
amd64 pmap: fine-grained pv list locking
Needs Review · Public

Authored by kib on Mar 29 2020, 3:36 PM.

Details

Reviewers
alc
markj
jeff
Summary

The patch adds an individual blockable lock for each small page's pv list: OBM (one-byte mutex). This way there is no aliasing and no false blocking on another page's pv list. For the lock, I only need a single byte in md_page. The turnstiles used for blocking are still shared per superpage/hash function, as in the stock kernel, i.e. a blocked thread gets wakeups from its siblings. This seems to occur rarely (I do not have numbers).
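
For reference, here is a minimal sketch of the one-byte state and the uncontended acquire path, reconstructed from the obm_lock_slow() excerpt quoted later in this review; the constant values and the obm_lock() wrapper shown here are assumptions, not necessarily the patch's actual code:

#include <sys/param.h>
#include <machine/atomic.h>

/* Sketch only; state names reconstructed from the obm_lock_slow() excerpt. */
#define	OBM_UNLOCKED	0x00	/* free */
#define	OBM_LOCKED	0x01	/* owned, no waiters */
#define	OBM_CONTESTED	0x02	/* waiters blocked on the shared turnstile */

struct obm {
	uint8_t	lk;		/* the entire lock state is this one byte */
};

void obm_lock_slow(struct obm *obm);	/* turnstile-based slow path */

static inline void
obm_lock(struct obm *obm)
{
	/* Uncontended fast path: UNLOCKED -> LOCKED with acquire semantics. */
	if (__predict_true(atomic_cmpset_acq_8(&obm->lk, OBM_UNLOCKED,
	    OBM_LOCKED) != 0))
		return;
	obm_lock_slow(obm);
}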

For superpage promotion and demotion, I lock the whole run of small pages that constitutes the superpage. This was my largest hesitation, but I convinced myself that the code touches 512 pv entries anyway, so touching 512 bytes from vm_pages that are hot anyway is fine. For the obscure semantics of the pde pv lock, see the herald comment above pmap_pv_list_lock_pde().
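
As a rough illustration of that shape, assuming the usual amd64 pmap.c context (the pv_obm member name and this helper are hypothetical; the real pmap_pv_list_lock_pde() must also handle the fake page and the pde-level semantics mentioned above):

/*
 * Sketch: lock the pv lists of all NPTEPG (512) constituent 4K pages
 * of one 2M superpage, in ascending order for a consistent lock order.
 */
static void
pmap_pv_list_lock_pde_sketch(vm_paddr_t pa)
{
	vm_page_t m;
	int i;

	/* vm_page_array entries for one superpage run are contiguous. */
	m = PHYS_TO_VM_PAGE(pa & ~PDRMASK);
	for (i = 0; i < NPTEPG; i++)
		obm_lock(&m[i].md.pv_obm);	/* hypothetical member name */
}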

It is unfortunate that I cannot easily implement a shared mode for OBM; at the least it would need a 'saturated' shared mode for when there are more active readers than there is space to count them. This is a pity because I think that with fine-grained locking there is a real opportunity to put shared mode to good use.

OBM would also benefit from lock debugging and profiling support.

This is extracted from my WIP branch reworking pv list handling. Apparently, making UMA outperform the highly optimized, specialized pv chunk allocator is hard, and arguably not needed.

Next, I eliminate the global (per-domain) pv chunk list lock. The key observation is that we can get away with a global pmap list, rotating both the pmap list and each pmap's chunk list to approximate the pc_lru ordering. My reasonable belief is that strict LRU does not matter once we are in a situation where chunk reclamation is needed.
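
A sketch of the rotation idea (all_pmaps, pm_all_entry, and all_pmaps_lock are hypothetical names; pm_pvchunk and pc_list follow the existing per-pmap chunk list conventions; pmap lifetime issues are elided):

static void
pv_reclaim_rotate(void)
{
	struct pv_chunk *pc;
	pmap_t pmap;

	mtx_lock(&all_pmaps_lock);
	pmap = TAILQ_FIRST(&all_pmaps);
	if (pmap == NULL) {
		mtx_unlock(&all_pmaps_lock);
		return;
	}
	/* Rotate the pmap to the tail so the next reclaim starts elsewhere. */
	TAILQ_REMOVE(&all_pmaps, pmap, pm_all_entry);
	TAILQ_INSERT_TAIL(&all_pmaps, pmap, pm_all_entry);
	mtx_unlock(&all_pmaps_lock);

	PMAP_LOCK(pmap);
	TAILQ_FOREACH(pc, &pmap->pm_pvchunk, pc_list) {
		/* Try to reclaim free pv entries from pc (elided). */
	}
	PMAP_UNLOCK(pmap);
}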

I do not have a large machine; on a 16 CPU / 32 thread box, with buildkernel -s -j 40 over a tmpfs objdir, I got:

stock
      237.53 real      6782.41 user       567.74 sys
      238.08 real      6769.18 user       569.86 sys
      237.39 real      6783.97 user       570.91 sys

patched
      227.86 real      6730.43 user       307.43 sys
      226.40 real      6737.99 user       304.76 sys
      227.33 real      6733.44 user       305.36 sys

(note the sys time).

On pig1 with NUMA enabled, the results are not that dramatic. This needs to be checked on large box.

stock
      116.11 real      7901.83 user       446.20 sys
      116.03 real      7945.72 user       447.52 sys
      117.55 real      7902.05 user       450.55 sys
pig1 (only obm):
      114.19 real      7879.51 user       438.05 sys
      116.39 real      7882.03 user       436.02 sys
pig1 (obm + removal of pvc locks):
      115.82 real      7911.22 user       429.60 sys
      116.64 real      7898.30 user       433.64 sys
      115.68 real      7918.47 user       430.82 sys

Tested by: pho

Diff Detail

Repository
rS FreeBSD src repository - subversion

Event Timeline


Fix handling of the pv chunk pages ref_count on reclaim.

kib edited the summary of this revision.

Rebase.

I don't mean for this to block the change, but OBM could be more general, since it only uses 2 bits. We could have a BIT_MUTEX_DEFINE() which generates the set of routines currently in obm.h, parameterized by a shift that gets applied to a base address provided by the consumer. That way we could use mutexes that are not byte-aligned, which may be useful in constrained structures like vm_page. It could also be used to easily embed mutexes into pointers whose low bits are unused.
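
One possible shape for that, as a sketch (BIT_MUTEX_DEFINE and every name below are hypothetical; only the trylock side is shown):

/*
 * Hypothetical sketch of BIT_MUTEX_DEFINE(): a 2-bit mutex living at
 * bits [shift, shift+1] of a uintptr_t, so it can be packed into a
 * constrained structure or the unused low bits of an aligned pointer.
 */
#define	BM_LOCKED(shift)	((uintptr_t)1 << (shift))
#define	BM_CONTESTED(shift)	((uintptr_t)2 << (shift))

#define	BIT_MUTEX_DEFINE(name, shift)					\
static inline bool							\
name##_trylock(volatile uintptr_t *w)					\
{									\
	uintptr_t v;							\
									\
	v = atomic_load_ptr(w);						\
	if ((v & (BM_LOCKED(shift) | BM_CONTESTED(shift))) != 0)	\
		return (false);						\
	return (atomic_fcmpset_acq_ptr(w, &v,				\
	    v | BM_LOCKED(shift)) != 0);				\
}

/* For example, a mutex in the two low bits of an aligned pointer field: */
BIT_MUTEX_DEFINE(ptrlock, 0)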

sys/amd64/amd64/pmap.c
2137

Perhaps add a comment explaining that this assignment is required by pv_list_lock_object(), ditto in the #else case below.

2179

Perhaps add a comment explaining that this assignment is required by pv_list_lock_object().

sys/sys/obm.h
59

Are you sure that we have atomic_*_char on all platforms? I do not see it on arm64, for instance, though it does have atomic_*_8. Maybe the latter is a bit clearer anyway.

This revision is now accepted and ready to land. Jul 26 2020, 7:46 PM

Spinning (or adaptive spinning) is mandatory for the primitive to perform well when contended. I presume going off CPU has to be supported in this particular use case, hence the new primitive; and there is no adaptive spinning since there is no information about who owns the lock, and consequently no means to check whether the owner went off CPU.

I had a draft for a routine which supports adaptive spinning and easily fits in a byte (even less). The states are: free, locked, sleeping waiters, owner off cpu.

The key observation is that going off CPU while holding a lock is relatively rare.

The idea is to explicitly track all OBM locks as they are taken, in an array in struct thread or similar. Then the few places which can put the thread off CPU with a lock held check for this and walk the array to mark the locks appropriately. While this puts an explicit upper limit on how many locks can be held at any given time, I think the limitation is more than fine.
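
A sketch of that idea (every name below is hypothetical, including the OBM_OWNER_OFF_CPU state):

/* Hypothetical per-thread OBM tracking for the off-CPU marking scheme. */
#define	TD_OBM_MAX	8	/* explicit cap on OBM locks held at once */

struct thread_obm {
	struct obm	*to_locks[TD_OBM_MAX];
	int		to_count;
};

/* On acquire: record the lock in the owner's array. */
static void
obm_track(struct thread_obm *to, struct obm *obm)
{
	MPASS(to->to_count < TD_OBM_MAX);
	to->to_locks[to->to_count++] = obm;
}

/* Before going off CPU: mark every held lock so waiters stop spinning. */
static void
obm_mark_off_cpu(struct thread_obm *to)
{
	int i;

	for (i = 0; i < to->to_count; i++)
		atomic_set_8(&to->to_locks[i]->lk, OBM_OWNER_OFF_CPU);
}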

That said, I'll do some benchmarks later with stock head, head + vm obj rebase, head + obm, head + vm obj rebase + obm.

kib marked 2 inline comments as done.

Add comments for pv fake page.

This revision now requires review to proceed. Jul 26 2020, 9:13 PM
In D24217#572014, @mjg wrote:

Spinning (or adaptive spinning) is mandatory for the primitive to perform well when contended. I presume going off CPU has to be supported in this particular use case, hence the new primitive; and there is no adaptive spinning since there is no information about who owns the lock, and consequently no means to check whether the owner went off CPU.

[...]

With the removal of aliasing for the pv list locks, contention on the lock almost never occurs. In particular, this explains why the non-NUMA case saw such a large time reduction.

That said, there is a lot that can be done differently or improved in OBM (including the name). I prefer to get this patchset pushed to unblock both further OBM work and further pmap work, without me having to rewrite the patchset on every rebase.

WRT atomic_cmpset_char etc. on non-amd64, I would probably add aliases for arches which do not define them, as needed. For the initial commit, if tinderbox becomes red, I will move OBM to amd64-only.

markj added inline comments.
sys/amd64/amd64/pmap.c
2138

Typo, "fake"

This revision is now accepted and ready to land. Jul 27 2020, 2:45 PM
kib marked an inline comment as done. Jul 27 2020, 6:18 PM
kib added inline comments.
sys/amd64/amd64/pmap.c
2138

This is actually 'make' (used to make all physical addresses work)

kib marked an inline comment as done. Aug 4 2020, 6:39 PM
In D24217#572014, @mjg wrote:

That said, I'll do some benchmarks later with stock head, head + vm obj rebase, head + obm, head + vm obj rebase + obm.

Any progress with this? I plan to commit this on Aug 6.

sys/amd64/amd64/pmap.c
179–180

Empty messages?

454

-> "We only depend ...

543

"page'" -> "page's"

545

"... all run ..."?

2138

The 'm' is missing from the word "make". It's currently "ake".

4912

Is lockp still used?

4979

Is lockp still used?

Sorry, I lost track of this review. I'll do benchmarks later today.

kib marked 15 inline comments as done.

Grammar in comments.
Fill missed static assert messages.
Remove unused lockp argument.
Switch from atomic_XXX_char to atomic_XXX_8 for obm.

This revision now requires review to proceed. Aug 6 2020, 11:20 AM

See the bottom for extra kernel patches + dtrace script

To start off, I have a benchmark comparing FreeBSD and Linux building the Linux kernel on bare metal: 2 sockets * 24 cores (no HT) on Cascade Lake. In both cases the same userspace was untarred to tmpfs and the build was performed chrooted. The result is from early July.

Note: I rounded up the total times due to the differing output formats.

full build (make -s -j 48 bzImage)

Linux      8499.63s user  1150.89s system  3145% cpu  306s real
FreeBSD    8634.15s user  1341.72s system  3102% cpu  321s real

incremental:

Linux        96.81s user   52.20s system  1878% cpu    8s real
FreeBSD      98.01s user  122.86s system  1936% cpu   11s real

For the full build we take more total real time, but at the same time spend less time on CPU over the whole run (3102% cpu < 3145% cpu).

For the incremental case it's just contention, partially in the vfs layer, which I'm going to address later.

While not directly comparable, the problem was easily reproducible on the 2 sockets * 26 cores * 2 threads Skylake in the zoo, and this is where I benchmarked further, FreeBSD-only.

Head was r363939. objc is the object_concurrency branch rebased by markj. I did not benchmark head + obm alone.

full build (make -s -j 104 bzImage) (warmup + 10 builds):

head                 211686.83s user  24536.31s system  6359% cpu  3714s real
head + objc          211729.23s user  23971.41s system  6339% cpu  3718s real
head + objc + obm    211981.34s user  22917.79s system  6341% cpu  3704s real

incremental builds (warmup + 10 builds):

head                   1637.96s user  3431.48s system  3127% cpu  162s real
head + objc            1653.72s user  2289.97s system  2556% cpu  154s real
head + objc + obm      1767.50s user  1720.98s system  2338% cpu  149s real

There is some improvement in both system time and total real time.

Nonetheless, the gap in real time is still not closed. Here is what on-CPU time looks like on a fully patched kernel (gathered with pmcstat):
https://people.freebsd.org/~mjg/linux-build/full-oncpu.svg

The same for off-CPU time, with custom probes in turnstile_wait and sleepq_switch (wrapping around the mi_switch call):
https://people.freebsd.org/~mjg/linux-build/full-offcpu2.svg

[EDIT: replaced with a better graph with higher resolution of timestamps]

Data was edited to remove waitpid, rfork waits and similar.

Total off-CPU time where the kernel is waiting on itself (so to speak) is 25702098460 nanoseconds (or 25.7 seconds), out of which almost half belongs to the pmap pv list (obm). Note there is no information on how much of this overlaps across several threads at the same time. Even so, this explains the difference vs Linux (more real time, less CPU time over the period).

That said, I think the patch should go in, but the kernel needs a more generic solution to 1) track off-CPU time and 2) wait in an adaptive manner. The current hand-rolled de facto sleep locks have neither proper instrumentation nor sane behavior when contended.

I can confirm the system time reduction for native builds. I have a long-running test of a specific stable/12 snapshot (best real time selected out of 10) and the results are pretty stable (the majority of the slowdown comes from contending on the vm object lock):

r359444    33.28 real      2552.36 user       329.16 sys
r360431    33.26 real      2551.49 user       344.60 sys
r361622    33.26 real      2542.66 user       323.06 sys
r361993    33.28 real      2568.95 user       332.53 sys
r362643    33.19 real      2563.73 user       312.56 sys
r362828    33.30 real      2553.93 user       322.67 sys
r363523    33.13 real      2558.10 user       324.39 sys

I could not be bothered to re-test with stock r363939 or r363939 + objc only, so as a rough comparison here is the latest result above vs the fully patched head kernel. Adding objc used to reduce system time to about 300 seconds.

r363523              33.13 real  2558.10 user  324.39 sys
r363939+objc+obm     32.87 real  2587.73 user  231.70 sys

Full breakdown for the interested:
head:

warmup:
      335.91 real     19242.15 user      2210.51 sys
run:
      337.94 real     19247.93 user      2231.28 sys
      338.00 real     19244.56 user      2226.73 sys
      337.97 real     19246.36 user      2225.37 sys
      338.18 real     19242.30 user      2236.71 sys
      337.49 real     19239.67 user      2236.14 sys
      337.75 real     19243.28 user      2235.98 sys
      338.10 real     19242.37 user      2231.36 sys
      336.84 real     19243.65 user      2231.99 sys
      338.28 real     19247.89 user      2234.61 sys
      337.99 real     19246.62 user      2235.55 sys

head + objc:

warmup:
      337.26 real     19221.03 user      2220.93 sys
run:
      337.56 real     19252.80 user      2169.27 sys
      338.08 real     19248.43 user      2171.01 sys
      338.07 real     19253.96 user      2175.03 sys
      337.63 real     19252.79 user      2171.68 sys
      338.44 real     19250.57 user      2175.79 sys
      338.44 real     19255.44 user      2174.68 sys
      338.47 real     19257.38 user      2182.76 sys
      337.53 real     19228.53 user      2178.06 sys
      338.27 real     19252.80 user      2175.39 sys
      338.33 real     19255.45 user      2176.72 sys

head + objc + obm:

warmup:
      335.78 real     19261.35 user      2073.71 sys
run:  
      337.03 real     19267.26 user      2077.54 sys
      335.99 real     19280.36 user      2076.67 sys
      336.41 real     19271.90 user      2082.00 sys
      336.88 real     19256.40 user      2088.40 sys
      336.93 real     19278.11 user      2082.90 sys
      336.66 real     19264.48 user      2092.92 sys
      337.32 real     19273.15 user      2086.42 sys
      336.79 real     19276.21 user      2088.43 sys
      337.05 real     19275.83 user      2082.23 sys
      337.08 real     19276.23 user      2086.50 sys

The profiling I added is pretty crap; I'll productize it later:

index 56515f033f5d..0832f15f64ad 100644
--- a/sys/kern/subr_sleepqueue.c
+++ b/sys/kern/subr_sleepqueue.c
@@ -180,6 +180,10 @@ static void        sleepq_timeout(void *arg);
 SDT_PROBE_DECLARE(sched, , , sleep);
 SDT_PROBE_DECLARE(sched, , , wakeup);
 
+SDT_PROVIDER_DEFINE(kernel);
+SDT_PROBE_DEFINE0(kernel, profile, sleepq, block);
+SDT_PROBE_DEFINE0(kernel, profile, sleepq, wakeup);
+
 /*
  * Initialize SLEEPQUEUE_PROFILING specific sysctl nodes.
  * Note that it must happen after sleepinit() has been fully executed, so
@@ -621,8 +625,10 @@ sleepq_switch(const void *wchan, int pri)
        sched_sleep(td, pri);
        thread_lock_set(td, &sc->sc_lock);
        SDT_PROBE0(sched, , , sleep);
+       SDT_PROBE0(kernel, profile, sleepq, block);
        TD_SET_SLEEPING(td);
        mi_switch(SW_VOL | SWT_SLEEPQ);
+       SDT_PROBE0(kernel, profile, sleepq, wakeup);
        KASSERT(TD_IS_RUNNING(td), ("running but not TDS_RUNNING"));
        CTR3(KTR_PROC, "sleepq resume: thread %p (pid %ld, %s)",
            (void *)td, (long)td->td_proc->p_pid, (void *)td->td_name);
diff --git a/sys/kern/subr_turnstile.c b/sys/kern/subr_turnstile.c
index 01d265277ba1..c50392198be9 100644
--- a/sys/kern/subr_turnstile.c
+++ b/sys/kern/subr_turnstile.c
@@ -176,6 +176,10 @@ SDT_PROBE_DEFINE(sched, , , sleep);
 SDT_PROBE_DEFINE2(sched, , , wakeup, "struct thread *", 
     "struct proc *");
 
+SDT_PROVIDER_DECLARE(kernel);
+SDT_PROBE_DEFINE0(kernel, profile, turnstile, block);
+SDT_PROBE_DEFINE0(kernel, profile, turnstile, wakeup);
+
 static inline void
 propagate_unlock_ts(struct turnstile *top, struct turnstile *ts)
 {
@@ -813,10 +817,14 @@ turnstile_wait(struct turnstile *ts, struct thread *owner, int queue)
                    td->td_tid, lock, lock->lo_name);
 
        SDT_PROBE0(sched, , , sleep);
+       SDT_PROBE0(kernel, profile, turnstile, block);
 
        THREAD_LOCKPTR_ASSERT(td, &ts->ts_lock);
        mi_switch(SW_VOL | SWT_TURNSTILE);
 
+       SDT_PROBE0(kernel, profile, turnstile, wakeup);
        if (LOCK_LOG_TEST(lock, 0))
                CTR4(KTR_LOCK, "%s: td %d free from blocked on [%p] %s",
                    __func__, td->td_tid, lock, lock->lo_name);

dtrace:

#pragma D option dynvarsize=32m
  
fbt::turnstile_wait:entry
{
        self->turnstile = args[0];
}       

fbt::turnstile_wait:return
{
        self->turnstile = 0;
}       

kernel:profile:turnstile:block
/(curthread->td_proc->p_flag & 0x4) == 0/
{

        self->mesg = self->turnstile->ts_lockobj->lo_name;
        self->ts = timestamp;
}       

kernel:profile:sleepq:block
/(curthread->td_proc->p_flag & 0x4) == 0/
{
        self->ts = timestamp;
        self->mesg = curthread->td_wmesg;
}       

kernel:profile:*:wakeup
/self->ts/
{
        @[stack(), self->mesg ? stringof(self->mesg) : probefunc] = sum(timestamp - self->ts);    
        self->ts = 0;
        self->mesg = 0;
}       

dtrace:::END
{
        printa("%k\n%s\n%@d\n\n", @);
}

Then the flamegraph one-liner.

The following part in obm_lock_slow has whitespace issues in indent:

v = atomic_load_8(&obm->lk);
if (v == OBM_UNLOCKED) {
        if (atomic_fcmpset_acq_char(&obm->lk, &v, OBM_LOCKED) != 0)
                break;
        lock_delay(&lda);
        continue;
}

ts = turnstile_trywait(lo);
v = atomic_load_8(&obm->lk);
if (v == OBM_UNLOCKED) {
        turnstile_cancel(ts);
        if (atomic_fcmpset_acq_8(&obm->lk, &v, OBM_LOCKED) != 0)
                break;
        lock_delay(&lda);
        continue;
}
if ((v & OBM_CONTESTED) == 0 &&
    atomic_fcmpset_8(&obm->lk, &v, v | OBM_CONTESTED) == 0) {
        turnstile_cancel(ts);
        continue;
}
turnstile_wait(ts, NULL, TS_SHARED_QUEUE);

Fix indent in obm_lock_slow.

sys/amd64/amd64/pmap.c
363–364

Is this actually the correct place for this "if" statement? Suppose that *lockp is NULL, because this execution of CHANGE_PV_LIST_LOCK_TO_PHYS() is the first to acquire a lock, and that PHYS_TO_VM_PAGE(pa) returns NULL. _new_lock != *_lockp will be false, and so we won't enter this block and acquire the lock associated with pv_fake_page. Am I missing something?

My comments amount to the following fixups: https://people.freebsd.org/~mjg/obm-tidyup.diff

Whatever profiling or spinning may show up is probably a topic for a different review.

sys/kern/kern_obm.c
85

So the entry point blindly lock_delays even if the lock is contested or the fcmpset spuriously failed. Other primitives pass the found value instead, so the main loop can test it without re-reading.
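
A sketch of the suggested pattern (on failure, fcmpset writes the value it found back into v, so the loop can branch on it without another load; the function name is hypothetical):

static bool
obm_try_fast(struct obm *obm)
{
	uint8_t v;

	v = atomic_load_8(&obm->lk);
	while (v == OBM_UNLOCKED) {
		if (atomic_fcmpset_acq_8(&obm->lk, &v, OBM_LOCKED) != 0)
			return (true);
		/* Failure refreshed v; no extra atomic_load_8 needed. */
	}
	return (false);	/* held or contested: fall back to the slow path */
}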

105

If this fails, the lock may have been taken by someone else, which adds an avoidable turnstile trip. Other primitives have a goto label for this.

I did an experiment with a variant which spins. Here is the result: https://people.freebsd.org/~mjg/out2.svg

This extended real time by over 10 seconds.

As such, these locks can be heavily contested. With the stock variant there is no significant real time reduction because a lot of time is spent off CPU. Modifying the routine not to go off CPU runs into another corner case, where threads take many locks one after another and keep fighting for all of them in lockstep. To make performance progress (less real time), at a bare minimum this will need some form of rate limiting for pmap_remove_pages, but I suspect a very different solution is ultimately needed to begin with.

obm fixes from mjg.
Fix fake page pv lock calculation by alc.
Reduce pv list lock scope in pmap_remove_pages().

Remove some bits for spinning, unintended leftover.

Submitted by: mjg

I reran the patch; there is a little more system time and real time, but it's probably random variation.

head                  211686.83s user  24536.31s system  6359% cpu  3714s real
head + objc           211729.23s user  23971.41s system  6339% cpu  3718s real
head + objc + obm     211981.34s user  22917.79s system  6341% cpu  3704s real
head + objc + obmv2   212009.92s user  23044.97s system  6342% cpu  3706s real

In short, this is an improvement, but things still lag behind Linux on this workload, and the approach employed here may not be enough.

Full dump:

warmup:
      336.99 real     19257.34 user      2180.58 sys
run:
      336.95 real     19272.24 user      2085.77 sys
      337.26 real     19272.63 user      2084.47 sys
      337.07 real     19276.98 user      2086.53 sys
      336.70 real     19273.83 user      2085.77 sys
      337.27 real     19279.45 user      2082.67 sys
      336.74 real     19281.96 user      2083.34 sys
      337.17 real     19279.00 user      2087.81 sys
      336.46 real     19261.84 user      2092.63 sys
      336.94 real     19274.54 user      2090.42 sys
      336.54 real     19280.04 user      2084.91 sys

I'll give the pv entry reclamation code a careful reading tomorrow.

sys/amd64/amd64/pmap.c
455

Elsewhere there are two spaces after a period.

498–499

This sentence, specifically the part that says "... is locked according to ...", doesn't really explain anything to me.

502

"in the process" of what?

541–547

Doesn't this comment describe the next function and not the one that immediately follows?

591

Although it makes no functional difference, I think that this would more correctly be NPTEPG.

2138–2142

"Initialize pv_fake page, which is used to make pv_list locking work for physical addresses not covered by vm_page_array[]. In particular, it is needed by pv_list_lock_object() and CHANGE_PV_LIST_LOCK_TO_PHYS().

3792–3793

"Add the pmap to the global list of pmaps, which is used during pv chunk reclamation. The pmap is ..."

kib marked 3 inline comments as done. Aug 7 2020, 11:50 AM
kib marked 7 inline comments as done.

Editing of comments.

I'm trying to finish my final review of this change today.

sys/amd64/amd64/pmap.c
356–367

This could be shortened to:

vm_page_t _m;

_m = PHYS_TO_VM_PAGE(pa);
if (_m == NULL)
        _m = &pv_fake_page;
CHANGE_PV_LIST_LOCK_TO_VM_PAGE(lockp, _m);
498

"indicates whether the PV llst for m is already locked."

510

Elsewhere, e.g., promotion, I've written this as atop(VM_PAGE_TO_PHYS(m) & PG_FRAME & PDRMASK) so that there is only one arithmetic operation at run-time.

521

Since this is iterating at 4KB granularity, NPTEPG would seem more appropriate.

531

Since this is iterating at 4KB granularity, NPTEPG would seem more appropriate.

588

Elsewhere, e.g., promotion, I've written this as atop(VM_PAGE_TO_PHYS(m) & PG_FRAME & PDRMASK).

3793

"pmap is" is duplicated.

4282–4285

Why move these given that their use is localized to the code that follows?

kib marked 8 inline comments as done. Aug 30 2020, 9:21 PM
kib added inline comments.
sys/amd64/amd64/pmap.c
4282–4285

I added an assert in pmap_init() that uses these constants, and that is some 2K lines before this place, so I moved the defines to a logical place earlier.

kib marked an inline comment as done.

Handle current batch of notes from Alan.

I gave it a spin with poudriere on top of the objc branch. Looks like the passed pv pointer was NULL.

fault virtual address   = 0x18
fault code              = supervisor write data, page not present
instruction pointer     = 0x20:0xffffffff808aaf12
stack pointer           = 0x0:0xfffffe03ac2d5030
frame pointer           = 0x0:0xfffffe03ac2d5040
code segment            = base 0x0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 29183 (xsltproc)
trap number             = 12
panic: page fault
cpuid = 0
time = 1598919221
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe03ac2d4ce0
vpanic() at vpanic+0x182/frame 0xfffffe03ac2d4d30
panic() at panic+0x43/frame 0xfffffe03ac2d4d90
trap_fatal() at trap_fatal+0x387/frame 0xfffffe03ac2d4df0
trap_pfault() at trap_pfault+0x4f/frame 0xfffffe03ac2d4e50
trap() at trap+0x277/frame 0xfffffe03ac2d4f60
calltrap() at calltrap+0x8/frame 0xfffffe03ac2d4f60
--- trap 0xc, rip = 0xffffffff808aaf12, rsp = 0xfffffe03ac2d5030, rbp = 0xfffffe03ac2d5040 ---
free_pv_entry() at free_pv_entry+0x42/frame 0xfffffe03ac2d5040
pmap_promote_pde() at pmap_promote_pde+0x56b/frame 0xfffffe03ac2d50b0
pmap_enter() at pmap_enter+0xc7f/frame 0xfffffe03ac2d5190
vm_fault() at vm_fault+0x15d5/frame 0xfffffe03ac2d52b0
vm_fault_trap() at vm_fault_trap+0x6d/frame 0xfffffe03ac2d5300
trap_pfault() at trap_pfault+0x1f8/frame 0xfffffe03ac2d5360
trap() at trap+0x400/frame 0xfffffe03ac2d5470
calltrap() at calltrap+0x8/frame 0xfffffe03ac2d5470
--- trap 0xc, rip = 0x800262370, rsp = 0x7fffffffdf60, rbp = 0x7fffffffdfe0 ---
In D24217#583751, @mjg wrote:

I gave it a spin with poudriere on top of objc branch. Looks like the passed pv pointer was NULL.

fault virtual address   = 0x18
free_pv_entry() at free_pv_entry+0x42/frame 0xfffffe03ac2d5040
pmap_promote_pde() at pmap_promote_pde+0x56b/frame 0xfffffe03ac2d50b0

I assume the kernel was configured without INVARIANTS?
Can you get a panic on an INVARIANTS-enabled kernel? (There is a designated check for pv != NULL in pmap_pvh_free().)

After a few hours with INVARIANTS I got:

panic: Bad link elm 0xfffff842e7139138 prev->next != elm
cpuid = 90
time = 1598985147
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe067abac0e0
vpanic() at vpanic+0x182/frame 0xfffffe067abac130
panic() at panic+0x43/frame 0xfffffe067abac190
pmap_remove_pages() at pmap_remove_pages+0xb0f/frame 0xfffffe067abac290
vmspace_exit() at vmspace_exit+0xa3/frame 0xfffffe067abac2d0
exit1() at exit1+0x532/frame 0xfffffe067abac340
sys_sys_exit() at sys_sys_exit+0xd/frame 0xfffffe067abac350
amd64_syscall() at amd64_syscall+0x140/frame 0xfffffe067abac470
fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe067abac470
--- syscall (1, FreeBSD ELF64, sys_sys_exit), rip = 0x802183a2a, rsp = 0x7fffffffd918, rbp = 0x7fffffffd930 ---

Regardless of the above I suggest the following: https://people.freebsd.org/~mjg/obm.diff

This adds basic LOCK_PROFILING support to obm, makes it easier to add WITNESS later, and converts some inlines into macros so that __FILE__/__LINE__ point at the callers. I find profiling mandatory, as these locks *are* contended.

Rebase, fix conflicts.
Add initial obm lock profiling support (by mjg).

I ran 675 out of 723 stress tests with D24217.83480.diff without observing any problems.