Details

Reviewers

alc
markj
jeff

Summary

The patch adds individual blockable lock for each small page' pv list. OBM (one byte mutex). This way, there is no aliasing and no false blocking on other page pv list. For lock, I only need single byte in md_page. The turnstiles used to block are still shared per superpage/hash function as it was in the stock kernel, i.e. blocked thread gets wakeups from the siblings. This seems to occur rare (I do not have numbers).

For superpages promotion and demotion, I lock the whole run of small pages which constitutes the superpage. This was the largest hesitation, but I convinced myself that the code touches 512 pv entries anyway, so touching 512 bytes from vm_pages that are hot anyway is fine. For obscure semantic of pde pv lock, see the herald comment above pmap_pv_list_lock_pde().

Unfortunate is that I cannot easily implement shared mode for OBM, or at least it would need a 'saturated' shared mode when there are more active readers than the space to count them. It is unfortunate because I think with fine-grained locking there is actually the opportunity to really see shared mode utilized for good.

OBM would also benefit from the lock debugging and profiling support.

This is extracted from my WIP branch for pv lists handling rework. Apparently make UMA to outperform highly-optimized specialized pv chunk allocator is hard, and arguably not needed.

Next, I eliminate the global (per-domain) pv chunk list lock. For this, a note is that we can get away with the global pmap list and rotate both pmap list and chunks list in the pmap to approach pc_lru ordering. My reasonable belief is that strict LRU does not matter if we are in situation where chunks reclamation is needed.

I do not have large machine, on 16 CPU / 32 threads box, with buildkernel -s -j 40 over tmpfs objdir, I got

stock
      237.53 real      6782.41 user       567.74 sys
      238.08 real      6769.18 user       569.86 sys
      237.39 real      6783.97 user       570.91 sys

patched
      227.86 real      6730.43 user       307.43 sys
      226.40 real      6737.99 user       304.76 sys
      227.33 real      6733.44 user       305.36 sys

(look for sys time).

On pig1 with NUMA enabled, the results are not that dramatic. This needs to be checked on large box.

stock
      116.11 real      7901.83 user       446.20 sys
      116.03 real      7945.72 user       447.52 sys
      117.55 real      7902.05 user       450.55 sys
pig1 (only obm):
      114.19 real      7879.51 user       438.05 sys
      116.39 real      7882.03 user       436.02 sys
pig1 (obm + removal of pvc locks):
      115.82 real      7911.22 user       429.60 sys
      116.64 real      7898.30 user       433.64 sys
      115.68 real      7918.47 user       430.82 sys

Tested by: pho

Diff Detail

Repository

rG FreeBSD src repository

Lint

Lint Skipped

Unit

Tests Skipped

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

I don't mean for this to block the change, but: OBM could be more general since it is only using 2 bits. We could have a BIT_MUTEX_DEFINE() which generates the set of routines currently in obm.h, and is parameterized by a shift that gets applied to a base address provided by the consumer. This way we could use mutexes that are not byte-aligned, which may be useful in constrained structures like vm_page. It could also be used to easily embed mutexes into pointers, where low bits are unused.

sys/amd64/amd64/pmap.c
2381	Perhaps add a comment explaining that this assignment is required by pv_list_lock_object(), ditto in the #else case below.
2422	Perhaps add a comment explaining that this assignment is required by pv_list_lock_object().
sys/sys/obm.h
59	Are you sure that we have atomic__char on all platforms? I do not see it on arm64 for instance, though it does have atomic__8. Maybe the latter is a bit more clear anyway.

This revision is now accepted and ready to land.Jul 26 2020, 7:46 PM

Spinning (or adaptive spinning) is mandatory for the primitive to perform when contended. I presume going off CPU in this particular use case has to be supported, hence the new primitive and there is no adaptive spinning since there is no information who owns the lock and consequently no means to check if they went off CPU.

I had a draft for a routine which supports adaptive spinning and easily fits in a byte (even less). The states are: free, locked, sleeping waiters, owner off cpu.

Key observation is that going off CPU while holding a lock is relatively rare.

The idea is to explicitly track all OBM locks as they are taken in an array in struct thread or similar. Then the few places which can put the thread off CPU with the lock held check for it and walk the array to mark the locks appropriately. While puts an explicit upper limit at how many locks should be taken at any given time, I think the limitation is more than fine.

That said, I'll do some benchmarks later with stock head, head + vm obj rebase, head + obm, head + vm obj rebase + obm.

Add comments for pv fake page.

This revision now requires review to proceed.Jul 26 2020, 9:13 PM

Harbormaster completed remote builds in B32575: Diff 74982.Jul 26 2020, 9:13 PM

In D24217#572014, @mjg wrote:

Spinning (or adaptive spinning) is mandatory for the primitive to perform when contended. I presume going off CPU in this particular use case has to be supported, hence the new primitive and there is no adaptive spinning since there is no information who owns the lock and consequently no means to check if they went off CPU.

I had a draft for a routine which supports adaptive spinning and easily fits in a byte (even less). The states are: free, locked, sleeping waiters, owner off cpu.

Key observation is that going off CPU while holding a lock is relatively rare.

The idea is to explicitly track all OBM locks as they are taken in an array in struct thread or similar. Then the few places which can put the thread off CPU with the lock held check for it and walk the array to mark the locks appropriately. While puts an explicit upper limit at how many locks should be taken at any given time, I think the limitation is more than fine.

That said, I'll do some benchmarks later with stock head, head + vm obj rebase, head + obm, head + vm obj rebase + obm.

With removal of the aliasing for the pv list locks, contention on the lock is almost never occurring. In particular, this explains why the non-NUMA case got such large time reduction.

That said, there is a lot that can be done differently or improved for OBM (including the name). I prefer to get this patchset pushed to unblock both more OBM work and more pmap work to occur, without me having to rewrite the patchset on rebase.

WRT atomic_cmpset_char etc on non-amd64, I probably would add aliases for arches which do not have them defined, as needed. For initial commit, if tb become red, I will move OBM to amd64-only.

markj accepted this revision.Jul 27 2020, 2:45 PM

markj added inline comments.

sys/amd64/amd64/pmap.c
2382	Typo, "fake"

This revision is now accepted and ready to land.Jul 27 2020, 2:45 PM

kib marked an inline comment as done.Jul 27 2020, 6:18 PM

kib added inline comments.

sys/amd64/amd64/pmap.c
2382	This is actually 'make' (`used to make all physical addresses work`)

In D24217#572014, @mjg wrote:

That said, I'll do some benchmarks later with stock head, head + vm obj rebase, head + obm, head + vm obj rebase + obm.

Any progress with this ? I plan to commit this on Aug 6.

alc added inline comments.Aug 6 2020, 2:55 AM

sys/amd64/amd64/pmap.c
182–183	Empty messages?
467	-> "We only depend ...
556	"page'" -> "page's"
558	"... all run ..."?
2382	The 'm' is missing from the word "make". It's currently "ake".
5470	Is `lockp` still used?
5537	Is `lockp` still used?

Sorry, lost this review. I'll do benchmarks later today.

Grammar in comments.
Fill missed static assert messages.
Remove unused lockp argument.
Switch from atomic_XXX_char to atomic_XXX _8 for obm.

This revision now requires review to proceed.Aug 6 2020, 11:20 AM

Harbormaster completed remote builds in B32801: Diff 75483.Aug 6 2020, 11:20 AM

See the bottom for extra kernel patches + dtrace script

To start off I have a benchmark comparing FreeBSD and Linux building the Linux kernel on bare metal, 2 sockets * 24 cores (no HT) on Cascade Lake. In both cases the same userspace was untarred to tmpfs and the build was performed chrooted. The result is from early July.

Note: I rounded up total times due to different format.

full build (make -s -j 48 bzImage)

Linux	8499.63s user 1150.89s system 3145% cpu 306s real
FreeBSD	8634.15s user 1341.72s system 3102% cpu 321s real

incremental:

Linux	96.81s user 52.20s system 1878% cpu 8s real
FreeBSD	98.01s user 122.86s system 1936% cpu 11s real

For full build case we take more in total real time but at the same time are less on CPU in the entire process (3102% cpu < 3145% cpu).

For incremental case it's just contention, partially in vfs layer which I'm going to address later.

While not directly comparable, the problem was easily reproducible on the 2 sockets * 26 cores * 2 threads Skylake in zoo and this is where I benchmarked further FreeBSD-only.

Head was r363939. objc is the object_concurrency branch rebased by markj. I did not benchmark head + obm only.

full build (make -s -j 104 bzImage) (warmup + 10 builds):

head	211686.83s user 24536.31s system 6359% cpu 3714s real
head + objc	211729.23s user 23971.41s system 6339% cpu 3718s real
head + objc + obm	211981.34s user 22917.79s system 6341% cpu 3704s real

incremental builds (warmup + 10 builds):

head	1637.96s user 3431.48s system 3127% cpu 162s real
head + objc	1653.72s user 2289.97s system 2556% cpu 154s real
head + objc + obm	1767.50s user 1720.98s system 2338% cpu 149s real

There is some improvement in both system time and total real time.

Nonetheless, the gap in real time is still not closed. Here is how on CPU time looks like on a fully patched kernel (gathered with pmcstat):
https://people.freebsd.org/~mjg/linux-build/full-oncpu.svg

The same off CPU with a custom probes in turnstile_wait and sleepq_switch, wrapping around the mi_switch call):
https://people.freebsd.org/~mjg/linux-build/full-offcpu2.svg

[EDIT: replaced with a better graph with higher resolution of timestamps]

Data was edited to remove waitpid, rfork waits and similar.

Total off CPU time where the kernel is waiting on itself (so to speak) is 25702098460 nanoseconds (or 25.7 seconds), out of which almost half belongs to pmap pv list (obm). Note there is no information how much of it is overlapped with several threads at the same time. Even then this explains the difference vs Linux (more real time, less CPU time in the period).

That said, I think the patch should go in but the kernel needs a more generic solution to 1) track off cpu time 2) wait in an adaptive manner. The current hand-rolled de facto sleep locks don't have proper instrumentation nor behavior when contended.

I can confirm system time reduction for native builds. I have a long running test of a specific stable/12 snapshot (best real time selected out of 10) and results are pretty stable (with majority of the slowdown is from contending on vm object):

r359444    33.28 real      2552.36 user       329.16 sys
r360431    33.26 real      2551.49 user       344.60 sys
r361622    33.26 real      2542.66 user       323.06 sys
r361993    33.28 real      2568.95 user       332.53 sys
r362643    33.19 real      2563.73 user       312.56 sys
r362828    33.30 real      2553.93 user       322.67 sys
r363523    33.13 real      2558.10 user       324.39 sys

I could not be bothered to re-test with stock r363939 nor r363939 + objc only, so as a rough comparison here is a test against the latest above result vs fully patched head kernel Adding objc used to reduce system time to about 300 seconds.

r363523	33.13 real 2558.10 user 324.39 sys
r363939+objc+obm	32.87 real 2587.73 user 231.70 sys

Full breakdown for interested:
head:

warmup:
      335.91 real     19242.15 user      2210.51 sys
run:
      337.94 real     19247.93 user      2231.28 sys
      338.00 real     19244.56 user      2226.73 sys
      337.97 real     19246.36 user      2225.37 sys
      338.18 real     19242.30 user      2236.71 sys
      337.49 real     19239.67 user      2236.14 sys
      337.75 real     19243.28 user      2235.98 sys
      338.10 real     19242.37 user      2231.36 sys
      336.84 real     19243.65 user      2231.99 sys
      338.28 real     19247.89 user      2234.61 sys
      337.99 real     19246.62 user      2235.55 sys

head + objc:

warmup:
      337.26 real     19221.03 user      2220.93 sys
run:
      337.56 real     19252.80 user      2169.27 sys
      338.08 real     19248.43 user      2171.01 sys
      338.07 real     19253.96 user      2175.03 sys
      337.63 real     19252.79 user      2171.68 sys
      338.44 real     19250.57 user      2175.79 sys
      338.44 real     19255.44 user      2174.68 sys
      338.47 real     19257.38 user      2182.76 sys
      337.53 real     19228.53 user      2178.06 sys
      338.27 real     19252.80 user      2175.39 sys
      338.33 real     19255.45 user      2176.72 sys

head + objc + obm:

warmup:
      335.78 real     19261.35 user      2073.71 sys
run:  
      337.03 real     19267.26 user      2077.54 sys
      335.99 real     19280.36 user      2076.67 sys
      336.41 real     19271.90 user      2082.00 sys
      336.88 real     19256.40 user      2088.40 sys
      336.93 real     19278.11 user      2082.90 sys
      336.66 real     19264.48 user      2092.92 sys
      337.32 real     19273.15 user      2086.42 sys
      336.79 real     19276.21 user      2088.43 sys
      337.05 real     19275.83 user      2082.23 sys
      337.08 real     19276.23 user      2086.50 sys

Profiling added is pretty crap, I'll productize it later:

index 56515f033f5d..0832f15f64ad 100644
--- a/sys/kern/subr_sleepqueue.c
+++ b/sys/kern/subr_sleepqueue.c
@@ -180,6 +180,10 @@ static void        sleepq_timeout(void *arg);
 SDT_PROBE_DECLARE(sched, , , sleep);
 SDT_PROBE_DECLARE(sched, , , wakeup);
 
+SDT_PROVIDER_DEFINE(kernel);
+SDT_PROBE_DEFINE0(kernel, profile, sleepq, block);
+SDT_PROBE_DEFINE0(kernel, profile, sleepq, wakeup);
+
 /*
  * Initialize SLEEPQUEUE_PROFILING specific sysctl nodes.
  * Note that it must happen after sleepinit() has been fully executed, so
@@ -621,8 +625,10 @@ sleepq_switch(const void *wchan, int pri)
        sched_sleep(td, pri);
        thread_lock_set(td, &sc->sc_lock);
        SDT_PROBE0(sched, , , sleep);
+       SDT_PROBE0(kernel, profile, sleepq, block);
        TD_SET_SLEEPING(td);
        mi_switch(SW_VOL | SWT_SLEEPQ);
+       SDT_PROBE0(kernel, profile, sleepq, wakeup);
        KASSERT(TD_IS_RUNNING(td), ("running but not TDS_RUNNING"));
        CTR3(KTR_PROC, "sleepq resume: thread %p (pid %ld, %s)",
            (void *)td, (long)td->td_proc->p_pid, (void *)td->td_name);
diff --git a/sys/kern/subr_turnstile.c b/sys/kern/subr_turnstile.c
index 01d265277ba1..c50392198be9 100644
--- a/sys/kern/subr_turnstile.c
+++ b/sys/kern/subr_turnstile.c
@@ -176,6 +176,10 @@ SDT_PROBE_DEFINE(sched, , , sleep);
 SDT_PROBE_DEFINE2(sched, , , wakeup, "struct thread *", 
     "struct proc *");
 
+SDT_PROVIDER_DECLARE(kernel);
+SDT_PROBE_DEFINE0(kernel, profile, turnstile, block);
+SDT_PROBE_DEFINE0(kernel, profile, turnstile, wakeup);
+
 static inline void
 propagate_unlock_ts(struct turnstile *top, struct turnstile *ts)
 {
@@ -813,10 +817,14 @@ turnstile_wait(struct turnstile *ts, struct thread *owner, int queue)
                    td->td_tid, lock, lock->lo_name);
 
        SDT_PROBE0(sched, , , sleep);
+       SDT_PROBE0(kernel, profile, turnstile, block);
 
        THREAD_LOCKPTR_ASSERT(td, &ts->ts_lock);
        mi_switch(SW_VOL | SWT_TURNSTILE);
 
+       SDT_PROBE0(kernel, profile, turnstile, wakeup);
        if (LOCK_LOG_TEST(lock, 0))
                CTR4(KTR_LOCK, "%s: td %d free from blocked on [%p] %s",
                    __func__, td->td_tid, lock, lock->lo_name);

dtrace:

#pragma D option dynvarsize=32m
  
fbt::turnstile_wait:entry
{
        self->turnstile = args[0];
}       

fbt::turnstile_wait:return
{
        self->turnstile = 0;
}       

kernel:profile:turnstile:block
/(curthread->td_proc->p_flag & 0x4) == 0/
{

        self->mesg = self->turnstile->ts_lockobj->lo_name;
        self->ts = timestamp;
}       

kernel:profile:sleepq:block
/(curthread->td_proc->p_flag & 0x4) == 0/
{
        self->ts = timestamp;
        self->mesg = curthread->td_wmesg;
}       

kernel:profile:*:wakeup
/self->ts/
{
        @[stack(), self->mesg ? stringof(self->mesg) : probefunc] = sum(timestamp - self->ts);    
        self->ts = 0;
        self->mesg = 0;
}       

dtrace:::END
{
        printa("%k\n%s\n%@d\n\n", @);
}

Then the flamegraph one-liner.

The following part in obm_lock_slow has whitespace issues in indent:

v = atomic_load_8(&obm->lk);
if (v == OBM_UNLOCKED) {
        if (atomic_fcmpset_acq_char(&obm->lk, &v, OBM_LOCKED) != 0)
                break;
        lock_delay(&lda);
        continue;
}

ts = turnstile_trywait(lo);
v = atomic_load_8(&obm->lk);
if (v == OBM_UNLOCKED) {
        turnstile_cancel(ts);
        if (atomic_fcmpset_acq_8(&obm->lk, &v, OBM_LOCKED) != 0)
                break;
        lock_delay(&lda);
        continue;
}
if ((v & OBM_CONTESTED) == 0 &&
    atomic_fcmpset_8(&obm->lk, &v, v | OBM_CONTESTED) == 0) {
        turnstile_cancel(ts);
        continue;
}
turnstile_wait(ts, NULL, TS_SHARED_QUEUE);

Fix indent in obm_lock_slow.

Harbormaster completed remote builds in B32824: Diff 75531.Aug 6 2020, 10:20 PM

alc added inline comments.Aug 6 2020, 10:23 PM

sys/amd64/amd64/pmap.c
372–373	Is this actually the correct place for this "if" statement? Suppose that lockp is NULL, because this execution of CHANGE_PV_LIST_LOCK_TO_PHYS() is the first to acquire a lock, and that PHYS_TO_VM_PAGE(pa) returns NULL. `_new_lock != _lockp` will be false, and so we won't enter this block and acquire the lock associated with `pv_fake_page`. Am I missing something?

My comments amount to the following fixups: https://people.freebsd.org/~mjg/obm-tidyup.diff

Whatever profiling or spinning may show up is probably for a different review.

sys/kern/kern_obm.c
85	So the entry point blindly lock_delays even if the lock is contested or fcmpset spuriously failed. Other primitives pass the found value instead, then the main loop can test on it without reading.
105	If this fails you may the lock is taken by someone else which adds an avoidable turnstile trip. Other primitives have a goto label.

I did an experiment with a variant which spins. Here is the result: https://people.freebsd.org/~mjg/out2.svg

This extended real time by over 10 seconds.

As such these locks can be very contested. With stock variant there is no significant real time reduction because there is a lot of time spent off CPU. Modifying the routine to not go off CPU seems to run into another corner case, where threads take many locks one after another and keep fighting for all of them in lockstep. In order to make performance progress (less real time) this bare minimum will need some form of rate limiting for pmap_remove_pages, but I suspect ultimately a very different solution is needed to begin with.

obm fixes from mjg.
Fix fake page pv lock calculation by alc.
Reduce pv list lock scope in pmap_remove_pages().

Harbormaster completed remote builds in B32825: Diff 75534.Aug 7 2020, 12:13 AM

Remove some bits for spinning, unintended leftover.

Submitted by: mjg

Harbormaster completed remote builds in B32826: Diff 75535.Aug 7 2020, 1:03 AM

I reran the patch, there is a little bit more system time and real time, but it's probably random effects.

head	211686.83s user 24536.31s system 6359% cpu 3714s real
head + objc	211729.23s user 23971.41s system 6339% cpu 3718s real
head + objc + obm	211981.34s user 22917.79s system 6341% cpu 3704s real
head + objc +obmv2	212009.92s user 23044.97s system 6342% cpu 3706s real

In short, this is an improvement, but things are lagging behind Linux on this workload and the approach employed here may not be enough.

Full dump:

warmup:
      336.99 real     19257.34 user      2180.58 sys
run:
      336.95 real     19272.24 user      2085.77 sys
      337.26 real     19272.63 user      2084.47 sys
      337.07 real     19276.98 user      2086.53 sys
      336.70 real     19273.83 user      2085.77 sys
      337.27 real     19279.45 user      2082.67 sys
      336.74 real     19281.96 user      2083.34 sys
      337.17 real     19279.00 user      2087.81 sys
      336.46 real     19261.84 user      2092.63 sys
      336.94 real     19274.54 user      2090.42 sys
      336.54 real     19280.04 user      2084.91 sys

I'll give the pv entry reclamation code a careful reading tomorrow.

sys/amd64/amd64/pmap.c
468	Elsewhere there are two spaces after a period.
511–512	This sentence, specifically, the part that says, "... is locked according to ...", doesn't really explain anything to me.
515	"in the process" of what?
554–560	Doesn't this comment describe the next function and not the one that immediately follows?
604	Although it makes no functional difference, I think that this would more correctly be NPTEPG.
2382–2386	"Initialize pv_fake page, which is used to make pv_list locking work for physical addresses not covered by vm_page_array[]. In particular, it is needed by pv_list_lock_object() and CHANGE_PV_LIST_LOCK_TO_PHYS().
4197–4198	"Add the pmap to the global list of pmaps, which is used during pv chunk reclamation. The pmap is ..."

kib marked 3 inline comments as done.Aug 7 2020, 11:50 AM

Editing of comments.

Harbormaster completed remote builds in B32835: Diff 75553.Aug 7 2020, 12:05 PM

Rebase after LA57.

Harbormaster completed remote builds in B33125: Diff 76125.Aug 23 2020, 10:11 PM

I'm trying to finish my final review of this change today.

sys/amd64/amd64/pmap.c
359–375	This could be shortened to: vm_page_t _m; _m = PHYS_TO_VM_PAGE(pa); if (_m == NULL) _m = &pv_fake_page; CHANGE_PV_LIST_LOCK_TO_VM_PAGE(lockp, _m);
511	"indicates whether the PV llst for m is already locked."
523	Elsewhere, e.g., promotion, I've written this as `atop(VM_PAGE_TO_PHYS(m) & PG_FRAME & PDRMASK))` so that there is only one arithmetic operation at run-time.
534	Since this is iterating at 4KB granularity, NPTEPG would seem more appropriate.
544	Since this is iterating at 4KB granularity, NPTEPG would seem more appropriate.
601	Elsewhere, e.g., promotion, I've written this as `atop(VM_PAGE_TO_PHYS(m) & PG_FRAME & PDRMASK))`.
4198	"pmap is" is duplicated.
4848–4851	Why move these given that their use is localized to the code that follows?

kib marked 8 inline comments as done.Aug 30 2020, 9:21 PM

kib added inline comments.

sys/amd64/amd64/pmap.c
4848–4851	I added assert in pmap_init() that uses the constants, and that comes 2K lines before this place. So I moved defines to some logical place earlier.

Handle current batch of notes from Alan.

Harbormaster completed remote builds in B33244: Diff 76382.Aug 30 2020, 9:22 PM

I gave it a spin with poudriere on top of objc branch. Looks like the passed pv pointer was NULL.

fault virtual address   = 0x18
fault code              = supervisor write data, page not present
instruction pointer     = 0x20:0xffffffff808aaf12
stack pointer           = 0x0:0xfffffe03ac2d5030
frame pointer           = 0x0:0xfffffe03ac2d5040
code segment            = base 0x0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 29183 (xsltproc)
trap number             = 12
panic: page fault
cpuid = 0
time = 1598919221
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe03ac2d4ce0
vpanic() at vpanic+0x182/frame 0xfffffe03ac2d4d30
panic() at panic+0x43/frame 0xfffffe03ac2d4d90
trap_fatal() at trap_fatal+0x387/frame 0xfffffe03ac2d4df0
trap_pfault() at trap_pfault+0x4f/frame 0xfffffe03ac2d4e50
trap() at trap+0x277/frame 0xfffffe03ac2d4f60
calltrap() at calltrap+0x8/frame 0xfffffe03ac2d4f60
--- trap 0xc, rip = 0xffffffff808aaf12, rsp = 0xfffffe03ac2d5030, rbp = 0xfffffe03ac2d5040 ---
free_pv_entry() at free_pv_entry+0x42/frame 0xfffffe03ac2d5040
pmap_promote_pde() at pmap_promote_pde+0x56b/frame 0xfffffe03ac2d50b0
pmap_enter() at pmap_enter+0xc7f/frame 0xfffffe03ac2d5190
vm_fault() at vm_fault+0x15d5/frame 0xfffffe03ac2d52b0
vm_fault_trap() at vm_fault_trap+0x6d/frame 0xfffffe03ac2d5300
trap_pfault() at trap_pfault+0x1f8/frame 0xfffffe03ac2d5360
trap() at trap+0x400/frame 0xfffffe03ac2d5470
calltrap() at calltrap+0x8/frame 0xfffffe03ac2d5470
--- trap 0xc, rip = 0x800262370, rsp = 0x7fffffffdf60, rbp = 0x7fffffffdfe0 ---

In D24217#583751, @mjg wrote:
I gave it a spin with poudriere on top of objc branch. Looks like the passed pv pointer was NULL.
fault virtual address   = 0x18
free_pv_entry() at free_pv_entry+0x42/frame 0xfffffe03ac2d5040
pmap_promote_pde() at pmap_promote_pde+0x56b/frame 0xfffffe03ac2d50b0

I assume the kernel was configured without INVARIANTS ?
Can you get some panic on INVARIANTS-enabled kernel (there is a designated check for pv != NULL in pmap_pvh_free()).

After few hours with INVARIANTS I got:

panic: Bad link elm 0xfffff842e7139138 prev->next != elm
cpuid = 90
time = 1598985147
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe067abac0e0
vpanic() at vpanic+0x182/frame 0xfffffe067abac130
panic() at panic+0x43/frame 0xfffffe067abac190
pmap_remove_pages() at pmap_remove_pages+0xb0f/frame 0xfffffe067abac290
vmspace_exit() at vmspace_exit+0xa3/frame 0xfffffe067abac2d0
exit1() at exit1+0x532/frame 0xfffffe067abac340
sys_sys_exit() at sys_sys_exit+0xd/frame 0xfffffe067abac350
amd64_syscall() at amd64_syscall+0x140/frame 0xfffffe067abac470
fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe067abac470
--- syscall (1, FreeBSD ELF64, sys_sys_exit), rip = 0x802183a2a, rsp = 0x7fffffffd918, rbp = 0x7fffffffd930 ---

Regardless of the above I suggest the following: https://people.freebsd.org/~mjg/obm.diff

this adds basic LOCK_PROFILING support to obm, makes it easier to add WITNESS later and converts some inlines into macros so that FILE__/LINE__ point at callers. I find profiling mandatory as these locks *are* contended.

emaste added a subscriber: emaste.Oct 19 2020, 5:57 PM

Rebase, fix conflicts.
Add initial obm lock profiling support (by mjg).

I ran 675 out of 723 stress tests with D24217.83480.diff without observing any problems,

Initially this was a mail, but then I decided to continue discussion in the review.

This mail is some kind of continuation of the attempt I did in D24217
amd64 pmap: fine-grained pv list locking and which effectively stallen
now.

After thinking more about the issue pointed out by mjg, that the
hardest part of the whole pv business is actually
vmspace_exit()/pmap_remove_pages(). The function iterates over all
ptes, relocking corresponding pvs to remove pv for pte from the
pv_list for the page. We have pv per tripple (pmap, page, address).
We collect huge amount of pv's, and then we have huge calculation
ahead to free them all.

But then, do we need such granularity? Really, for the given mapping
of the object, all pv's are essentially same. We can record pv into
object instead of page, that is each object would have a list of
tuples like (vm_map, obj offset, vm map offset, length) to record
pv-like information. [The back-reference to vm_map_entry is not
enough, or at least is inconvinient, see below about object locking].
The list would need to be updated on forking, shadowing, splitting etc.

With object-pv, operations like pmap_remove_all(m) can be implemented
as follows:

m->object gives us the object. Conveniently, the top shadow object is already write-locked AFAIR. Otherwise we lock the m->object, it is fine for order because backing obj lock is after the shadowed object.
we iterate over the tuples recorded in the object
for each tuple, we:
- lock pmap
- there we can look up pte in the pmap page table, if any
- if pte maps our page, pte is invalidated
- unlock pmap, take the next tuple.

Nice things is that pv lists are not needed, pv lists are locked by
the object lock, which is already usually taken around the calls.

When doing pmap_remove_pages(), we iterate over ptes and not over
pages. There, we would need to switch to corresponding object lock as
we discover pte->m->object, but often there is no relocking, since we
typically have long runs of mapped pages from the same object.
Superpages would help there even more.

One concern with this scheme is the increased use of the object locks.
I am not sure how much it is problem, but one idea that could help
with object-pv scheme is that we could share object locks for whole
shadow tree, storing the pointer to the lowest object lock to all
shadows. Then we can eliminate all shadow descend relocking we do in
many places. The obvious downside is that after fork, CoW mappings
would share the object lock even after complete shadowing. Similarly,
same file CoW mappings would also share the lock forever.

I do not think this is a new idea, but since amount of work to
implement it before it can be measured is huge, I want to get peer'
opinions before starting experiments.

Herald added a subscriber: olce. · View Herald TranscriptJul 23 2025, 6:20 AM

I looked into this in the meantime(tm).

Here are some notes before I bugger off.

First, note read-locked fault handling for the backing object has landed.

Second, execve scalability can be trivially benchmarked with http://apollo.backplane.com/DFlyMisc/doexec.c (and yes, this does emphasize actual problems coming up in large building.

For some context here is off cpu tracking for buildkernel: https://people.freebsd.org/~mjg/fg/flix1-buildkernel-offcpu-ec1b18c7353ad20d2d373e58931aeccfccd5599d.svg

pipelk aside, almost all of it is vmpfw. Things shifted here after the vm object problem got sorted out (here is the state prior: https://people.freebsd.org/~mjg/fg/flix1-buildkernel-offcpu.svg)

Or to put it differently, while the pv locks are indeed heavily contended today, this is still artificially *lowered* by page busying taking an exclusive lock and waiting off CPU. Whatever the solution to the pv problem, it can't be accurately evaluated with this bottleneck in place. Last I looked into it straight up shared-busying was not trivially achievable, but short of any significant rework I think a 3rd form of busying ("shared busy but only for fault handling, exclusive against everyone else") would do it.

As for pv itself, I think the best solution is to merely decentralize the state in a bounded capacity (and not per-object). Recognizing the entire thing may need to be walked, the total number of locks to hold for the problem case should be reasonably small. Having a list per every CPU would provide the best scalability of course, but is also not warranted. Outside of a microbenchmark setting this will likely be more than fine with 1 list + lock pair for every 16 cores (or even more). So instead of having a lock per page, like originally proposed in this review, there would be an array of lists + locks. This will be faster because there is a small number of CPUs competing for any given lock in the common case -- with granularity of 16 one can pretend the box is only 16-way etc.

Doing this for literally every super page would be highly wasteful from memory usage pov, but also not needed. The problem primarily concerns things like pages for libc or the loader.

So pmap could behave like it does right now by default and only switch to distributed state for a given super page area if there is sufficient contention detected and keep it like that.

vmpfw means that there are threads trying to fault on the same page. Either the page is not valid, and then the contention is unavoidable, or the page is valid (soft-fault), and then there are two cases: the contending threads faulted at the same virtual address, or not. Do you know which case is the most popular?

Right now xbusy prevents the same address case from being mapped by several threads simultaneously.

Reworking the soft-fault is not too hard IMO. For instance, we might take the page shared-busy and check the validity. When we actually get the busy lock ownership, page should be already valid. If not valid, we try to upgrade to xbusy.

If we get shared busy/valid, we might prevent parallel faults by taking e.g. a rangelock on the vm_map for the faulting page range. We might even take the rangelock in advance always.

In D24217#1175350, @kib wrote:

vmpfw means that there are threads trying to fault on the same page. Either the page is not valid, and then the contention is unavoidable, or the page is valid (soft-fault), and then there are two cases: the contending threads faulted at the same virtual address, or not. Do you know which case is the most popular?

It's the latter. Per my previous comment, almost all of it is on libc and the loader and is trivially reproducible with the linked benchmark. It's all there ready to use, but the kernel heavily and inefficiently serializes itself to do the bookkeeping.

Now that I wrote this I remember another thing to consider (adjacent, unrelated effort in the area): faulting more than one page at a time, at least for anon faults. I had some measurements showing that this almost does not change memory usage in practice, but instead cuts down on page faults.

I also note I have been hogging the few ok zoo machines for way too long. All of it is pretty old now, but flix1 is still very much usable even for this particular work. I'm happy to free up the box provided it will be put into good use. It has some crap set up there for my own tests, but all of that can be rm -rf'ed no problem.

In D24217#1175274, @kib wrote:

Initially this was a mail, but then I decided to continue discussion in the review.

This mail is some kind of continuation of the attempt I did in D24217
amd64 pmap: fine-grained pv list locking and which effectively stallen
now.

After thinking more about the issue pointed out by mjg, that the
hardest part of the whole pv business is actually
vmspace_exit()/pmap_remove_pages(). The function iterates over all
ptes, relocking corresponding pvs to remove pv for pte from the
pv_list for the page. We have pv per tripple (pmap, page, address).
We collect huge amount of pv's, and then we have huge calculation
ahead to free them all.

But then, do we need such granularity? Really, for the given mapping
of the object, all pv's are essentially same. We can record pv into
object instead of page, that is each object would have a list of
tuples like (vm_map, obj offset, vm map offset, length) to record
pv-like information. [The back-reference to vm_map_entry is not
enough, or at least is inconvinient, see below about object locking].
The list would need to be updated on forking, shadowing, splitting etc.

If I understand the scheme correctly, I believe dragonflybsd does almost exactly what you're describing. Each VM object carries a list of glue structures (struct vm_object_mapping) which link mappings of the object back to the vm_map. This replaces both per-page PV lists and the shadow object chain (i.e., each vm_object_mapping points to the backing object, or maybe it points to the backing vm_object_mapping). Moreover, there is a 1-1 correspondence between these structures and vm_map_entries.

With object-pv, operations like pmap_remove_all(m) can be implemented
as follows:

m->object gives us the object. Conveniently, the top shadow object is already write-locked AFAIR. Otherwise we lock the m->object, it is fine for order because backing obj lock is after the shadowed object.

we iterate over the tuples recorded in the object

for each tuple, we:

lock pmap

there we can look up pte in the pmap page table, if any

if pte maps our page, pte is invalidated

unlock pmap, take the next tuple.

Nice things is that pv lists are not needed, pv lists are locked by
the object lock, which is already usually taken around the calls.

One disadvantage is that we may have to examine many more pmaps this way. If 100 processes map libfoo.so but only one of them maps a particular page in the vnode, we have to visit 100 pmaps to pmap_remove_all(that page).

I suspect that your proposal still provides a better tradeoff, though.

When doing pmap_remove_pages(), we iterate over ptes and not over
pages.

I suspect this might be expensive. Perhaps we could start storing a 512-bit bitmap in each PTP indicating which slots are populated. Or, change the use of the per-pmap trie and always store all PTPs in it, then iterate over it when tearing down the pmap.

There, we would need to switch to corresponding object lock as
we discover pte->m->object, but often there is no relocking, since we
typically have long runs of mapped pages from the same object.
Superpages would help there even more.

One concern with this scheme is the increased use of the object locks.
I am not sure how much it is problem, but one idea that could help
with object-pv scheme is that we could share object locks for whole
shadow tree, storing the pointer to the lowest object lock to all
shadows. Then we can eliminate all shadow descend relocking we do in
many places. The obvious downside is that after fork, CoW mappings
would share the object lock even after complete shadowing. Similarly,
same file CoW mappings would also share the lock forever.

I do not think this is a new idea, but since amount of work to
implement it before it can be measured is huge, I want to get peer'
opinions before starting experiments.

Overall I think it makes sense, but I didn't spend any time so far thinking about the locking scheme.

Maintaining the fine-grained PV list is fundamentally expensive and in many workloads it's unclear that it's really necessary.

In D24217#1175889, @markj wrote:

In D24217#1175274, @kib wrote:

If I understand the scheme correctly, I believe dragonflybsd does almost exactly what you're describing. Each VM object carries a list of glue structures (struct vm_object_mapping) which link mappings of the object back to the vm_map. This replaces both per-page PV lists and the shadow object chain (i.e., each vm_object_mapping points to the backing object, or maybe it points to the backing vm_object_mapping). Moreover, there is a 1-1 correspondence between these structures and vm_map_entries.

I was not aware of it, thank you for the pointer.

One disadvantage is that we may have to examine many more pmaps this way. If 100 processes map libfoo.so but only one of them maps a particular page in the vnode, we have to visit 100 pmaps to pmap_remove_all(that page).

Yes, it is a disadvantage for the functions like pmap_remove_all() and pmap_remove_write() that operate on single page. OTOH, functions that operate on bulk would see much less overhead.

I suspect that your proposal still provides a better tradeoff, though.

When doing pmap_remove_pages(), we iterate over ptes and not over
pages.

I suspect this might be expensive. Perhaps we could start storing a 512-bit bitmap in each PTP indicating which slots are populated. Or, change the use of the per-pmap trie and always store all PTPs in it, then iterate over it when tearing down the pmap.

We already have the pctrie for page table page, it is the hardware page table itself. If there is a page in the table, there is a valid pte in it.
Having the bitmap would only optimize scanning the page to find the entries, I am not sure it would give much.

Or, for pmaps that use object-pv, pmap_remove_pages() could become empty.

Overall I think it makes sense, but I didn't spend any time so far thinking about the locking scheme.

Maintaining the fine-grained PV list is fundamentally expensive and in many workloads it's unclear that it's really necessary.

Exactly.

Still, I want to finish D51474 first.

In D24217#1177384, @kib wrote:

In D24217#1175889, @markj wrote:

In D24217#1175274, @kib wrote:

If I understand the scheme correctly, I believe dragonflybsd does almost exactly what you're describing. Each VM object carries a list of glue structures (struct vm_object_mapping) which link mappings of the object back to the vm_map. This replaces both per-page PV lists and the shadow object chain (i.e., each vm_object_mapping points to the backing object, or maybe it points to the backing vm_object_mapping). Moreover, there is a 1-1 correspondence between these structures and vm_map_entries.

I was not aware of it, thank you for the pointer.

One disadvantage is that we may have to examine many more pmaps this way. If 100 processes map libfoo.so but only one of them maps a particular page in the vnode, we have to visit 100 pmaps to pmap_remove_all(that page).

Yes, it is a disadvantage for the functions like pmap_remove_all() and pmap_remove_write() that operate on single page. OTOH, functions that operate on bulk would see much less overhead.

Or, maybe we could have an adaptive scheme where individual pages maintain a PV list until it becomes "too long", at which point we fall back the object mapping list. That is, the first few mappings of a page create PV entries, and past a certain threshold we flag the page and stop adding new ones. When calling pmap_remove_all() on a flagged page, the full object list needs to be traversed. Pages in libc.so etc. which are frequently mapped would not use PV lists, but rarely mapped pages would, so they can be quickly reclaimed.

In general we should try to avoid scenarios where the page daemon has to work very hard to free a single clean page.

I suspect that your proposal still provides a better tradeoff, though.

When doing pmap_remove_pages(), we iterate over ptes and not over
pages.

I suspect this might be expensive. Perhaps we could start storing a 512-bit bitmap in each PTP indicating which slots are populated. Or, change the use of the per-pmap trie and always store all PTPs in it, then iterate over it when tearing down the pmap.

We already have the pctrie for page table page, it is the hardware page table itself. If there is a page in the table, there is a valid pte in it.
Having the bitmap would only optimize scanning the page to find the entries, I am not sure it would give much.

Well, scanning sparsely populated PTPs will require many memory reads. The overhead would need to be benchmarked to see if some additional data structure is needed. But, at least this piece can be implemented on its own.

amd64 pmap: fine-grained pv list locking
Needs ReviewPublic
Actions

Details

Diff Detail

Event Timeline

Revision Contents
Changeset List

Diff 83480

sys/amd64/amd64/pmap.c

sys/amd64/include/pmap.h

sys/conf/files

sys/kern/kern_obm.c

sys/sys/_obm.h

sys/sys/obm.h

amd64 pmap: fine-grained pv list lockingNeeds ReviewPublicActions

Details

Diff Detail

Event Timeline

Revision ContentsChangeset List

Diff 83480

sys/amd64/amd64/pmap.c

sys/amd64/include/pmap.h

sys/conf/files

sys/kern/kern_obm.c

sys/sys/_obm.h

sys/sys/obm.h

amd64 pmap: fine-grained pv list locking
Needs ReviewPublic
Actions

Revision Contents
Changeset List