Details

Reviewers

alc
markj
mjg

Commits

rGa38483fa2b3a: vm_fault: assert that first_m is xbusy
rG149674bbac58: vm_fault: try to only share-busy page for soft faults
rG3f05bbdbd80f: vm_fault: add helper vm_fault_can_cow_rename()
rG5bd4c04a4e7f: vm_fault: add vm_fault_might_be_cow() helper
rGc6b79f587f27: vm_fault_busy_sleep(): pass explicit allocflags for vm_page_busy_sleep()
rG0854b4f569e1: vm/vm_fault.c: cleanup includes

Summary

vm_fault: add helper vm_fault_can_cow_rename()


vm_fault: try to only share-busy page for soft faults

If the fault found a vaild page that is definitely not going to be
renamed for Cow, try to only sbusy the page. We do not need to validate
the page, and parallel faults on the same address are excluded by the
map' fltlock.

Diff Detail

Repository

rG FreeBSD src repository

Lint

Lint Not Applicable

Unit

Tests Not Applicable

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

kib added inline comments.Jul 28 2025, 8:09 PM

sys/vm/vm_fault.c
1849	VM_FAULT_WIRE must occur only once from vm_fault() call from vm_map_wire(). And from my re-reading of the code, vm_fault() is careful to not wire the page when spurious call to vm_fault() occur on the wired entry. I also failed to see other reasons for disallowing faults on the same address. At worst we would do pmap_enter()s in parallel which should be handled by pmap. So I will drop the rangelock, after all.

Drop rangelock.
Fix condition in vm_fault_cow():

explicitly exclude FAULT_SOFT_MSHAREDBUSY from allowed state for rename,
use rename_cow for bool name.

kib edited the summary of this revision. (Show Details)Jul 28 2025, 8:13 PM

markj added inline comments.Aug 3 2025, 3:57 PM

sys/vm/vm_fault.c
166	Rather than adding a new result and passing that around, can we instead add a new `page_sbusied` flag to the fault state? Or, can we just use `vm_page_xbusied(fs.m)` to check for this case instead of adding new state?
1094	I think this explanation doesn't justify the code. The original problem scenario is that object O has a shadow object O', and O' is mapped by multiple processes. Then, suppose m_cow is mapped read-only into all of those processes. Suppose some process triggers a copy-on-write fault, causing m_cow to be copied into the shadow object. We must shoot down all of the mappings of m_cow, because it is fully shadowed and may be freed. I cannot see how that is prevented in the sbusy case. I also do not see how "the sbusy case is only possible when the shadow object is read-mapped" is true. Here we are handling a COW fault, so clearly the shadow object is mapped RW.
1704	The comment about the object lock is false in the sbusy case.
1711	`object_locked` would be a clearer name for this flag IMO.

kib marked 3 inline comments as done.Aug 3 2025, 7:11 PM

kib added inline comments.

sys/vm/vm_fault.c
1094	If the cow fault is triggerred and the page is copied, it must be write-fault or cow-fault. Both types are checked and excluded in the condition in vm_fault_object().

Remove SOFT_FAULT_MSHAREDBUSY, use vm_page_xbusied(fs.m)
Update comment.
unlock_object->object_locked.

Add ONEMAPPING for first_object as an alternative condition for read-fault. Either is enough to avoid the problem with shared writeable CoW.
Remove MSHAREDBUSY from the comment.

markj added inline comments.Aug 3 2025, 11:41 PM

sys/vm/vm_fault.c
1094	Sorry, I don't really follow. They're only excluded in the case where `fs->object == fs->first_object`, from my reading, but this is not sufficient. There is a test case in tests/sys/vm/shared_shadow_inval_test.c which exercises this case and explains the problem scenario a bit more. I do not see any problems when testing it with the patch applied, but I will try to spend more time to construct a problematic test case or discover that I'm wrong about something.
1507	I believe the check here is racy, since `fs->first_object` is not locked. And the parentheses look wrong.
1517	Since the condition was checked once already.
1519	So in this case we will call vm_fault_busy_sleep(), but why? Why not unbusy the page and fall through to the `vm_page_tryxbusy()` below?

kib marked 5 inline comments as done.Aug 4 2025, 12:38 AM

kib added inline comments.

sys/vm/vm_fault.c
1094	Ok, I extracted the checkers into two dedicated helpers, hope this way it would be clearer. I verify that: page is fully valid fault is for read OR fault is for one-mapped top object conditions for renaming do not occur Then I sbusy fs->m, if the failt if for read, it is enough. If the fault is for write, I additionally rlock top-level shadow, and recheck onemapping condition. The first_object is unlocked together with the object.

Do what was explained in the reply to comment:
take rlock on first_object and check onemapping condition after (optimistically checked before).

markj added inline comments.Aug 5 2025, 12:11 AM

sys/vm/vm_fault.c
1737	Suppose two threads fault on the same virtual page, one is a read fault, one is a write fault. Suppose the page is COW and the source page is valid. After this change, what stops them from racing? In particular, how do we ensure that the new top-level page is entered into the pmap?

kib added inline comments.Aug 5 2025, 12:34 AM

sys/vm/vm_fault.c
1737	Well, the read fault is impossible, because the fault type is defined by the mapping permissions, not just the hardware faulting mode. So if it is the same virtual page, then both faults would proceed this as a write fault.

markj added inline comments.Aug 5 2025, 1:03 AM

sys/vm/vm_fault.c
1737	Sorry, I don't understand. Suppose the MAP_ENTRY_NEEDS_COPY is clear and there is already a shadow object O with backing object O'. Suppose there is some resident page m in O' that is not mapped into the process. Suppose there is no resident page at that address in O. Let the write-faulting thread be T1, and the read-faulting thread be T2. Then: T2: looks up m, and sbusies it, it does not call vm_fault_cow() T1: allocates a page in O to shadow m, then sbusies m to copy it into the new page calls pmap_enter(new page) T2: calls pmap_enter(m) Now, subsequent reads of the page will not show the modification done by T1.

markj added inline comments.Aug 5 2025, 1:20 AM

sys/vm/vm_fault.c
1737	kib pointed out on IRC that these threads are serialized by the xbusying the top-level page. I forgot that we hold the xbusy lock for the entire duration of the fault, even for read faults.

Add a comment about xbusy state for fs.first_m.

markj added inline comments.Aug 5 2025, 1:18 PM

sys/vm/vm_fault.c
1711	This means that different processes faulting on the same top-level object will be serialized when faulting on the same page. So, if the top-level object is shared, i.e., OBJ_ONEMAPPING is clear, do we really need the extra synchronization from the fs.first_object read lock? It looks like we can just rely on the existing "racy" check in vm_fault_cow() after all. We just change the justification a little bit: instead of relying on the xbusy of m_cow to provide serialization, we rely on the xbusy of fs.first_m to ensure that a second process will not create a read-only pmap entry for m_cow.

markj added inline comments.Aug 5 2025, 2:27 PM

sys/vm/vm_fault.c
1809	Can we also assert something like, `fs.first_m == fs.m \|\| vm_page_xbusied(fs.first_m)` here?

Remove extension of the locked region for first_object.
Add assert for xbusy of first_m.

Switch assert from checking first_m != m to first_object != object. first_m is NULL when there is no shadow chain.

markj added inline comments.Aug 6 2025, 1:25 PM

sys/vm/vm_fault.c
1509	Why disallow shared busy after the first failed attempt?

kib marked an inline comment as done.Aug 6 2025, 9:57 PM

kib added inline comments.

sys/vm/vm_fault.c
1509	I want(ed) to have the code path to fall back to pristine old way if sbusy failed. I think we can remove can_sbusy indeed. One note that vm_fault_busy_sleep() should be more flexible, I will add allocflags argument.

Eliminate can_sbusy.
Only sleep for sbusy when needed, in vm_fault_busy_sleep().
While there, clean up includes.

markj added inline comments.Aug 12 2025, 2:53 PM

sys/vm/vm_fault.c
1711	The xbusy of `first_m` also serializes page faults through other mappings of the same top-level object. So, in the end maybe we do not need the object lock to synchronize the OBJ_ONEMAPPING check: Suppose OBJ_ONEMAPPING is set in `first_object`. Then, `m_cow` is only mapped in the current pmap, and no others. `first_m` is xbusied and shadows `m_cow`. Then, if OBJ_ONEMAPPING is cleared immediately after, we know that the other map which maps `first_object` will not call `pmap_enter(m_cow)`, since that requires `first_m` to be xbusied. Suppose OBJ_ONEMAPPING is clear in `first_object`. Then we are potentially shadowing `m_cow`, so we should shoot down all of its mappings with pmap_remove_all(). In other words, I think the argument in the comment "The flag check is racy, but this is tolerable..." can be changed to apply to the sbusy case too. Instead of using the xbusy state of `m_cow` to make the argument, we can use the xbusy state of `first_m` to argue that the racy check is safe. Then we do not need any special handling for the sbusy case. Am I missing something?

kib marked an inline comment as done.Aug 13 2025, 2:42 AM

kib added inline comments.

sys/vm/vm_fault.c
1711	By the special handling for the sbusy case, do you mean my extent of the fs->object lock after the vm_fault_cow() is done? I believe yes, because you said in the second sentence that 'we do not need the object lock'. The fs->object locking purpose was not to handle the OBJ_ONEMAPPING complication. There are checks in vm_fault_object() that should prevent vm_fault_cow() from copying the page from fs->object to fs->first_object, done by vm_fault_can_cow_rename(). I need a way to ensure that the invariants are kept true between the check in vm_fault_object() and the action in vm_fault_cow(). I believe that object collapse might break this. Even if it cannot now, still the checked conditions (refcount != 1 or shadow_count != 1) are delicate enough that collapse implementation should be allowed enough freedom so we do not need to rely on its internals to keep that true.

markj added inline comments.Aug 13 2025, 9:11 PM

sys/vm/vm_fault.c
1711	By the special handling for the sbusy case, do you mean my extent of the fs->object lock after the vm_fault_cow() is done? I believe yes, because you said in the second sentence that 'we do not need the object lock'. Right. The fs->object locking purpose was not to handle the OBJ_ONEMAPPING complication. Ok, but really my point was that there is no OBJ_ONEMAPPING complication, or so I believe. That is, I think the sbusy case does not need to check `(fs.first_object->flags & OBJ_ONEMAPPING) != 0`. Or, I do not understand the justification for checking it. There are checks in vm_fault_object() that should prevent vm_fault_cow() from copying the page from fs->object to fs->first_object, done by vm_fault_can_cow_rename(). I need a way to ensure that the invariants are kept true between the check in vm_fault_object() and the action in vm_fault_cow(). Why not perform the checks racily, and disallow renaming if `!vm_page_xbusied(fs->m)`? I believe that object collapse might break this. Even if it cannot now, still the checked conditions (refcount != 1 or shadow_count != 1) are delicate enough that collapse implementation should be allowed enough freedom so we do not need to rely on its internals to keep that true. Even today, the ref_count == 1 and shadow_count == 1 checks are done without the object lock held. I can't remember offhand why that is safe.

kib marked an inline comment as done.Aug 14 2025, 3:45 AM

kib added inline comments.

sys/vm/vm_fault.c
1711	The fs->object locking purpose was not to handle the OBJ_ONEMAPPING complication. Ok, but really my point was that there is no OBJ_ONEMAPPING complication, or so I believe. That is, I think the sbusy case does not need to check `(fs.first_object->flags & OBJ_ONEMAPPING) != 0`. Or, I do not understand the justification for checking it. Ok, I reduced the vm_fault_fo_is_onemapping() to vm_fault_might_be_cow(). Despite trivial, it is useful IMO, all places where I replaced direct object != first_object checks become easier to understand. There are checks in vm_fault_object() that should prevent vm_fault_cow() from copying the page from fs->object to fs->first_object, done by vm_fault_can_cow_rename(). I need a way to ensure that the invariants are kept true between the check in vm_fault_object() and the action in vm_fault_cow(). Why not perform the checks racily, and disallow renaming if `!vm_page_xbusied(fs->m)`? Ok. I believe that object collapse might break this. Even if it cannot now, still the checked conditions (refcount != 1 or shadow_count != 1) are delicate enough that collapse implementation should be allowed enough freedom so we do not need to rely on its internals to keep that true. Even today, the ref_count == 1 and shadow_count == 1 checks are done without the object lock held. I can't remember offhand why that is safe. I added the re-check after the locks are acquired.

Remove the OBJ_ONEMAPPING blocker from the sbusy path.
Re-check CoW rename possibility after all object's locks are acquired.

I think the change is ok, my notes are mostly about code comments.

sys/vm/vm_fault.c
1095	I think this observation can be used to simplify the whole comment. I can submit a patch for that after this work is landed.
1489
1490
1505
1515	Maybe, "Now make sure that racily checked conditions are still valid"
1516	I think this comment about ONEMAPPING is stale?
1520	Shouldn't the predict_true apply to the whole predicate?
1527	I wonder if it's worth maintaining a counter here?

This revision is now accepted and ready to land.Aug 29 2025, 3:39 PM

kib marked 8 inline comments as done.Aug 29 2025, 5:03 PM

kib added inline comments.

sys/vm/vm_fault.c
1527	Which counter? The count of times we tried to use sbusy but after relock found that conditions are not true? I do not see it much interesting. I had a debug part of this patch that counted successful sbusy cases, and then even more fine-grained list of conditions for each reason that sbusy cannot be used. I do not think it is very useful for either production or in-field debugging.

Comment updates.

Mark, do you want the 'In collaboration with: markj' tag for this change?

This revision now requires review to proceed.Aug 29 2025, 5:04 PM

I have been following this, but have not given it a careful reading. I should be able to do so over the weekend.

In D51474#1193369, @kib wrote:

Comment updates.

Mark, do you want the 'In collaboration with: markj' tag for this change?

IMO it's not necessary, I just reviewed the change.

sys/vm/vm_fault.c
1521
1527	Yes. It should be cheap to maintain and maybe it will reveal some unexpected phenomenon. But ok, it's probably not useful.

This revision is now accepted and ready to land.Aug 29 2025, 5:46 PM

kib added a subscriber: pho.Sep 11 2025, 10:15 AM

fwiw I can confirm offcpu time is gone, instead almost all of it spent contending on pmap pv locks [i.e. execs still don't scale, but the bottleneck has moved. i don't have the old numbers around for comparison]

flamegraph:
https://people.freebsd.org/~mjg/fg/D51474-vs-shared-exec-104.svg

collected with:

$ rm /tmp/out.kern_stacks ; dtrace -x stackframes=100 -n 'profile-997 { @[curthread->td_wantedlock != NULL ? stringof(curthread->td_wantedlock->lo_name) : stringof("\n"), stack()] = count(); }' -o /tmp/out.kern_stacks

while running http://apollo.backplane.com/DFlyMisc/doexec.c

$ cc -O2 -o shared-doexec doexec.c
$ ./shared-doexec $(nproc)

Peter tested the change. I intend to commit it shortly, at the beginning of the next week.

I am happy to see how much simpler this has gotten, particularly the elimination of the range lock.

sys/vm/vm_fault.c
1490
1491	Historically, we have written "COW", not "CoW". I only found one instance of the latter under vm/.
1500
1501
1502
1709	While it would be highly unusual, the top-level object could be read-write-shared between address spaces, so I don't see the reason for including "in our vm_map" here.