Page MenuHomeFreeBSD

vm_fault: try to only sbusy valid page that is not writeable
ClosedPublic

Authored by kib on Jul 23 2025, 12:03 PM.
Tags
None
Referenced Files
Unknown Object (File)
Sun, Oct 12, 6:16 AM
Unknown Object (File)
Sat, Oct 11, 3:41 AM
Unknown Object (File)
Sat, Oct 11, 3:41 AM
Unknown Object (File)
Fri, Oct 10, 11:54 PM
Unknown Object (File)
Fri, Oct 10, 8:47 PM
Unknown Object (File)
Fri, Oct 10, 8:44 PM
Unknown Object (File)
Fri, Oct 10, 8:42 PM
Unknown Object (File)
Fri, Oct 10, 3:16 PM
Subscribers

Details

Summary
vm_fault: add helper vm_fault_can_cow_rename()


vm_fault: try to only share-busy page for soft faults

If the fault found a vaild page that is definitely not going to be
renamed for Cow, try to only sbusy the page. We do not need to validate
the page, and parallel faults on the same address are excluded by the
map' fltlock.

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
sys/vm/vm_fault.c
1849

VM_FAULT_WIRE must occur only once from vm_fault() call from vm_map_wire(). And from my re-reading of the code, vm_fault() is careful to not wire the page when spurious call to vm_fault() occur on the wired entry.

I also failed to see other reasons for disallowing faults on the same address. At worst we would do pmap_enter()s in parallel which should be handled by pmap.

So I will drop the rangelock, after all.

kib marked 4 inline comments as done.

Drop rangelock.
Fix condition in vm_fault_cow():

  • explicitly exclude FAULT_SOFT_MSHAREDBUSY from allowed state for rename,
  • use rename_cow for bool name.
sys/vm/vm_fault.c
166

Rather than adding a new result and passing that around, can we instead add a new page_sbusied flag to the fault state? Or, can we just use vm_page_xbusied(fs.m) to check for this case instead of adding new state?

1094

I think this explanation doesn't justify the code. The original problem scenario is that object O has a shadow object O', and O' is mapped by multiple processes.

Then, suppose m_cow is mapped read-only into all of those processes. Suppose some process triggers a copy-on-write fault, causing m_cow to be copied into the shadow object. We must shoot down all of the mappings of m_cow, because it is fully shadowed and may be freed.

I cannot see how that is prevented in the sbusy case. I also do not see how "the sbusy case is only possible when the shadow object is read-mapped" is true. Here we are handling a COW fault, so clearly the shadow object is mapped RW.

1704

The comment about the object lock is false in the sbusy case.

1711

object_locked would be a clearer name for this flag IMO.

kib marked 3 inline comments as done.Aug 3 2025, 7:11 PM
kib added inline comments.
sys/vm/vm_fault.c
1094

If the cow fault is triggerred and the page is copied, it must be write-fault or cow-fault. Both types are checked and excluded in the condition in vm_fault_object().

Remove SOFT_FAULT_MSHAREDBUSY, use vm_page_xbusied(fs.m)
Update comment.
unlock_object->object_locked.

Add ONEMAPPING for first_object as an alternative condition for read-fault. Either is enough to avoid the problem with shared writeable CoW.
Remove MSHAREDBUSY from the comment.

sys/vm/vm_fault.c
1094

Sorry, I don't really follow. They're only excluded in the case where fs->object == fs->first_object, from my reading, but this is not sufficient.

There is a test case in tests/sys/vm/shared_shadow_inval_test.c which exercises this case and explains the problem scenario a bit more. I do not see any problems when testing it with the patch applied, but I will try to spend more time to construct a problematic test case or discover that I'm wrong about something.

1507

I believe the check here is racy, since fs->first_object is not locked. And the parentheses look wrong.

1517

Since the condition was checked once already.

1519

So in this case we will call vm_fault_busy_sleep(), but why? Why not unbusy the page and fall through to the vm_page_tryxbusy() below?

kib marked 5 inline comments as done.Aug 4 2025, 12:38 AM
kib added inline comments.
sys/vm/vm_fault.c
1094

Ok, I extracted the checkers into two dedicated helpers, hope this way it would be clearer.

I verify that:

  1. page is fully valid
  2. fault is for read OR fault is for one-mapped top object
  3. conditions for renaming do not occur

Then I sbusy fs->m, if the failt if for read, it is enough.
If the fault is for write, I additionally rlock top-level shadow, and recheck onemapping condition. The first_object is unlocked together with the object.

kib marked an inline comment as done.

Do what was explained in the reply to comment:
take rlock on first_object and check onemapping condition after (optimistically checked before).

sys/vm/vm_fault.c
1737

Suppose two threads fault on the same virtual page, one is a read fault, one is a write fault. Suppose the page is COW and the source page is valid. After this change, what stops them from racing? In particular, how do we ensure that the new top-level page is entered into the pmap?

sys/vm/vm_fault.c
1737

Well, the read fault is impossible, because the fault type is defined by the mapping permissions, not just the hardware faulting mode. So if it is the same virtual page, then both faults would proceed this as a write fault.

sys/vm/vm_fault.c
1737

Sorry, I don't understand. Suppose the MAP_ENTRY_NEEDS_COPY is clear and there is already a shadow object O with backing object O'. Suppose there is some resident page m in O' that is not mapped into the process. Suppose there is no resident page at that address in O.

Let the write-faulting thread be T1, and the read-faulting thread be T2. Then:

T2:

  • looks up m, and sbusies it, it does not call vm_fault_cow()

T1:

  • allocates a page in O to shadow m, then sbusies m to copy it into the new page
  • calls pmap_enter(new page)

T2:

  • calls pmap_enter(m)

Now, subsequent reads of the page will not show the modification done by T1.

sys/vm/vm_fault.c
1737

kib pointed out on IRC that these threads are serialized by the xbusying the top-level page. I forgot that we hold the xbusy lock for the entire duration of the fault, even for read faults.

Add a comment about xbusy state for fs.first_m.

sys/vm/vm_fault.c
1711

This means that different processes faulting on the same top-level object will be serialized when faulting on the same page. So, if the top-level object is shared, i.e., OBJ_ONEMAPPING is clear, do we really need the extra synchronization from the fs.first_object read lock? It looks like we can just rely on the existing "racy" check in vm_fault_cow() after all. We just change the justification a little bit: instead of relying on the xbusy of m_cow to provide serialization, we rely on the xbusy of fs.first_m to ensure that a second process will not create a read-only pmap entry for m_cow.

sys/vm/vm_fault.c
1809

Can we also assert something like, fs.first_m == fs.m || vm_page_xbusied(fs.first_m) here?

kib marked 5 inline comments as done.

Remove extension of the locked region for first_object.
Add assert for xbusy of first_m.

Switch assert from checking first_m != m to first_object != object. first_m is NULL when there is no shadow chain.

sys/vm/vm_fault.c
1509

Why disallow shared busy after the first failed attempt?

kib marked an inline comment as done.Aug 6 2025, 9:57 PM
kib added inline comments.
sys/vm/vm_fault.c
1509

I want(ed) to have the code path to fall back to pristine old way if sbusy failed.

I think we can remove can_sbusy indeed. One note that vm_fault_busy_sleep() should be more flexible, I will add allocflags argument.

kib marked an inline comment as done.

Eliminate can_sbusy.
Only sleep for sbusy when needed, in vm_fault_busy_sleep().
While there, clean up includes.

sys/vm/vm_fault.c
1711

The xbusy of first_m also serializes page faults through other mappings of the same top-level object. So, in the end maybe we do not need the object lock to synchronize the OBJ_ONEMAPPING check:

  • Suppose OBJ_ONEMAPPING is set in first_object. Then, m_cow is only mapped in the current pmap, and no others. first_m is xbusied and shadows m_cow. Then, if OBJ_ONEMAPPING is cleared immediately after, we know that the other map which maps first_object will not call pmap_enter(m_cow), since that requires first_m to be xbusied.
  • Suppose OBJ_ONEMAPPING is clear in first_object. Then we are potentially shadowing m_cow, so we should shoot down all of its mappings with pmap_remove_all().

In other words, I think the argument in the comment "The flag check is racy, but this is tolerable..." can be changed to apply to the sbusy case too. Instead of using the xbusy state of m_cow to make the argument, we can use the xbusy state of first_m to argue that the racy check is safe. Then we do not need any special handling for the sbusy case. Am I missing something?

kib marked an inline comment as done.Aug 13 2025, 2:42 AM
kib added inline comments.
sys/vm/vm_fault.c
1711

By the special handling for the sbusy case, do you mean my extent of the fs->object lock after the vm_fault_cow() is done? I believe yes, because you said in the second sentence that 'we do not need the object lock'.

The fs->object locking purpose was not to handle the OBJ_ONEMAPPING complication. There are checks in vm_fault_object() that should prevent vm_fault_cow() from copying the page from fs->object to fs->first_object, done by vm_fault_can_cow_rename(). I need a way to ensure that the invariants are kept true between the check in vm_fault_object() and the action in vm_fault_cow().

I believe that object collapse might break this. Even if it cannot now, still the checked conditions (refcount != 1 or shadow_count != 1) are delicate enough that collapse implementation should be allowed enough freedom so we do not need to rely on its internals to keep that true.

sys/vm/vm_fault.c
1711

By the special handling for the sbusy case, do you mean my extent of the fs->object lock after the vm_fault_cow() is done? I believe yes, because you said in the second sentence that 'we do not need the object lock'.

Right.

The fs->object locking purpose was not to handle the OBJ_ONEMAPPING complication.

Ok, but really my point was that there is no OBJ_ONEMAPPING complication, or so I believe. That is, I think the sbusy case does not need to check (fs.first_object->flags & OBJ_ONEMAPPING) != 0. Or, I do not understand the justification for checking it.

There are checks in vm_fault_object() that should prevent vm_fault_cow() from copying the page from fs->object to fs->first_object, done by vm_fault_can_cow_rename(). I need a way to ensure that the invariants are kept true between the check in vm_fault_object() and the action in vm_fault_cow().

Why not perform the checks racily, and disallow renaming if !vm_page_xbusied(fs->m)?

I believe that object collapse might break this. Even if it cannot now, still the checked conditions (refcount != 1 or shadow_count != 1) are delicate enough that collapse implementation should be allowed enough freedom so we do not need to rely on its internals to keep that true.

Even today, the ref_count == 1 and shadow_count == 1 checks are done without the object lock held. I can't remember offhand why that is safe.

kib marked an inline comment as done.Aug 14 2025, 3:45 AM
kib added inline comments.
sys/vm/vm_fault.c
1711

The fs->object locking purpose was not to handle the OBJ_ONEMAPPING complication.

Ok, but really my point was that there is no OBJ_ONEMAPPING complication, or so I believe. That is, I think the sbusy case does not need to check (fs.first_object->flags & OBJ_ONEMAPPING) != 0. Or, I do not understand the justification for checking it.

Ok, I reduced the vm_fault_fo_is_onemapping() to vm_fault_might_be_cow(). Despite trivial, it is useful IMO, all places where I replaced direct object != first_object checks become easier to understand.

There are checks in vm_fault_object() that should prevent vm_fault_cow() from copying the page from fs->object to fs->first_object, done by vm_fault_can_cow_rename(). I need a way to ensure that the invariants are kept true between the check in vm_fault_object() and the action in vm_fault_cow().

Why not perform the checks racily, and disallow renaming if !vm_page_xbusied(fs->m)?

Ok.

I believe that object collapse might break this. Even if it cannot now, still the checked conditions (refcount != 1 or shadow_count != 1) are delicate enough that collapse implementation should be allowed enough freedom so we do not need to rely on its internals to keep that true.

Even today, the ref_count == 1 and shadow_count == 1 checks are done without the object lock held. I can't remember offhand why that is safe.

I added the re-check after the locks are acquired.

Remove the OBJ_ONEMAPPING blocker from the sbusy path.
Re-check CoW rename possibility after all object's locks are acquired.

I think the change is ok, my notes are mostly about code comments.

sys/vm/vm_fault.c
1095

I think this observation can be used to simplify the whole comment. I can submit a patch for that after this work is landed.

1489
1490
1505
1515

Maybe, "Now make sure that racily checked conditions are still valid"

1516

I think this comment about ONEMAPPING is stale?

1520

Shouldn't the predict_true apply to the whole predicate?

1527

I wonder if it's worth maintaining a counter here?

This revision is now accepted and ready to land.Aug 29 2025, 3:39 PM
kib marked 8 inline comments as done.Aug 29 2025, 5:03 PM
kib added inline comments.
sys/vm/vm_fault.c
1527

Which counter? The count of times we tried to use sbusy but after relock found that conditions are not true? I do not see it much interesting.

I had a debug part of this patch that counted successful sbusy cases, and then even more fine-grained list of conditions for each reason that sbusy cannot be used. I do not think it is very useful for either production or in-field debugging.

Comment updates.

Mark, do you want the 'In collaboration with: markj' tag for this change?

This revision now requires review to proceed.Aug 29 2025, 5:04 PM

I have been following this, but have not given it a careful reading. I should be able to do so over the weekend.

In D51474#1193369, @kib wrote:

Comment updates.

Mark, do you want the 'In collaboration with: markj' tag for this change?

IMO it's not necessary, I just reviewed the change.

sys/vm/vm_fault.c
1521
1527

Yes. It should be cheap to maintain and maybe it will reveal some unexpected phenomenon. But ok, it's probably not useful.

This revision is now accepted and ready to land.Aug 29 2025, 5:46 PM

fwiw I can confirm offcpu time is gone, instead almost all of it spent contending on pmap pv locks [i.e. execs still don't scale, but the bottleneck has moved. i don't have the old numbers around for comparison]

flamegraph:
https://people.freebsd.org/~mjg/fg/D51474-vs-shared-exec-104.svg

collected with:

$ rm /tmp/out.kern_stacks ; dtrace -x stackframes=100 -n 'profile-997 { @[curthread->td_wantedlock != NULL ? stringof(curthread->td_wantedlock->lo_name) : stringof("\n"), stack()] = count(); }' -o /tmp/out.kern_stacks

while running http://apollo.backplane.com/DFlyMisc/doexec.c

$ cc -O2 -o shared-doexec doexec.c
$ ./shared-doexec $(nproc)

Peter tested the change. I intend to commit it shortly, at the beginning of the next week.

I am happy to see how much simpler this has gotten, particularly the elimination of the range lock.

sys/vm/vm_fault.c
1490
1491

Historically, we have written "COW", not "CoW". I only found one instance of the latter under vm/.

1500
1501
1502
1709

While it would be highly unusual, the top-level object could be read-write-shared between address spaces, so I don't see the reason for including "in our vm_map" here.

kib marked 8 inline comments as done.

Update comments.

This revision now requires review to proceed.Sep 14 2025, 7:15 PM