vm_fault: add helper vm_fault_can_cow_rename() vm_fault: try to only share-busy page for soft faults If the fault found a vaild page that is definitely not going to be renamed for Cow, try to only sbusy the page. We do not need to validate the page, and parallel faults on the same address are excluded by the map' fltlock.
Details
- Reviewers
- alc - markj - mjg 
- Commits
- rGa38483fa2b3a: vm_fault: assert that first_m is xbusy
 rG149674bbac58: vm_fault: try to only share-busy page for soft faults
 rG3f05bbdbd80f: vm_fault: add helper vm_fault_can_cow_rename()
 rG5bd4c04a4e7f: vm_fault: add vm_fault_might_be_cow() helper
 rGc6b79f587f27: vm_fault_busy_sleep(): pass explicit allocflags for vm_page_busy_sleep()
 rG0854b4f569e1: vm/vm_fault.c: cleanup includes
Diff Detail
- Repository
- rG FreeBSD src repository
- Lint
- Lint Skipped 
- Unit
- Tests Skipped 
Event Timeline
| sys/vm/vm_fault.c | ||
|---|---|---|
| 1926 | VM_FAULT_WIRE must occur only once from vm_fault() call from vm_map_wire(). And from my re-reading of the code, vm_fault() is careful to not wire the page when spurious call to vm_fault() occur on the wired entry. I also failed to see other reasons for disallowing faults on the same address. At worst we would do pmap_enter()s in parallel which should be handled by pmap. So I will drop the rangelock, after all. | |
Drop rangelock.
Fix condition in vm_fault_cow():
- explicitly exclude FAULT_SOFT_MSHAREDBUSY from allowed state for rename,
- use rename_cow for bool name.
| sys/vm/vm_fault.c | ||
|---|---|---|
| 169 | Rather than adding a new result and passing that around, can we instead add a new page_sbusied flag to the fault state? Or, can we just use vm_page_xbusied(fs.m) to check for this case instead of adding new state? | |
| 1118 | I think this explanation doesn't justify the code. The original problem scenario is that object O has a shadow object O', and O' is mapped by multiple processes. Then, suppose m_cow is mapped read-only into all of those processes. Suppose some process triggers a copy-on-write fault, causing m_cow to be copied into the shadow object. We must shoot down all of the mappings of m_cow, because it is fully shadowed and may be freed. I cannot see how that is prevented in the sbusy case. I also do not see how "the sbusy case is only possible when the shadow object is read-mapped" is true. Here we are handling a COW fault, so clearly the shadow object is mapped RW. | |
| 1767 | The comment about the object lock is false in the sbusy case. | |
| 1774 | object_locked would be a clearer name for this flag IMO. | |
| sys/vm/vm_fault.c | ||
|---|---|---|
| 1118 | If the cow fault is triggerred and the page is copied, it must be write-fault or cow-fault. Both types are checked and excluded in the condition in vm_fault_object(). | |
Remove SOFT_FAULT_MSHAREDBUSY, use vm_page_xbusied(fs.m)
Update comment.
unlock_object->object_locked.
Add ONEMAPPING for first_object as an alternative condition for read-fault.  Either is enough to avoid the problem with shared writeable CoW.
Remove MSHAREDBUSY from the comment.
| sys/vm/vm_fault.c | ||
|---|---|---|
| 1118 | Sorry, I don't really follow. They're only excluded in the case where fs->object == fs->first_object, from my reading, but this is not sufficient. There is a test case in tests/sys/vm/shared_shadow_inval_test.c which exercises this case and explains the problem scenario a bit more. I do not see any problems when testing it with the patch applied, but I will try to spend more time to construct a problematic test case or discover that I'm wrong about something. | |
| 1540 | I believe the check here is racy, since fs->first_object is not locked. And the parentheses look wrong. | |
| 1550 | Since the condition was checked once already. | |
| 1552 | So in this case we will call vm_fault_busy_sleep(), but why? Why not unbusy the page and fall through to the vm_page_tryxbusy() below? | |
| sys/vm/vm_fault.c | ||
|---|---|---|
| 1118 | Ok, I extracted the checkers into two dedicated helpers, hope this way it would be clearer. I verify that: 
 Then I sbusy fs->m, if the failt if for read, it is enough. | |
Do what was explained in the reply to comment:
take rlock on first_object and check onemapping condition after (optimistically checked before).
| sys/vm/vm_fault.c | ||
|---|---|---|
| 1809 | Suppose two threads fault on the same virtual page, one is a read fault, one is a write fault. Suppose the page is COW and the source page is valid. After this change, what stops them from racing? In particular, how do we ensure that the new top-level page is entered into the pmap? | |
| sys/vm/vm_fault.c | ||
|---|---|---|
| 1809 | Well, the read fault is impossible, because the fault type is defined by the mapping permissions, not just the hardware faulting mode. So if it is the same virtual page, then both faults would proceed this as a write fault. | |
| sys/vm/vm_fault.c | ||
|---|---|---|
| 1809 | Sorry, I don't understand. Suppose the MAP_ENTRY_NEEDS_COPY is clear and there is already a shadow object O with backing object O'. Suppose there is some resident page m in O' that is not mapped into the process. Suppose there is no resident page at that address in O. Let the write-faulting thread be T1, and the read-faulting thread be T2. Then: T2: 
 T1: 
 T2: 
 Now, subsequent reads of the page will not show the modification done by T1. | |
| sys/vm/vm_fault.c | ||
|---|---|---|
| 1809 | kib pointed out on IRC that these threads are serialized by the xbusying the top-level page. I forgot that we hold the xbusy lock for the entire duration of the fault, even for read faults. | |
| sys/vm/vm_fault.c | ||
|---|---|---|
| 1774 | This means that different processes faulting on the same top-level object will be serialized when faulting on the same page. So, if the top-level object is shared, i.e., OBJ_ONEMAPPING is clear, do we really need the extra synchronization from the fs.first_object read lock? It looks like we can just rely on the existing "racy" check in vm_fault_cow() after all. We just change the justification a little bit: instead of relying on the xbusy of m_cow to provide serialization, we rely on the xbusy of fs.first_m to ensure that a second process will not create a read-only pmap entry for m_cow. | |
| sys/vm/vm_fault.c | ||
|---|---|---|
| 1883 | Can we also assert something like, fs.first_m == fs.m || vm_page_xbusied(fs.first_m) here? | |
Remove extension of the locked region for first_object.
Add assert for xbusy of first_m.
Switch assert from checking first_m != m to first_object != object. first_m is NULL when there is no shadow chain.
| sys/vm/vm_fault.c | ||
|---|---|---|
| 1542 | Why disallow shared busy after the first failed attempt? | |
| sys/vm/vm_fault.c | ||
|---|---|---|
| 1542 | I want(ed) to have the code path to fall back to pristine old way if sbusy failed. I think we can remove can_sbusy indeed. One note that vm_fault_busy_sleep() should be more flexible, I will add allocflags argument. | |
Eliminate can_sbusy.
Only sleep for sbusy when needed, in vm_fault_busy_sleep().
While there, clean up includes.
| sys/vm/vm_fault.c | ||
|---|---|---|
| 1774 | The xbusy of first_m also serializes page faults through other mappings of the same top-level object. So, in the end maybe we do not need the object lock to synchronize the OBJ_ONEMAPPING check: 
 In other words, I think the argument in the comment "The flag check is racy, but this is tolerable..." can be changed to apply to the sbusy case too. Instead of using the xbusy state of m_cow to make the argument, we can use the xbusy state of first_m to argue that the racy check is safe. Then we do not need any special handling for the sbusy case. Am I missing something? | |
| sys/vm/vm_fault.c | ||
|---|---|---|
| 1774 | By the special handling for the sbusy case, do you mean my extent of the fs->object lock after the vm_fault_cow() is done? I believe yes, because you said in the second sentence that 'we do not need the object lock'. The fs->object locking purpose was not to handle the OBJ_ONEMAPPING complication. There are checks in vm_fault_object() that should prevent vm_fault_cow() from copying the page from fs->object to fs->first_object, done by vm_fault_can_cow_rename(). I need a way to ensure that the invariants are kept true between the check in vm_fault_object() and the action in vm_fault_cow(). I believe that object collapse might break this. Even if it cannot now, still the checked conditions (refcount != 1 or shadow_count != 1) are delicate enough that collapse implementation should be allowed enough freedom so we do not need to rely on its internals to keep that true. | |
| sys/vm/vm_fault.c | ||
|---|---|---|
| 1774 | 
 Right. 
 Ok, but really my point was that there is no OBJ_ONEMAPPING complication, or so I believe. That is, I think the sbusy case does not need to check (fs.first_object->flags & OBJ_ONEMAPPING) != 0. Or, I do not understand the justification for checking it. 
 Why not perform the checks racily, and disallow renaming if !vm_page_xbusied(fs->m)? 
 Even today, the ref_count == 1 and shadow_count == 1 checks are done without the object lock held. I can't remember offhand why that is safe. | |
| sys/vm/vm_fault.c | ||
|---|---|---|
| 1774 | 
 Ok, I reduced the vm_fault_fo_is_onemapping() to vm_fault_might_be_cow(). Despite trivial, it is useful IMO, all places where I replaced direct object != first_object checks become easier to understand. 
 Ok. 
 I added the re-check after the locks are acquired. | |
Remove the OBJ_ONEMAPPING blocker from the sbusy path.
Re-check CoW rename possibility after all object's locks are acquired.
I think the change is ok, my notes are mostly about code comments.
| sys/vm/vm_fault.c | ||
|---|---|---|
| 1119 | I think this observation can be used to simplify the whole comment. I can submit a patch for that after this work is landed. | |
| 1522 | ||
| 1523 | ||
| 1538 | ||
| 1548 | Maybe, "Now make sure that racily checked conditions are still valid" | |
| 1549 | I think this comment about ONEMAPPING is stale? | |
| 1553 | Shouldn't the predict_true apply to the whole predicate? | |
| 1560 | I wonder if it's worth maintaining a counter here? | |
| sys/vm/vm_fault.c | ||
|---|---|---|
| 1560 | Which counter? The count of times we tried to use sbusy but after relock found that conditions are not true? I do not see it much interesting. I had a debug part of this patch that counted successful sbusy cases, and then even more fine-grained list of conditions for each reason that sbusy cannot be used. I do not think it is very useful for either production or in-field debugging. | |
Comment updates.
Mark, do you want the 'In collaboration with: markj' tag for this change?
I have been following this, but have not given it a careful reading. I should be able to do so over the weekend.
fwiw I can confirm offcpu time is gone, instead almost all of it spent contending on pmap pv locks [i.e. execs still don't scale, but the bottleneck has moved. i don't have the old numbers around for comparison]
flamegraph:
https://people.freebsd.org/~mjg/fg/D51474-vs-shared-exec-104.svg
collected with:
$ rm /tmp/out.kern_stacks ; dtrace -x stackframes=100 -n 'profile-997 { @[curthread->td_wantedlock != NULL ? stringof(curthread->td_wantedlock->lo_name) : stringof("\n"), stack()] = count(); }' -o /tmp/out.kern_stackswhile running http://apollo.backplane.com/DFlyMisc/doexec.c
$ cc -O2 -o shared-doexec doexec.c $ ./shared-doexec $(nproc)
Peter tested the change. I intend to commit it shortly, at the beginning of the next week.
I am happy to see how much simpler this has gotten, particularly the elimination of the range lock.
| sys/vm/vm_fault.c | ||
|---|---|---|
| 1523 | ||
| 1524 | Historically, we have written "COW", not "CoW". I only found one instance of the latter under vm/. | |
| 1533 | ||
| 1534 | ||
| 1535 | ||
| 1772 | While it would be highly unusual, the top-level object could be read-write-shared between address spaces, so I don't see the reason for including "in our vm_map" here. | |