Details
- Reviewers: alc, markj, cem, jhb
- Commits: rS299788: Eliminate pvh_global_lock from the amd64 pmap.

Diff Detail
- Repository: rS FreeBSD src repository (subversion)
Event Timeline
Shouldn't there be pmap_delayed_invl_page() calls in pmap_remove_pages() where the PV entries are destroyed?
Doesn't this mean that rw_rlock(&pvh_global_lock) in pmap_remove_pages() is not enough, and that we should use rw_wlock() there? Indeed, I do not immediately see why the situation explained in the citation in the revision description is impossible with the rlock held there.
It is impossible because the current pmap is single-threaded. Then why would we need pmap_delayed_invl_page() there as well? There cannot be a third thread still modifying a page whose PV list is empty but whose mapping we have not yet invalidated. If, e.g., the page daemon starts reusing a page for which we cleared the PV list, nothing bad can happen to that page: curthread never returns to usermode with that pmap active.
Yes, I agree. That the current pmap is single-threaded is also the reason why we don't use atomic load-and-clear to destroy page table entries in pmap_remove_pages().
Taken to the limit, your argument would also say that the pmap_invalidate_all() at the end of pmap_remove_pages() is unnecessary, but don't we actually return to user mode with the same pmap after an execve? So pmap_invalidate_all() is required.
It appears to me that many of the use cases for pmap_delayed_invl_{started,finished}() are to account for the possibility that a failed demotion destroys a managed superpage mapping. However, I claim that in some of these cases, like pmap_unwire(), the calls to pmap_delayed_invl_{started,finished}() are unnecessary because the PV list lock is held until after the TLB invalidate is performed.
Well, pmap_unwire() does not change the mapping at all, and its header comment explains this, noting that no TLB invalidation is needed. So it does not really matter whether some invalidation might be done under the PV lock (in fact, none is).
If not braced with pmap_delayed_invl_started(), pmap_delayed_invl_page() becomes a nop. It was a very long time ago that I thought about this decision, but I have some memory that I initially put an assert there, and then changed it to just do nothing.
I reverted the addition of braces to pmap_remove_pages() and removed them from pmap_unwire() as well, with a comment.
pmap_unwire() may not be the best example, but for a different reason than what you cite. A demotion can fail because it cannot find an existing, reserved page table page or allocate a new one. In such cases, it destroys the mapping, including a call to pmap_invalidate_page(). However, the reason that pmap_unwire() is not the best example is that demotion of wired mappings should never fail in this way.
I changed the comment to:

 * Since pmap_demote_pde() for the wired entry must never fail,
 * pmap_delayed_invl_started()/finished() calls around the
 * function are not needed.
Take a look at pmap_protect(). I claim that it doesn't need the calls to pmap_delayed_invl_{started,finished}() for the aforementioned reason.
Yes, it does not need the braces, but IMO that is because the invalidation is done under the pmap lock. Were the lock dropped, we would have a similar three-way race with pmap_remove_write(), and BTW this is the reason for the pmap_delayed_invl_wait() in pmap_remove_write().
I removed braces in my repo.
Where managed mappings are involved, I would never even think of performing the TLB invalidation after releasing the pmap lock. It would make the already difficult reasoning about the pmap layer even harder. :-)
In regards to pmap_remove_write(), I agree with what you've done. Concurrent calls to pmap_remove() and pmap_remove_write() are potentially problematic without the call to pmap_delayed_invl_wait() in pmap_remove_write().
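To make the ordering concrete, the following fragment is a sketch of where pmap_delayed_invl_wait() fits, under the assumption that the tail of pmap_remove_write() keeps its current shape; it is not a quote of the patch.

	/*
	 * Sketch: all writable mappings of m have been destroyed or
	 * write-protected under the PV list lock.  Before the page is
	 * advertised as no longer writeable, wait for any delayed
	 * invalidation block that may still leave a stale writable TLB
	 * entry for m.
	 */
	rw_wunlock(lock);
	vm_page_aflag_clear(m, PGA_WRITEABLE);
	pmap_delayed_invl_wait(m);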
sys/amd64/amd64/pmap.c:4282
  I don't think that pmap_enter() requires pmap_delayed_invl_{started,finished}(). I believe that every case that destroys a mapping, and its PV entry, and requires a TLB invalidation does the TLB invalidation before releasing the PV list lock, so it is safe.
Remove the pmap_delayed_invl_*() calls from pmap_enter(); add wording to the pmap_enter() header comment explaining why they are not needed.
Since pmap_enter() does not do invl_start, _pmap_allocpte() must not unconditionally drop DI around VM_WAIT.
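A minimal sketch of what this constraint implies for _pmap_allocpte(), assuming the macro names (RELEASE_PV_LIST_LOCK, PMAP_ASSERT_NOT_IN_DI) used elsewhere in the pmap; this illustrates the invariant and is not the patch hunk itself.

	if (lockp != NULL) {
		RELEASE_PV_LIST_LOCK(lockp);
		PMAP_UNLOCK(pmap);
		/*
		 * pmap_enter() no longer opens a DI block, so there is
		 * nothing to drop and reacquire around the sleep; assert
		 * that instead of unconditionally calling
		 * pmap_delayed_invl_finished()/started().
		 */
		PMAP_ASSERT_NOT_IN_DI();
		VM_WAIT;
		PMAP_LOCK(pmap);
	}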
sys/amd64/amd64/pmap.c:2395-2396
  With the removal of the pmap_delayed_invl_{started,finished}() calls from pmap_enter(), delay_invl will never be "true" here. That said, a fair question is whether this function would ever (in the future) be used in a setting where there was a delayed invalidation active. I haven't come to a conclusion. My initial reaction is "no".

sys/amd64/amd64/pmap.c:2871
  I've led you to create a problem here. With the removal of the pmap_delayed_invl_{started,finished}() calls from pmap_enter(), we don't have an active delayed invalidation record when we get here, but we need one. This is, in fact, a problematic delayed invalidation case. However, it is such a rare case that we shouldn't penalize pmap_enter() by performing pmap_delayed_invl_{started,finished}() there to solve it. Is there any reason that we can't do the pmap_delayed_invl_{started,finished}() calls locally here?
sys/amd64/amd64/pmap.c:2871
  I only called _started() and _finished() outside of the pmap- or pv-chunk-locked regions. In fact, there is currently no reason to maintain this; the invl_gen_mtx is a leaf lock. I wanted to eliminate invl_gen_mtx entirely, and even looked at some of the literature on complex lock-less structures.
Remove newly added braces in _pmap_allocpte().
In reclaim_pv_chunk(), enter DI braces. I temporarily drop and immediately reacquire DI after the invalidation in the pmap change code, to avoid clogging a large queue of waiters.
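Roughly, the pattern being described is the following; this is a sketch under the assumption that reclaim_pv_chunk() still invalidates the previous victim pmap when it moves to the next chunk, not the literal patch hunk.

		if (pmap != NULL) {
			pmap_invalidate_all(pmap);
			if (pmap != locked_pmap)
				PMAP_UNLOCK(pmap);
			/*
			 * Close the current DI block right after the
			 * invalidation and immediately open a new one, so
			 * that threads waiting on the old generation can
			 * proceed instead of waiting for the whole reclaim
			 * pass to finish.
			 */
			pmap_delayed_invl_finished();
			pmap_delayed_invl_started();
		}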
sys/amd64/amd64/pmap.c:5778
  I don't see a need for pmap_delayed_invl_{start,finish}ed() here either. The invalidations are performed before the PV list lock is released.
sys/amd64/amd64/pmap.c:537-541
  I believe that the corresponding PV list lock is held whenever we call pmap_delayed_invl_page(), so we don't need to use an atomic cmpset here.
Claim that md_page.invl_gen is protected by the pv list lock, and assert this. Remove the cmpxchg in pmap_delayed_invl_page() (but keep the release semantic of the update).
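For illustration, a sketch of what pmap_delayed_invl_page() looks like after this change; the field and macro names (td_md.md_invl_gen, VM_PAGE_TO_PV_LIST_LOCK) are assumed from the pmap's existing conventions. With the PV list lock held, a plain compare suffices, while the store keeps release semantics for the benefit of the lock-free reader in pmap_delayed_invl_wait().

static void
pmap_delayed_invl_page(vm_page_t m)
{
	u_long gen;

	rw_assert(VM_PAGE_TO_PV_LIST_LOCK(m), RA_WLOCKED);
	gen = curthread->td_md.md_invl_gen.gen;
	if (gen == 0)
		return;		/* Not inside a DI block. */
	if (m->md.invl_gen < gen)
		atomic_store_rel_long(&m->md.invl_gen, gen);
}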
sys/amd64/include/pmap.h:291
  I'm always reluctant to increase the size of struct vm_page. Instead of a per-page invl_gen, I think that we should consider a per-PV list lock invl_gen.
sys/amd64/include/pmap.h:291
  I agree that increasing the md_page size is not ideal. I can make an array of generations parallel to pv_list_locks[] (sketched below). But a per-PV list lock generation would mean collisions between generations, causing waiters to loop waiting for the invalidation of unrelated addresses. Do you still want me to code this?
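For concreteness, such a parallel array and its lookup helper could look roughly as follows; the hashing by pa_index() modulo NPV_LIST_LOCKS mirrors how the PV list locks themselves are selected, but the exact names are an assumption.

/* One DI generation per PV list lock, parallel to pv_list_locks[]. */
static u_long pv_invl_gen[NPV_LIST_LOCKS];

/*
 * Map a page to the DI generation slot corresponding to its PV list lock.
 * Pages that hash to the same lock share a generation, which is the source
 * of the collisions mentioned above.
 */
static u_long *
pmap_delayed_invl_genp(vm_page_t m)
{

	return (&pv_invl_gen[pa_index(VM_PAGE_TO_PHYS(m)) % NPV_LIST_LOCKS]);
}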
My hypothesis is that the use cases for this new delayed invalidation synchronization primitive are now infrequent enough that spurious waits are rare.
Can we get pho@ to test this hypothesis?
How could this be arranged? Restore md_page.invl_gen in addition to pv_invl_gen[], and, in invl_wait(), increment a counter whenever pv_invl_gen[i] is greater than md_page.invl_gen?
sys/amd64/amd64/pmap.c:4864-4866
  Style: Simplify to "} else if (!pmap_demote...".
With the following patch applied
diff --git a/sys/amd64/amd64/pmap.c b/sys/amd64/amd64/pmap.c
index d25c6c3..69d5ee5 100644
--- a/sys/amd64/amd64/pmap.c
+++ b/sys/amd64/amd64/pmap.c
@@ -484,6 +484,12 @@ pmap_delayed_invl_finished(void)
 static long invl_wait;
 SYSCTL_LONG(_vm_pmap, OID_AUTO, invl_wait, CTLFLAG_RD, &invl_wait, 0, "");
+static long invl_wait_coll1;
+SYSCTL_LONG(_vm_pmap, OID_AUTO, invl_wait_coll1, CTLFLAG_RD, &invl_wait_coll1, 0,
+    "");
+static long invl_wait_coll2;
+SYSCTL_LONG(_vm_pmap, OID_AUTO, invl_wait_coll2, CTLFLAG_RD, &invl_wait_coll2, 0,
+    "");
 #endif
 
 static u_long *
@@ -522,6 +528,9 @@ pmap_delayed_invl_wait(vm_page_t m)
 			accounted = TRUE;
 		}
 #endif
+		if (m->md.invl_gen <= pmap_invl_gen)
+			atomic_add_long(&invl_wait_coll1, 1);
+		atomic_add_long(&invl_wait_coll2, 1);
 		kern_yield(PRI_USER); /* XXX */
 	}
 }
@@ -548,6 +557,8 @@ pmap_delayed_invl_page(vm_page_t m)
 	m_gen = pmap_delayed_invl_genp(m);
 	if (*m_gen < gen)
 		atomic_store_rel_long(m_gen, gen);
+	if (m->md.invl_gen < gen)
+		m->md.invl_gen = gen;
 }
 
 /*
diff --git a/sys/amd64/include/pmap.h b/sys/amd64/include/pmap.h
index 47d81ca..e96c635 100644
--- a/sys/amd64/include/pmap.h
+++ b/sys/amd64/include/pmap.h
@@ -292,6 +292,7 @@ struct md_page {
 	TAILQ_HEAD(, pv_entry)	pv_list;  /* (p) */
 	int			pv_gen;   /* (p) */
 	int			pat_mode;
+	u_long			invl_gen; /* (p) long to avoid wraparounds. */
 };
 
 enum pmap_type {
Peter's testing shows:
>> vm.pmap.invl_wait_coll2: 5228265
>> vm.pmap.invl_wait_coll1: 5156965
>> vm.pmap.invl_wait: 705
which is convincing enough that the per-PV-list-lock DI generation counter gives wait times similar to those of the per-page DI generation.
sys/amd64/amd64/pmap.c:485
  Do you want to keep this counter in the committed version of the patch? If so, please add a description here. (In my opinion, it's worth keeping.)

sys/amd64/amd64/pmap.c:524
  What do you want to do about the "XXX" here?

sys/amd64/include/pmap.h:287-295
  You might as well commit this change now. There is nothing here related to the patch anymore.
sys/amd64/amd64/pmap.c:485
  I did not plan to keep the counter in the committable version of the patch; I do not see much sense in having an unneeded atomic. For patch development, the counter gives an understanding of whether the situation occurs at all, and how frequent it is. If kept, it should be moved out of INVARIANTS, IMO.

sys/amd64/amd64/pmap.c:524
  When I wrote the patch, I believe this was put there as a mark to consider precise synchronization instead of spinning. Now I think that spinning with kern_yield() is good enough there, for SMP machines. I am not quite convinced that this would work for single-CPU configs. First, I do think that the problem is present on UP: the CPU must switch between user threads 1 and 2 and the pagedaemon for a stale TLB entry to be used for a page with an empty PV list. Seemingly, the context switches would do the invalidation, because the pagedaemon uses the kernel pmap, but things like PCID bring the problem back. So, for UP we must still provide the DI mechanism and spin, and if we spin, priority inversion could kill us on UP. It is quite possible that this should be coded as

	if (mp_ncpus == 1)
		pause("pmapdi", 1);
	else
		kern_yield(PRI_USER);

  (after the experience with mnt_vnode_next_active()), but this is something for Peter to test and confirm. A fuller sketch of the wait loop follows below.

sys/amd64/include/pmap.h:287-295
  Will do tomorrow.
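The following is a minimal sketch of how pmap_delayed_invl_wait() could incorporate that UP special case; the structure follows the discussion above and is an assumption, not the final patch (this approach is revisited later in the thread).

static void
pmap_delayed_invl_wait(vm_page_t m)
{
	u_long *m_gen;

	m_gen = pmap_delayed_invl_genp(m);
	while (*m_gen > pmap_invl_gen) {
		if (mp_ncpus == 1) {
			/*
			 * On UP, yielding may never let a lower-priority
			 * DI owner run, so briefly leave the CPU instead.
			 */
			pause("pmapdi", 1);
		} else
			kern_yield(PRI_USER);
	}
}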
sys/amd64/amd64/pmap.c:430-433
  "... thread. Within a DI block, the current thread may destroy both the page table and PV list entries for a mapping and then release the corresponding PV list lock before ensuring that the mapping is flushed from the TLBs of any processors with the pmap active."
sys/amd64/amd64/pmap.c:485
  How about conditioning it on PV_STATS?
sys/amd64/amd64/pmap.c:485
  This looks fine, will do.
Update patch with all changes from yesterday:
- comments
- use bool for accounted
- use PV_STATS instead of INVARIANTS for the vm.pmap.invl_wait sysctl
- UP blocking
sys/amd64/amd64/pmap.c:458-459
  I don't think that the sentence "No other processor ..." adds anything to the previous sentence. I would drop it.

sys/amd64/amd64/pmap.c:461
  "This function works by bumping the global DI generation number to the ..."

sys/amd64/amd64/pmap.c:462
  "generation number of the current thread's DI, unless there is a pending DI that started earlier. In the latter case, bumping the global DI generation number would incorrectly signal that the earlier DI had finished. Instead, this function bumps the earlier DI's generation number to match the generation number of the current thread's DI."
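To illustrate the mechanism that this proposed comment describes, here is a rough sketch of pmap_delayed_invl_started()/finished() at this stage of the patch. It assumes a mutex-protected LIST of per-thread DI records and omits waiter wakeup and assertions; it is a simplification, not the committed code.

struct pmap_invl_gen {
	u_long gen;			/* DI generation of this block */
	LIST_ENTRY(pmap_invl_gen) link;
};

static struct mtx invl_gen_mtx;		/* initialized with MTX_SYSINIT */
static u_long pmap_invl_gen = 0;	/* last completed DI generation */
static LIST_HEAD(, pmap_invl_gen) pmap_invl_gen_tracker =
    LIST_HEAD_INITIALIZER(pmap_invl_gen_tracker);

static void
pmap_delayed_invl_started(void)
{
	struct pmap_invl_gen *invl_gen;
	u_long currgen;

	invl_gen = &curthread->td_md.md_invl_gen;
	mtx_lock(&invl_gen_mtx);
	if (LIST_EMPTY(&pmap_invl_gen_tracker))
		currgen = pmap_invl_gen;
	else
		currgen = LIST_FIRST(&pmap_invl_gen_tracker)->gen;
	invl_gen->gen = currgen + 1;	/* this block's generation */
	LIST_INSERT_HEAD(&pmap_invl_gen_tracker, invl_gen, link);
	mtx_unlock(&invl_gen_mtx);
}

static void
pmap_delayed_invl_finished(void)
{
	struct pmap_invl_gen *invl_gen, *next;

	invl_gen = &curthread->td_md.md_invl_gen;
	mtx_lock(&invl_gen_mtx);
	next = LIST_NEXT(invl_gen, link);
	if (next == NULL) {
		/* No earlier pending DI: the global number may advance. */
		pmap_invl_gen = invl_gen->gen;
	} else {
		/*
		 * An earlier DI is still pending; bumping the global
		 * number now would falsely signal that it has finished.
		 * Hand this block's generation to the earlier DI so that
		 * it is published when that DI finishes.
		 */
		next->gen = invl_gen->gen;
	}
	LIST_REMOVE(invl_gen, link);
	mtx_unlock(&invl_gen_mtx);
	invl_gen->gen = 0;
}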
I'm done with all of the comments.
sys/amd64/amd64/pmap.c:500
  I would chop off the leading "The ... function" and just start with "Ensures". This style is consistent with the earlier comments.

sys/amd64/amd64/pmap.c:501-502
  Two comments here:
  - You are restricting the set of DI blocks to those that invalidate mappings to the given page.

sys/amd64/amd64/pmap.c:503
  "... has an empty PV list and we call pmap_delayed_invl_wait(), upon its return we ..."

sys/amd64/amd64/pmap.c:504-505
  "page m in either its page table or TLB."

sys/amd64/amd64/pmap.c:507
  "This function works by spinning until the global DI generation number catches up with the generation number associated with the given page m and its PV list."

sys/amd64/amd64/pmap.c:536
  "Mark the page m's PV list ... current thread's DI"

sys/amd64/amd64/pmap.c:537-538
  "block. Any threads concurrently using m's PV list to remove or restrict all mappings to m will wait for the current thread's DI block to complete before proceeding."

sys/amd64/amd64/pmap.c:540
  "... setting the DI generation number for m's PV list to at least the number for the current thread."

sys/amd64/amd64/pmap.c:543
  "the current thread calls ..."

sys/amd64/amd64/pmap.c:4222-4223
  "When destroying both a page table and PV entry, this function performs the TLB ..."

sys/amd64/amd64/pmap.c:4225
  "... here."

sys/amd64/amd64/pmap.c:5761
  "A DI block is not needed within this function, because ..."
So it appeared that the UP case is indeed problematic. The obvious problem with the previous patch is that pause(9) cannot be called in pmap_delayed_invl_wait(), because we typically own both the object and page locks when doing the wait. OTOH, we cannot spin with kern_yield(), because the threads which own DI generation numbers sometimes have lower priority than us, and the machine livelocks.
I do not see any way around this other than to stop spinning and temporarily remove the waiting thread from the CPU. We cannot sleep, because we own mutexes, and dropping the locks is not feasible. I talked briefly with jhb about suitable KPIs to use; the two variants are turnstiles without an owner, or direct use of the sched(9) KPI in the style of ithreads. The updated patch directly removes and re-adds the waiting thread to the runqueue, similarly to the ithreads code, and just abuses the IWAIT inhibitor for blocking.
Patch was mailed to Peter for smoke-testing.
The previous attempt to abuse IWAIT to block a thread owning the vm object lock and possibly a page lock did not work out. If another thread tried to acquire those locks, propagate_priority() failed when it found a thread that was not runnable and not blocked on a lock, but inhibited. That could be hacked around, but I think it was a good indicator that the approach is too hackish.
Instead, I changed the code to put the thread on a turnstile. This also has the effect of putting the pmap_invl_gen changes under our turnstile's ts_lock.
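For orientation, the wait side of the turnstile approach could look roughly like the following sketch; the invl_gen_ts lock object name is an assumption, and the matching turnstile_broadcast() in pmap_delayed_invl_finished() is omitted.

static struct lock_object invl_gen_ts = {
	.lo_name = "invlts",
};

static void
pmap_delayed_invl_wait(vm_page_t m)
{
	struct turnstile *ts;
	u_long *m_gen;

	m_gen = pmap_delayed_invl_genp(m);
	while (*m_gen > pmap_invl_gen) {
		/*
		 * Block on an ownerless turnstile instead of spinning;
		 * pmap_delayed_invl_finished() broadcasts on the same
		 * turnstile after publishing the new global generation.
		 */
		ts = turnstile_trywait(&invl_gen_ts);
		if (*m_gen > pmap_invl_gen)
			turnstile_wait(ts, NULL, TS_SHARED_QUEUE);
		else
			turnstile_cancel(ts);
	}
}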
The turnstile KPI use looks OK to me. My only request, as I noted to Konstantin on IRC, is to have an overview comment explaining why this is done.
Peter did extensive testing of the latest version of the patch, including specialized checks to see if there is a blocked thread left after the DI situation due to lost wakeups. No issues were seen.
WRT John's request to explain the issue that is solved by DI: first, the introductory comment before pmap_delayed_invl_started() explains what it is. Second, I will cite the letter from Alan, which was used as the introduction for this revision, in the commit log. Do you think any more details are needed?
Overall, this looks like a concluding stage.
Update comments, mostly about locking and blocking.
Remove the now-useless _rel atomic; a plain write is enough now.
Yes, I think Alan's original mail is the thing I'm after. Mainly I want to document why this change is being made (and why to use a more complex construct in place of a rwlock).
sys/amd64/amd64/pmap.c:520-523
  "Since this function's callers typically own an object lock and sometimes own a page lock, it cannot sleep. Instead, it blocks on a turnstile to relinquish the processor."