
create straightforward EBR wrapper with rudimentary support for preemption
Abandoned · Public

Authored by kmacy on Apr 30 2018, 7:21 AM.

Details

Summary

This handles setup and teardown, pcpu references, and preemption in the middle of an EBR critical section.

Example usage is replacing the ad hoc EBR in amd64/pmap.c with this:
https://github.com/mattmacy/networking/commit/5f36e665fecf53eb3c8c39cce23fd67317b88e6c
https://github.com/mattmacy/networking/commit/5ea3b95e50cc2f9787c2a8cee24979f1f3162211

This removes the global serialization of all pmap_remove() calls (and hence of all munmap() calls) on invl_gen_mtx.
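
As a rough illustration of the intended shape (the ebr_* names below are hypothetical, chosen only to sketch the interface; they do not necessarily match the symbols in this diff), the amd64 delayed-invalidation hooks reduce to entering and leaving an EBR section and waiting for readers:

/*
 * Kernel-only sketch; vm_page_t and the pmap_delayed_invl_* hooks come
 * from the existing amd64 pmap, everything prefixed ebr_ is hypothetical.
 */
struct ebr;				/* opaque wrapper state */

struct ebr *ebr_alloc(void);		/* setup: per-CPU ck_epoch records */
void	ebr_free(struct ebr *);		/* teardown */
void	ebr_enter(struct ebr *);	/* begin a section; survives preemption */
void	ebr_exit(struct ebr *);		/* end the section */
void	ebr_wait(struct ebr *);		/* block until all active sections exit */

static struct ebr *pmap_di_ebr;

void
pmap_delayed_invl_started(void)
{
	ebr_enter(pmap_di_ebr);		/* no invl_gen_mtx, no global generation list */
}

void
pmap_delayed_invl_finished(void)
{
	ebr_exit(pmap_di_ebr);
}

void
pmap_delayed_invl_wait(vm_page_t m __unused)
{
	ebr_wait(pmap_di_ebr);		/* wait out readers instead of polling the page's DI generation */
}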

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Lint Skipped
Unit
Tests Skipped
Build Status
Buildable 16348

Event Timeline

Where is the documentation for the ck epoch, both interface and implementation?

Please show numbers demonstrating that invl_gen_mtx is problematic. When I developed DI to get rid of the global pv lock, I remember that the turnstile wait showed up significantly in the profile, which is expected and most likely not changed by your patch, because it does not change the underlying problematic scenario of pv unlock without TLB invalidation. But I do not remember the generation count creation showing any measurably bad numbers.

sys/kern/subr_epoch.c
147

This is redundant.

In D15231#321789, @kib wrote:
Where is the documentation for the ck epoch, both interface and implementation.

There are man pages for all the ck_ functions. How many _implementations_ are documented with man pages in FreeBSD?

Please show numbers demonstrating that invl_gen_mtx is problematic.

http://scalebsd.org/blog/2018/05/03/Why-is-munmap-single-threaded-and-what-can-we-do-about-it

It looks like the man pages aren't installed for ck. I'll submit a review for that.

Actually - @cognet could you please fix man page installation for ck?

Please show numbers demonstrating that invl_gen_mtx is problematic.

http://scalebsd.org/blog/2018/05/03/Why-is-munmap-single-threaded-and-what-can-we-do-about-it

This is quite interesting, because you set up a corner case where there is not a single valid PTE in the ranges passed to pmap_remove(). As a result, the whole work performed by pmap_remove() was to set up the DI region, lock the pmap, and then exit. Of course, the setup cost of the DI dominated then. I believe that the following patch (which is, in my opinion, orthogonal to your change; see the very end of the reply) would be a better optimization for your case:

diff --git a/sys/vm/vm_map.c b/sys/vm/vm_map.c
index 1ce2435fe3c..1dc6143ef98 100644
--- a/sys/vm/vm_map.c
+++ b/sys/vm/vm_map.c
@@ -3158,7 +3158,9 @@ vm_map_delete(vm_map_t map, vm_offset_t start, vm_offset_t end)
                        vm_map_entry_unwire(map, entry);
                }
 
-               pmap_remove(map->pmap, entry->start, entry->end);
+               if ((entry->eflags & MAP_ENTRY_IS_SUB_MAP) != 0 ||
+                   entry->object.vm_object != NULL)
+                       pmap_remove(map->pmap, entry->start, entry->end);
 
                /*
                 * Delete the entry only after removing all pmap

In the practically important case, unmap is done over a region that has instantiated mappings, so we do lock the pv locks, incur the shootdown IPI latency, and do wait for the epoch end (in the form of the DI generation wait). When I looked at the lock profile before/after the pvh global lock removal, the total pvh global lock wait time was ~10x the invl_gen_ts turnstile wait time, and the invl_gen_mtx wait time was minuscule compared with the turnstile wait.

Another thing that I want to note is that the TLB invalidation epoch is really per-thread and not per-CPU. It seems (I admit that I did not fully read your implementation + ck) that you hack around this by pinning the thread to the CPU for the DI region. But then, for the real case of unmap with instantiated PTEs, you end up taking at least the pmap and pv locks with the thread pinned. This usually does not end up with good performance, and might even cause deadlocks or at least livelocks.
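
(For concreteness, the pinning pattern I mean is roughly the sketch below. This is my reading of the approach, not code taken from the diff, and the record array and di_* names are only illustrative; it assumes the usual kernel headers plus ck_epoch.h.)

/* Per-CPU records, registered with the global ck_epoch at init time. */
static ck_epoch_record_t di_record[MAXCPU];

static void
di_enter(ck_epoch_section_t *section)
{
	sched_pin();			/* the thread must not migrate until di_exit() */
	ck_epoch_begin(&di_record[curcpu], section);
}

static void
di_exit(ck_epoch_section_t *section)
{
	ck_epoch_end(&di_record[curcpu], section);
	sched_unpin();
}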

Still, I believe that some way to eliminate invl_gen_mtx could be beneficial.

@kib Thanks. If I touch a page in the mapping, that would be more representative, correct? Page faults appear to be globally serialized by the per-domain reservations mutex, so that would likely scale no better than the no-op case without your change or mine.

I think you might try to set up a POSIX shared memory object, map it once, and touch every page to instantiate the memory. Then do an mmap(MAP_PREFAULT_READ)/munmap() loop for the measurement. The prefault flag should fill at least some PTEs without incurring the cost of the page faults.
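
Something along these lines, for example (a sketch only: SHM_ANON, the 1 GB size, and the iteration count are arbitrary, and the measurement loop would be run concurrently from several threads so that the serialization of munmap() becomes visible):

#include <sys/mman.h>

#include <err.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
	const size_t sz = 1UL << 30;	/* 1 GB object */
	char *p;
	int fd, i;

	/* Anonymous POSIX shared memory object, sized and instantiated once. */
	fd = shm_open(SHM_ANON, O_RDWR, 0600);
	if (fd == -1)
		err(1, "shm_open");
	if (ftruncate(fd, sz) == -1)
		err(1, "ftruncate");
	p = mmap(NULL, sz, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED)
		err(1, "mmap");
	memset(p, 1, sz);		/* touch every page to instantiate the memory */

	/*
	 * Measurement loop: MAP_PREFAULT_READ fills PTEs from the
	 * already-resident pages, so munmap() has real mappings to tear
	 * down without paying for page faults on each iteration.
	 */
	for (i = 0; i < 100000; i++) {
		p = mmap(NULL, sz, PROT_READ,
		    MAP_SHARED | MAP_PREFAULT_READ, fd, 0);
		if (p == MAP_FAILED)
			err(1, "mmap");
		if (munmap(p, sz) == -1)
			err(1, "munmap");
	}
	return (0);
}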

On a related note, take a look at https://www.cc.gatech.edu/~smaass3/papers/latr-paper.pdf for a proposed approach to hiding the latency of TLB shootdown on munmap(2).

One needs Jeff's uncommitted locking changes to see any benefit from this if one touches any pages in the mapping, and even then it's only a ~20% win because there is then so much contention on the vm page queue mutex. Jeff is hesitant to sink too much energy into this on behalf of a microbenchmark. There are clear wins to be had elsewhere without the complexity and risk of the VM changes, so I'm dropping this for now.