
Make vm_map a threaded tree
Closed, Public

Authored by dougm on Oct 10 2019, 4:47 AM.

Details

Summary

Change the vm_map tree so that it is threaded: where a left-child pointer is currently NULL, point instead to the previous item, and where a right-child pointer is currently NULL, point to the next item. Where there is no previous or next item, point to the map header.

Remove the prev pointer, and use a simple walk (of expected length 2) over the left and right pointers to find the predecessor. Take more opportunities to keep trailing pointers when iterating forward, to avoid unnecessary recalculation.

Remove the next pointer. Where it is used to keep a deferred list of map entries, use the right field instead. Where it is used to find the next entry in the ordered set, use a function to calculate that entry.
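
A sketch of the successor walk may make the scheme concrete (illustrative code only, not necessarily the committed version; header handling is elided). A right pointer is either a real child or a thread to the next entry; when it is a real child, the successor is the leftmost node of that subtree, whose own left thread points back at the starting entry, which is what terminates the descent. The predecessor walk is the mirror image.

static inline vm_map_entry_t
vm_map_entry_succ(vm_map_entry_t entry)
{
	vm_map_entry_t after;

	after = entry->right;
	if (after->left->start > entry->start) {
		/* Real right child: descend to its leftmost node. */
		do
			after = after->left;
		while (after->left != entry);
	}
	return (after);
}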

The change reduces the size of vm_map_entry, obviously. On a pair of tests that do nothing but create and free map entries, it reduces data cache miss counts and slightly increases run time.

Test 1


Test 2

Reran tests with ic-misses and retired instructions stats from pmcstat.

Test 1, before

s/00/dc-misses  s/00/ic-misses s/00/instructions       time
     640126288       189963218       68389420159       17.183823
     635966376       188903061       68383732671       17.136943
     645576968       187442527       68383619075       17.244985
     641578297       190145438       68383361081       17.186198
     643313499       193056507       68383326531       17.249235

Test 1, after

s/00/dc-misses  s/00/ic-misses s/00/instructions    time
      625740686        42856465       73497547534    17.382173
      615782289        41645451       73497370425    17.355385
      621791871        43549885       73497607344    17.412323
      621381530        46539739       73497547503    17.430227
      614402301        42532386       73497248665    17.307636

Test 2, before

s/00/dc-misses  s/00/ic-misses s/00/instructions       time 
     609155004       183879088       66508137909       16.320631
     615834730       184101900       66501366693       16.336226
     618390676       184364082       66501780532       16.400097
     609124953       184059016       66501644482       16.361776
     609934927       184430198       66501727209       16.349004

Test 2, after

s/00/dc-misses  s/00/ic-misses s/00/instructions    time
      581142919        43054092       71751406088    16.507261
      584860285        47687381       71744859944    16.575260
      578066986        42316363       71744755831    16.503378
      579061559        41838134       71744611514    16.543057
      593114238        40905403       71745027601    16.517714

Diff Detail

Lint
Lint Skipped
Unit
Tests Skipped

Event Timeline

startup_alloc from "UMA Zones", 4 boot pages left
startup_alloc from "UMA Zones", 3 boot pages left
startup_alloc from "UMA Hash", 2 boot pages left
startup_alloc from "UMA Zones", 1 boot pages left
Entering uma_startup1 with 0 boot pages left
panic: map 0xfffff80003002000 max = 7d8df000, max_left = 1ff80000000, max_right = 7d8df000
cpuid = 0
time = 1
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xffffffff824ef830
vpanic() at vpanic+0x19d/frame 0xffffffff824ef880
panic() at panic+0x43/frame 0xffffffff824ef8e0
_vm_map_assert_consistent() at _vm_map_assert_consistent+0x3e1/frame 0xffffffff824ef940
vm_map_splay() at vm_map_splay+0x176/frame 0xffffffff824ef9a0
vm_map_lookup_entry() at vm_map_lookup_entry+0xc7/frame 0xffffffff824ef9f0
vm_map_insert() at vm_map_insert+0x1b5/frame 0xffffffff824efae0
kmem_init() at kmem_init+0xab/frame 0xffffffff824efb10
vm_mem_init() at vm_mem_init+0x51/frame 0xffffffff824efb20
mi_startup() at mi_startup+0x210/frame 0xffffffff824efb70
btext() at btext+0x2c
KDB: enter: panic
[ thread pid 0 tid 0 ]
Stopped at      kdb_enter+0x3b: movq    $0,kdb_why
db> x/s version
version:        FreeBSD 13.0-CURRENT #1 r353388M: Thu Oct 10 12:56:20 CEST 2019\012    pho@mercat1.netperf.freebsd.org:/usr/src/sys/amd64/compile/PHO\012
db>

I ran all of the stress2 tests on 63130. I found one problem, which to me looks unrelated.
https://people.freebsd.org/~pho/stress/log/quota10-2.txt

What do you mean by idle map entries? The deferred list?

sys/vm/vm_map.c
977

Isn't this a do/while loop? You use do/while in most other places.

5011

Can you commit just the move of vm_map_assert_consistent so that any changes are visible?

dougm marked 2 inline comments as done.

Reformat vm_map_entry_{pred,succ}.

Incorporate changes just checked in to move the vm_map validation code.

In D21964#480590, @pho wrote:

I ran all of the stress2 tests on 63130. I found one problem, which to me looks unrelated.
https://people.freebsd.org/~pho/stress/log/quota10-2.txt

Peter,

Are you setting debug.vmmap_check to 1?

Restore the invariant-checking code that fell out of the last patch.

In D21964#481052, @alc wrote:
In D21964#480590, @pho wrote:

I ran all of the stress2 tests on 63130. I found one problem, which to me looks unrelated.
https://people.freebsd.org/~pho/stress/log/quota10-2.txt

Peter,

Are you setting debug.vmmap_check to 1?

Thank you for reminding me. I'll do that with the test of Diff 63267.

I read through this a couple of times and it seems fine to me. You might consider breaking this into several patches, the first of which introduces vm_map_entry_{pred,succ}() as wrappers for direct accesses of entry->next and entry->prev and converts existing code to use them.

sys/vm/vm_map.c
936

Isn't this the same as ulmax()?

I read through this a couple of times and it seems fine to me. You might consider breaking this into several patches, the first of which introduces vm_map_entry_{pred,succ}() as wrappers for direct accesses of entry->next and entry->prev and converts existing code to use them.

Thanks for the suggestion.

sys/vm/vm_map.c
936

I don't know.
When I look for a definition of vm_size_t, I find this:
#ifdef LP64
typedef uint64_t u_register_t;
typedef uint64_t vm_offset_t;
typedef uint64_t vm_paddr_t;
typedef uint64_t vm_size_t;
#else
typedef uint32_t u_register_t;
typedef uint32_t vm_offset_t;
typedef uint64_t vm_paddr_t;
typedef uint32_t vm_size_t;
#endif
so it's not obvious to me that vm_size_t is the same as unsigned long, for all architectures.
Besides, somebody could change it to unsigned long long someday, and make ulmax the wrong function.

sys/vm/vm_map.c
936

The two data models supported by FreeBSD are LP64 (in which long is 64 bits wide) and ILP32 (in which long is 32 bits). In these cases sizeof(unsigned long) == sizeof(vm_size_t).

I don't object to having an explicit vm_size_max(), but I suspect that the right long-term solution is to implement a generic max() and min() using typeof.
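
For illustration, a typeof-based generic max might look like the following; GENERIC_MAX is a hypothetical name, not existing FreeBSD API. The GCC/Clang statement expression evaluates each argument exactly once, and the macro adapts to whatever type vm_size_t happens to be.

#define GENERIC_MAX(a, b) __extension__ ({	\
	__typeof__(a) _ga = (a);		\
	__typeof__(b) _gb = (b);		\
	_ga > _gb ? _ga : _gb;			\
})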

Take out the parts that have been checked in separately. Leave in the renaming parts from a patch awaiting review. Macro-ize some splay-helper functions to help improve performance.

This change still improves performance on the test I run repeatedly, but the baseline performance I'm starting from has gotten worse lately.

Remove the bits that have been checked in separately.

Reconsider a pessimization to vm_map_findspace.

20191125 08:32:45 all (271/680): proccontrol.sh
panic: caller failed to provide space 0x400000 at address 0x800a56000
cpuid = 4
time = 1574667167
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe01b74cc6c0
vpanic() at vpanic+0x17e/frame 0xfffffe01b74cc720
panic() at panic+0x43/frame 0xfffffe01b74cc780
vm_map_alignspace() at vm_map_alignspace+0x9b/frame 0xfffffe01b74cc7f0
vm_map_find() at vm_map_find+0x697/frame 0xfffffe01b74cc8b0
vm_map_find_min() at vm_map_find_min+0x85/frame 0xfffffe01b74cc930
vm_mmap_object() at vm_mmap_object+0x395/frame 0xfffffe01b74cc9c0
kern_mmap() at kern_mmap+0x56b/frame 0xfffffe01b74ccaa0
sys_mmap() at sys_mmap+0x2a/frame 0xfffffe01b74ccac0
amd64_syscall() at amd64_syscall+0x2f1/frame 0xfffffe01b74ccbf0
fast_syscall_common() at fast_syscall_common+0x101/frame 0xfffffe01b74ccbf0
--- syscall (477, FreeBSD ELF64, sys_mmap), rip = 0x800ec3f6a, rsp = 0x7fffff8ea238, rbp = 0x7fffff8ea250 ---
KDB: enter: panic
[ thread pid 65014 tid 102218 ]
Stopped at      kdb_enter+0x37: movq    $0,0x1082cc6(%rip)
db>

https://people.freebsd.org/~pho/stress/log/dougm065.txt

Update to account for the fact that several parts originally in this patch have now been checked in. They were the parts that let me show a little time improvement; what's left here, which does little besides thread the tree, will probably give back some of that time saved. I'll update the timing and cache-miss results in a while.

kib added inline comments.
sys/vm/vm_map.c
960–984

Do you need 'successor' in the line above?

960–984

I suggest moving this line to the block comment before vm_map_entry_pred.

sys/vm/vm_map.h
100

Perhaps note that the tree of left/right entries is threaded.

This revision is now accepted and ready to land. Nov 29 2019, 1:13 PM

I ran tests for 5 hours on D21964.65043.diff. No problems seen.

dougm edited the summary of this revision. (Show Details)
dougm marked 3 inline comments as done.
dougm edited the summary of this revision. (Show Details)

Apply reviewer suggestions. Modify SPLAY_{LEFT,RIGHT}_STEP to address performance issues. Update timing results. More than half the time lost is recovered; the rest is likely due to the cost of the successor and predecessor operations.

This revision now requires review to proceed. Dec 2 2019, 7:33 AM

Ran a brief 5 hour test with D21964.65115.diff. No problems seen.

How are you collecting performance counters in your benchmark? Would it also be possible to collect branch miss and CPI statistics?

This revision is now accepted and ready to land. Dec 4 2019, 5:02 PM

How are you collecting performance counters in your benchmark? Would it also be possible to collect branch miss and CPI statistics?

For the test results I'm about to post, I used these commands:
pmcstat -w 1000 -s dc-misses -s ic-misses -s instructions -s branch-mispredicts -s cycles ./mapper3 11291960
pmcstat -w 1000 -s dc-misses -s ic-misses -s instructions -s branch-mispredicts -s cycles ./mapper4 11291960

I ran on the unmodified sources (as of the morning of Dec 3), those sources with this patch applied, and a version that includes the threaded tree but leaves in the next and prev pointers and uses them for iteration.

s/00/dc-misses  s/00/ic-misses s/00/instructions s/00/branch-mispredicts  s/00/cycles       time           CPI

Task 1, orig

635519051        67721814       68382803870   468743162        2600303896078       16.876120      38.025
636536109        69809487       68401937673   453516153        2655959449291       16.802909      38.828
691361609        68951071       68383119815   463482775        2712282343866       17.4492        39.663
666508572        68515415       68382913205   451125552        2767994771160       16.820196      40.477
716278743        69617072       68383235435   466392978        2824587360969       17.84485       41.305

Task 1, new

628516597       147270465       73499105244   531213455        2752418717461       17.766098      37.448
626182086       151898116       73497980192   506313213        2810797929426       17.623627      38.243
638290092       147966516       73502455117   503078531        2869155300523       17.619068      39.034
631932612       147742434       73497908517   524439846        2927815059993       17.710244      39.835
644331949       151143575       73497869554   505177139        2986251036358       17.641289      40.630

Task 1, thread-only

658661472        44540493       72499929753   488659720        1515242848652       17.196746      20.899
658458049        45146760       72498130543   495786935        1572466946473       17.274845      21.689
675764524        44447842       72498528808   506969869        1629931574682       17.349269      22.482
671799221        46618107       72498539211   489143361        1687126708318       17.267944      23.271
665154367        44392165       72498279897   497206675        1744367326692       17.280230      24.060

Task 2, orig

663355532        68017435       66501311280   390183163        2878027149454       16.133601      43.277
676582913        67675264       66501495705   396514188        2931818142853       16.239972      44.086
659315497        67832578       66501129083   389748369        2985182147798       16.110968      44.889
640289017        67842352       66504359262   399214467        3038807590479       16.188441      45.693
638772186        67962526       66504190793   391813704        3092040438585       16.71078       46.493

Task 2, new

594153471       145740371       71745769197   483110082        3042662332568       17.30522       42.408
601037993       148140939       71745554567   475589902        3099101588621       17.39746       43.195
594646812       146308356       71745457292   478335942        3155455129566       17.13924       43.981
591362880       147579084       71745365481   457985264        3211686914936       16.975958      44.765
592606198       147452281       71766157792   462896094        3267659510843       16.898582      45.532

Task 2, thread-only

626360704        44059265       70744414310   448963225        1799155462267       16.533142      25.431
626954656        45099120       70743950490   446033217        1853765142275       16.487174      26.203
627779064        44980375       70744410626   436379734        1908287273160       16.460812      26.974
637162805        49819489       70762142662   431767188        1962823547944       16.463707      27.738
627576399        44202677       70744702816   442987868        2017512323454       16.510599      28.518

How are you collecting performance counters in your benchmark? Would it also be possible to collect branch miss and CPI statistics?

For the test results I'm about to post, I used these commands:

Thanks. To be clear, "old" is unmodified sources, "new" is the patch but with next and prev pointers retained, and "new, thread-only" is this patch?

The results are a bit surprising: the main difference seems to be in the number of icache misses, but I can't really see why that would be. The CPI numbers also seem quite large. What CPU are you using?

Thanks. To be clear, "old" is unmodified sources, "new" is the patch but with next and prev pointers retained, and "new, thread-only" is this patch?

"new" is this patch. "thread-only" is this patch with next and prev pointers maintained.

The results are a bit surprising: the main difference seems to be in the number of icache misses, but I can't really see why that would be. The CPI numbers also seem quite large. What CPU are you using?

CPU: AMD Phenom(tm) II X6 1100T Processor (3311.19-MHz K8-class CPU)

Update base to r355439. Stop passing dereferenced pointers to SPLAY_{LEFT,RIGHT}_STEP macros, since it seems that the compiler can't avoid extra operations for possible aliasing.
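
A contrived sketch of the aliasing hazard (not the actual macro bodies): when a body stores through one pointer and loads through another of a compatible type, the compiler must assume they alias and reload from memory; passing the already-loaded value removes that obligation.

struct node { struct node *left, *right; };

/* The stores through 'out' may alias '*rootp', so '*rootp' must be
 * reloaded from memory before the second field access. */
void
step_by_pointer(struct node **rootp, struct node **out)
{
	out[0] = (*rootp)->left;
	out[1] = (*rootp)->right;
}

/* With the value passed directly, both fields can be read through a
 * register, with no reload. */
void
step_by_value(struct node *root, struct node **out)
{
	out[0] = root->left;
	out[1] = root->right;
}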

The latest test results suggest that nearly all of the time still lost is on account of the succ and pred operations.

s/00/dc-misses  s/00/ic-misses s/00/instructions s/00/branch-mispredicts  s/00/cycles       time

Task 1, unmodified

658593144        69343796       68388225521   474814412        1233619048749   	   17.13142
807193510        68400515       68384209825   482428205        1291629762497   	   17.514024
802032779        67912356       68383687888   471229631        1349147103709   	   17.365306
804451231        69639909       68397250313   482549247        1407035194863   	   17.477138
662658374        69223080       68383345398   471052266        1463209217281   	   16.958047

Task 1, thread-only

654375118        27358753       70733022690   491222223         747489333494       17.14201
654434723        29620107       70730358858   498873819         804153524018       17.106413
644890479        28077044       70729934592   489880644         860324421957       16.959090
646084189        27557405       70729978065   490069463         916837797181       17.60424
650418209        28003448       70730083647   493246855         973122041263       16.991522

Task 1, full patch

608519970        94743659       71864790005   526787718         475494284965   	   17.356757
609813018        94949933       71835036519   529107728         533067692122   	   17.380035
659675274       104843474       71829196396   524013772         590612485587   	   17.372077
611792142        93893760       71827984527   522196164         648085702310   	   17.351980
612015537        94041945       71828694413   526801132         705621083401   	   17.370631

Task 2, unmodified

626660296        68790028       66501552517   421926812        1516977687969   	   16.232848
632883567        68715555       66501508412   422720599        1570782054590   	   16.244070
633027355        70044547       66508297900   427302947        1624809519871   	   16.311303
631529719        68445223       66501644507   425281880        1678658246283   	   16.256124
626087791        68878058       66501501226   418632710        1732278665708   	   16.188098

Task 2, thread-only

623723537        27265616       68988735140   444489624        1027237466264       16.327150
618349057        27537080       68995018461   437053821        1081228270185       16.301293
617151557        27213237       68988205710   446091056        1135294724990       16.322393
617566947        27449109       68988389840   443327300        1189362114550       16.323801
618450959        27121330       68988618184   442113416        1243229107690       16.263455

Task 2, full patch

578851041        94483219       70090375944   454536158         760503497081   	   16.562908
586381359        96028490       70090727854   450857920         815331087371   	   16.551824
583430156        93853904       70090355428   451143118         870094379760   	   16.533528
579209762        94021842       70090731632   462017750         925062804711   	   16.595416
582200665        95707622       70090568068   465023341         980162330662   	   16.635007

This revision now requires review to proceed. Dec 6 2019, 9:19 AM

My personal opinion on this: I would not bother with these small 'regressions' (so to speak) in the microbenchmarks. The reduction of struct vm_map_entry by 16 bytes alone is worth it, regardless of these almost meaningless numbers.

In D21964#496706, @kib wrote:

My personal opinion on this: I would not bother with these small 'regressions' (so to speak) in the microbenchmarks. The reduction of struct vm_map_entry by 16 bytes alone is worth it, regardless of these almost meaningless numbers.

I think I agree. BTW, there are 4 unused pad bytes following the wired_count field on LP64 architectures.

This revision was not accepted when it landed; it landed in state Needs Review. Dec 7 2019, 5:14 PM
This revision was automatically updated to reflect the committed changes.

This change broke devel/libgtop on powerpc64 (I didn't try amd64; it could also break there). The error is:

procmap.c:282:39: error: no member named 'next' in 'struct vm_map_entry'
        first = vmspace.vm_map.header.next;
                ~~~~~~~~~~~~~~~~~~~~~ ^
procmap.c:285:56: error: no member named 'next' in 'struct vm_map_entry'
                        (gulong) vmspace.vm_map.header.next,
                                 ~~~~~~~~~~~~~~~~~~~~~ ^
procmap.c:308:56: error: no member named 'next' in 'struct vm_map_entry'
                                        (gulong) entry.next,
                                                 ~~~~~ ^
procmap.c:386:24: error: no member named 'next' in 'struct vm_map_entry'
        } while (entry.next != first);
                 ~~~~~ ^

A similar problem arose in lib/libprocstat/libprocstat.c and was
addressed in the most recent change to that file.  I'll look at this
problem tonight, and expect to find that a similar solution will suffice.

Doug

This change broke devel/libgtop on powerpc64 (I didn't try amd64; it could also break there). The error is:

procmap.c:282:39: error: no member named 'next' in 'struct vm_map_entry'
        first = vmspace.vm_map.header.next;

I have good news, and bad news. The good news is that Neel Chauhan made a change to libgtop two days ago that allows procmap.c to compile on FreeBSD. The bad news is that the change is incorrect, in a way that will make it miss some map entries.

The change that Neel made replaces ".next" with ".right" everywhere in procmap.c. That does not get every entry, because 'right' is not just a new name for 'next'.

The same problem arose in FreeBSD in lib/libprocstat/libprocstat.c. To address it, I defined something called vm_map_entry_read_succ to get from one map entry to the next via a reader like kvm_read. I'll paste some bits here.

static int
procstat_vm_map_reader(void *token, vm_map_entry_t addr, vm_map_entry_t dest)
{
	kvm_t *kd;

	kd = (kvm_t *)token;
	return (kvm_read_all(kd, (unsigned long)addr, dest, sizeof(*dest)));
}

...

vmentry = vmspace.vm_map.header;
for (entryp = vm_map_entry_read_succ(kd, &vmentry, procstat_vm_map_reader);
    entryp != NULL && entryp != &kp->ki_vmspace->vm_map.header;
    entryp = vm_map_entry_read_succ(kd, &vmentry, procstat_vm_map_reader)) {

vm_map_entry_read_succ is an inline function that will get you from one entry to the next. I believe it can be used to fix procmap.c.
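
For instance, the walk in procmap.c might be restructured along these lines (a sketch only: hdr_kaddr is a hypothetical variable holding the kernel address of the map header, and server->machine->kd is assumed to be the kvm handle, as in the later snippet):

struct vm_map_entry vmentry;
vm_map_entry_t entryp;

vmentry = vmspace.vm_map.header;
while ((entryp = vm_map_entry_read_succ(server->machine->kd, &vmentry,
    procstat_vm_map_reader)) != NULL && entryp != hdr_kaddr) {
	/* vmentry now holds a copy of the entry whose kernel
	 * address is entryp; process it here. */
}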

@dougm Thanks for letting me know. I am working on an updated libgtop patch and will post it here (and at GNOME).

In D21964#500356, @neel_neelc.org wrote:

I sent an updated patch to the GNOME people here: https://gitlab.gnome.org/GNOME/libgtop/merge_requests/12

I also have an fixed Bugzilla patch here: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=242533

In the patch at gnome.org, I don't see where entryn gets initialized. Perhaps you intend this:
struct vm_map_entry entryn;

entryn = vmspace.vm_map.header;
entryp = vm_map_entry_read_succ(server->machine->kd, &entryn, procstat_vm_map_reader);

since otherwise, I don't know what struct vm_map_entry is being read into.

I integrated your changes. Would the updated Ports/GNOME changes work?

In D21964#500392, @neel_neelc.org wrote:

I integrated your changes. Would the updated Ports/GNOME changes work?

I'm not finding a good way to view your changes. I've taken a shot at a solution, for your consideration. It is attached.

@dougm Your patch surely does look cleaner than mine (I just needed something to unbreak the build), but are you sure it will work on 12.x and 11.x? If so, let's go with yours.

I built my patch to (hopefully) be compatible with 12.x and 11.x; that's why it looks ugly, with #ifdefs. I don't know if 12.x and below have vm_map_entry_read_succ().

In D21964#500456, @neel_neelc.org wrote:

@dougm Your patch surely does look cleaner than mine (I just needed something to unbreak the build), but are you sure it will work on 12.x and 11.x? If so, let's go with yours.

I built my patch to (hopefully) be compatible with 12.x and 11.x; that's why it looks ugly, with #ifdefs. I don't know if 12.x and below have vm_map_entry_read_succ().

The only thing I'm using that wasn't in this code already is 'nentries', but that has been around a long time.

12.x and below do not have vm_map_entry_read_succ(), which is why I defined one in the attached patch for versions before 13.

Sounds good, let's go with your patch then.

I closed my merge request on the GNOME GitLab.

In D21964#500501, @neel_neelc.org wrote:

Sounds good, let's go with your patch then.

I closed my merge request on the GNOME GitLab.

I assume that you'll offer my patch to GNOME GitLab, then. I have no associations there.

I'll offer your patch to GNOME's GitLab, but will give you credit as well.

@dougm I uploaded your patch to GNOME GitLab here: https://gitlab.gnome.org/GNOME/libgtop/merge_requests/13

I gave you credit as it's your patch. I don't want to steal your work and call it mine.