Replace the implementations of lookup_le and lookup_ge with ones that do not use a stack or climb back up the tree.
Details
Reviewers: alc, markj
Commits: rG6f251ef228e6: radix_trie: simplify ge, le lookups
On amd64, these changes reduce the size of the compiled lookup_ge and lookup_le routines by 112 bytes each.
A buildworld test suggests that this change speeds up lookup_le calls by 21.6%, and slows lookup_ge calls by 4.4%.
original timing:
52450.199u 1452.606s 59:09.38 1518.6% 73478+3095k 121089+33120io 110807pf+0w
vm.radix.le_cycles: 74538053872
vm.radix.le_calls: 341220107
le_cycles/call: 218.44566701340375
vm.radix.ge_cycles: 2167673543
vm.radix.ge_calls: 3540959
ge_cycles/call: 612.1713194081038
modified timing:
52599.911u 1454.881s 59:17.88 1519.2% 73479+3096k 121017+34297io 110758pf+0w
vm.radix.le_cycles: 58436422956
vm.radix.le_calls: 341219862
le_cycles/call: 171.25738992298167
vm.radix.ge_cycles: 2264687629
vm.radix.ge_calls: 3544083
ge_cycles/call: 639.0052459268026
le cycles/call ratio: 171.257 / 218.446 = 0.784, i.e., a 21.6% reduction
ge cycles/call ratio: 639.005 / 612.171 = 1.044, i.e., a 4.4% increase
Peter, can you test this?
Diff Detail
Repository: rG FreeBSD src repository
Event Timeline
Discard the effort to include a lock-free option, to simplify the code. Avoid counting leading or trailing zeroes within the initial search loop. Use __builtin_ctz() and __builtin_clz() to avoid zero-checks I know to be unnecessary.
Apply it all to subr_pctrie too.
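To illustrate the zero-check point: because the masked popmap is already known to be nonzero at these call sites, a bit-scan builtin can be used directly. The sketch below is hypothetical (the slot_ge/slot_le helpers and popmap_t type are illustrative names, not the vm_radix/pctrie code); only the mask-then-scan pattern is the point.

```c
/*
 * Hypothetical helpers, not the actual vm_radix/pctrie code: select the
 * nearest populated child slot on either side of 'slot'.  Callers are
 * assumed to have already checked that the masked popmap is nonzero, so
 * __builtin_ctz()/__builtin_clz() are never passed zero.
 */
#include <assert.h>

typedef unsigned int popmap_t;	/* one bit per child slot; 16 slots used */

/* Least populated slot >= 'slot' (the lookup_ge direction). */
static int
slot_ge(popmap_t popmap, int slot)
{
	popmap &= -1U << slot;		/* keep slots >= slot */
	assert(popmap != 0);		/* caller's responsibility */
	return (__builtin_ctz(popmap));
}

/* Greatest populated slot <= 'slot' (the lookup_le direction). */
static int
slot_le(popmap_t popmap, int slot)
{
	popmap &= (2U << slot) - 1;	/* keep slots <= slot */
	assert(popmap != 0);		/* caller's responsibility */
	return (31 - __builtin_clz(popmap));
}
```

Field and helper names in the actual sources differ; masking before scanning is what removes the need for a separate zero check.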
I ran tests for 16 hours with D40936.id124419.diff on:

[pho@mercat1 /usr/src/sys/amd64/compile/PHO]$ uname -a
FreeBSD mercat1.netperf.freebsd.org 14.0-CURRENT FreeBSD 14.0-CURRENT #0 main-n264144-1cd9788408aa9-dirty: Thu Jul 13 21:09:42 CEST 2023 pho@mercat1.netperf.freebsd.org:/usr/src/sys/amd64/compile/PHO amd64
No issues observed.
sys/vm/vm_radix.c
641: Add a comment above the if to the effect that no exact match was found.
sys/vm/vm_radix.c
567: I would add parentheses here for readability even if they aren't technically necessary.
sys/vm/vm_radix.c
605–608: I don't see the need for the words "stored at a position".
631: "node" should be "page", and I don't understand the second use of "max" in this comment.
637: "nodes" should be "pages".
643: "nodes" should be "pages".
646: "nodes" should be "pages".
652–653: This is not proper style for comments.
sys/vm/vm_radix.c
640: Shouldn't "past" be "before"?
sys/vm/vm_radix.c
618–620: "Descend the trie as if performing an ordinary lookup for the page with the specified pindex. However, unlike an ordinary lookup, as we descend the trie, we use "pred" to remember the last branching-off point, that is, the interior node under which the page with the greatest pindex that is both outside our current path down the trie and less than the specified pindex resides. (The node's popmap makes it fast and easy to recognize a branching-off point.) If our ordinary lookup fails to yield a page with a pindex that is less than or equal to the specified pindex, then we will exit this loop and perform a lookup starting from "pred". If "pred" is not NULL, then that lookup is guaranteed to succeed."
I find the new version quite a bit easier to understand. I haven't read it exhaustively. I do think Alan's suggested description, "Descend the trie as if performing an ordinary lookup..." would be nice to have, since otherwise no comment directly explains the idea behind the implementation.
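To make the approach concrete, here is a hedged, self-contained sketch of a stackless lookup_ge in the spirit of that description (the ge direction, mirroring the "pred" discussion above with a "succ" pointer). The struct tnode layout, the tslot/tmatches/tmin helpers, and all names are hypothetical simplifications, not the vm_radix or pctrie structures; only the shape of the algorithm is meant to match: an ordinary descent that records the deepest branching-off point via the popmap, followed by a minimum-leaf walk if the descent fails.

```c
/*
 * Hypothetical sketch of a stackless lookup_ge on a path-compressed radix
 * trie.  Simplified structures for illustration only.
 */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define	WIDTH	4			/* key bits consumed per level */
#define	SLOTS	(1 << WIDTH)

struct tnode {
	bool		leaf;		/* leaf: key is the stored key */
	uint64_t	key;		/* leaf key, or the node's prefix */
	int		shift;		/* bit position of this node's slots */
	uint16_t	popmap;		/* which child slots are non-NULL */
	struct tnode	*child[SLOTS];
};

/* Child slot of 'key' within interior node 'n'. */
static int
tslot(const struct tnode *n, uint64_t key)
{
	return ((key >> n->shift) & (SLOTS - 1));
}

/* Does 'key' share the prefix covered by interior node 'n'? */
static bool
tmatches(const struct tnode *n, uint64_t key)
{
	int hi = n->shift + WIDTH;

	return (hi >= 64 || (key >> hi) == (n->key >> hi));
}

/* Least leaf under 'n': repeatedly take the lowest populated slot. */
static struct tnode *
tmin(struct tnode *n)
{
	while (!n->leaf)
		n = n->child[__builtin_ctz(n->popmap)];
	return (n);
}

/*
 * Return the leaf with the least key >= 'key', or NULL.  No stack and no
 * climbing back up: 'succ' remembers the deepest branching-off point seen
 * during an ordinary descent, i.e. the subtree holding the least key that
 * is greater than 'key' but lies off the current path.
 */
static struct tnode *
lookup_ge(struct tnode *root, uint64_t key)
{
	struct tnode *n, *succ;
	unsigned int pop;
	int slot;

	succ = NULL;
	n = root;
	while (n != NULL && !n->leaf) {
		if (!tmatches(n, key)) {
			/* Every key under n lies on one side of 'key'. */
			if (n->key > key)
				return (tmin(n));
			break;
		}
		slot = tslot(n, key);
		/* Populated slots to the right mark a branching-off point. */
		pop = n->popmap >> (slot + 1);
		if (pop != 0)
			succ = n->child[slot + 1 + __builtin_ctz(pop)];
		n = n->child[slot];
	}
	if (n != NULL && n->leaf && n->key >= key)
		return (n);
	/* The ordinary lookup failed; resume at the branching-off point. */
	return (succ != NULL ? tmin(succ) : NULL);
}
```

lookup_le would be the mirror image: remember the nearest populated slot to the left as "pred", and finish with a maximum-leaf walk that follows the highest set popmap bit (__builtin_clz) instead of the lowest.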
> original timing:
> 52450.199u 1452.606s 59:09.38 1518.6% 73478+3095k 121089+33120io 110807pf+0w
> modified timing:
> 52599.911u 1454.881s 59:17.88 1519.2% 73479+3096k 121017+34297io 110758pf+0w

Do you have any idea why the timings are so different here? Is the difference consistent across multiple runs?
sys/vm/vm_radix.c
629: You could replace one of the copies (per file) with "See vm_radix_lookup_ge() for an explanation of the algorithm."
> Do you have any idea why the timings are so different here? Is the difference consistent across multiple runs?
I ran 3 more buildworld tests on the modified and unchanged kernels immediately after reboots, with no instrumentation
added to count performance stats for any particular function.
The results:
Modified:
root@108-254-203-203:/usr/src # time make -j16 buildworld > & /dev/null
52357.285u 1448.444s 59:11.33 1515.0% 73477+3095k 120995+33438io 110815pf+0w
root@108-254-203-203:/usr/src # time make -j16 buildworld > & /dev/null
52553.118u 1456.257s 59:19.67 1517.2% 73472+3096k 121217+33609io 110821pf+0w
root@108-254-203-203:/usr/src # time make -j16 buildworld > & /dev/null
52547.476u 1452.889s 59:17.54 1517.9% 73470+3095k 121178+33245io 110870pf+0w
Original:
root@108-254-203-203:/usr/src # time make -j16 buildworld > & /dev/null
52602.552u 1452.130s 59:20.49 1518.1% 73477+3096k 121068+33157io 110855pf+0w
root@108-254-203-203:/usr/src # time make -j16 buildworld > & /dev/null
52656.392u 1451.635s 59:27.48 1516.7% 73471+3096k 121118+33354io 110845pf+0w
root@108-254-203-203:/usr/src # time make -j16 buildworld > & /dev/null
52601.014u 1445.216s 59:23.55 1516.6% 73478+3095k 121121+33611io 110844pf+0w
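For a rough comparison, averaging the wall-clock times above gives: modified ≈ (3551.33 + 3559.67 + 3557.54) / 3 ≈ 3556.2 s (59:16.2); original ≈ (3560.49 + 3567.48 + 3563.55) / 3 ≈ 3563.8 s (59:23.8). The modified runs average roughly 7.7 seconds (about 0.2%) less wall-clock time.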
sys/vm/vm_radix.c
626–627: I think that's reasonable, yes.
In my experience, up to 10 seconds variance is normal between "buildworld" runs. These results are for -NODEBUG kernels, yes?