Page MenuHomeFreeBSD

radix_trie: simplify ge, le lookups
ClosedPublic

Authored by dougm on Jul 8 2023, 2:40 AM.
Tags
None
Referenced Files
Unknown Object (File)
Fri, Jan 24, 1:59 PM
Unknown Object (File)
Sat, Jan 18, 12:51 AM
Unknown Object (File)
Fri, Jan 17, 10:42 PM
Unknown Object (File)
Fri, Jan 17, 1:48 PM
Unknown Object (File)
Wed, Jan 15, 10:41 AM
Unknown Object (File)
Wed, Jan 15, 2:50 AM
Unknown Object (File)
Tue, Jan 14, 12:57 PM
Unknown Object (File)
Sun, Jan 12, 3:15 PM
Subscribers

Details

Summary

Replace the implementations of lookup_le and lookup_ge with ones that do not use a stack or climb back up the tree.

Test Plan

On amd64, these changes reduce the size of the compiled lookup_ge and lookup_le routines by 112 bytes each.

A buildworld test suggests that this change speeds up lookup_le calls by 21.6%, and slows lookup_ge calls by 4.4%.

original timing:
52450.199u 1452.606s 59:09.38 1518.6% 73478+3095k 121089+33120io 110807pf+0w
vm.radix.le_cycles: 74538053872
vm.radix.le_calls: 341220107
le_cycles/call: 218.44566701340375
vm.radix.ge_cycles: 2167673543
vm.radix.ge_calls: 3540959
ge_cycles/call: 612.1713194081038

modified timing:
52599.911u 1454.881s 59:17.88 1519.2% 73479+3096k 121017+34297io 110758pf+0w
vm.radix.le_cycles: 58436422956
vm.radix.le_calls: 341219862
le_cycles/call: 171.25738992298167
vm.radix.ge_cycles: 2264687629
vm.radix.ge_calls: 3544083
ge_cycles/call: 639.0052459268026

(/ 171.25738992298167 218.44566701340375)
0.7839816292280742
(/ 639.0052459268026 612.1713194081038)
1.0438340145445624

Peter, can you test this?

Diff Detail

Lint
Lint Skipped
Unit
Tests Skipped

Event Timeline

dougm requested review of this revision.Jul 8 2023, 2:40 AM
dougm created this revision.
dougm edited the test plan for this revision. (Show Details)

Discard the effort to include a lock-free option to simplify the code. Avoid counting leading or trailing zeroes within the initial search loop. Use builtin_ctz and builtin_clz to avoid zero-checks I know to be unnecessary.

Apply it all to subr_pctrie too.

dougm added a subscriber: pho.

I ran tests for 16 hours with D40936.id124419.diff on

[pho@mercat1 /usr/src/sys/amd64/compile/PHO]$ uname -a
FreeBSD mercat1.netperf.freebsd.org 14.0-CURRENT FreeBSD 14.0-CURRENT #0 main-n264144-1cd9788408aa9-dirty: Thu Jul 13 21:09:42 CEST 2023     pho@mercat1.netperf.freebsd.org:/usr/src/sys/amd64/compile/PHO amd64
[pho@mercat1 /usr/src/sys/amd64/compile/PHO]$

No issues observed.

Changed comments a variable names, but nothing affecting behavior.

sys/vm/vm_radix.c
587–590
590
627

Can you add a block comment above the while that explains what this loop does? In particular, explaining what pred is.

sys/vm/vm_radix.c
661
sys/vm/vm_radix.c
651

Add a comment above the if to the effect that no exact match was found.

sys/vm/vm_radix.c
577

I would add parentheses here for readability even if they aren't technically necessary.

dougm marked 6 inline comments as done.
sys/vm/vm_radix.c
587–590

I don't see the need for the words "stored at a position".

641

"node" should be "page", and I don't understand the second use of "max" in this comment.

647

"nodes" should be "pages".

653

"nodes" should be "pages".

656

"nodes" should be "pages".

662–663

This is not proper style for comments.

dougm marked 6 inline comments as done.
sys/vm/vm_radix.c
650

Shouldn't "past" be "before".

dougm marked an inline comment as done.
sys/vm/vm_radix.c
612–614

"Descend the trie as if performing an ordinary lookup for the page with the specified pindex. However, unlike an ordinary lookup, as we descend the trie, we use "pred" to remember the last branching-off point, that is, the interior node under which the page with the greatest pindex that is both outside our current path down the trie and less than the specified pindex resides. (The node's popmap makes it fast and easy to recognize a branching-off point.) If our ordinary lookup fails to yield a page with a pindex that is less than or equal to the specified pindex, then we will exit this loop and perform a lookup starting from "pred". If "pred" is not NULL, then that lookup is guaranteed to succeed.

dougm added a reviewer: markj.

More comments. Enough comments? What does Mark think? Does he understand it?

More comments. Enough comments? What does Mark think? Does he understand it?

I find the new version quite a bit easier to understand. I haven't read it exhaustively. I do think Alan's suggested description, "Descend the trie as if performing an ordinary lookup..." would be nice to have, since otherwise no comment directly explains the idea behind the implementation.

original timing:
52450.199u 1452.606s 59:09.38 1518.6% 73478+3095k 121089+33120io 110807pf+0w
modified timing:
52599.911u 1454.881s 59:17.88 1519.2% 73479+3096k 121017+34297io 110758pf+0w

Do you have any idea why the timings are so different here? Is the difference consistent across multiple runs?

Drop in 4 versions of Alan's paragraph comment.

sys/vm/vm_radix.c
623

You could replace one of the copies (per file) with "See vm_radix_lookup_ge() for an explanation of the algorithm."

Reduce a couple of paragraph-comments.

sys/vm/vm_radix.c
627–628

I think that you can drop all of the comments within this function, aside from the mirroring comment. @markj Do you agree?

original timing:
52450.199u 1452.606s 59:09.38 1518.6% 73478+3095k 121089+33120io 110807pf+0w
modified timing:
52599.911u 1454.881s 59:17.88 1519.2% 73479+3096k 121017+34297io 110758pf+0w

Do you have any idea why the timings are so different here? Is the difference consistent across multiple runs?

I ran 3 more buildworld tests on the modified and unchanged kernels immediately after reboots, with no instrumentation
added to count performance stats for any particular function.
The results:

Modified:
root@108-254-203-203:/usr/src # time make -j16 buildworld > & /dev/null
52357.285u 1448.444s 59:11.33 1515.0% 73477+3095k 120995+33438io 110815pf+0w
root@108-254-203-203:/usr/src # time make -j16 buildworld > & /dev/null
52553.118u 1456.257s 59:19.67 1517.2% 73472+3096k 121217+33609io 110821pf+0w
root@108-254-203-203:/usr/src # time make -j16 buildworld > & /dev/null
52547.476u 1452.889s 59:17.54 1517.9% 73470+3095k 121178+33245io 110870pf+0w

Original:
root@108-254-203-203:/usr/src # time make -j16 buildworld > & /dev/null
52602.552u 1452.130s 59:20.49 1518.1% 73477+3096k 121068+33157io 110855pf+0w
root@108-254-203-203:/usr/src # time make -j16 buildworld > & /dev/null
52656.392u 1451.635s 59:27.48 1516.7% 73471+3096k 121118+33354io 110845pf+0w
root@108-254-203-203:/usr/src # time make -j16 buildworld > & /dev/null
52601.014u 1445.216s 59:23.55 1516.6% 73478+3095k 121121+33611io 110844pf+0w

sys/vm/vm_radix.c
627–628

I think that's reasonable, yes.

original timing:
52450.199u 1452.606s 59:09.38 1518.6% 73478+3095k 121089+33120io 110807pf+0w
modified timing:
52599.911u 1454.881s 59:17.88 1519.2% 73479+3096k 121017+34297io 110758pf+0w

Do you have any idea why the timings are so different here? Is the difference consistent across multiple runs?

I ran 3 more buildworld tests on the modified and unchanged kernels immediately after reboots, with no instrumentation
added to count performance stats for any particular function.
The results:

Modified:
root@108-254-203-203:/usr/src # time make -j16 buildworld > & /dev/null
52357.285u 1448.444s 59:11.33 1515.0% 73477+3095k 120995+33438io 110815pf+0w
root@108-254-203-203:/usr/src # time make -j16 buildworld > & /dev/null
52553.118u 1456.257s 59:19.67 1517.2% 73472+3096k 121217+33609io 110821pf+0w
root@108-254-203-203:/usr/src # time make -j16 buildworld > & /dev/null
52547.476u 1452.889s 59:17.54 1517.9% 73470+3095k 121178+33245io 110870pf+0w

Original:
root@108-254-203-203:/usr/src # time make -j16 buildworld > & /dev/null
52602.552u 1452.130s 59:20.49 1518.1% 73477+3096k 121068+33157io 110855pf+0w
root@108-254-203-203:/usr/src # time make -j16 buildworld > & /dev/null
52656.392u 1451.635s 59:27.48 1516.7% 73471+3096k 121118+33354io 110845pf+0w
root@108-254-203-203:/usr/src # time make -j16 buildworld > & /dev/null
52601.014u 1445.216s 59:23.55 1516.6% 73478+3095k 121121+33611io 110844pf+0w

In my experience, up to 10 seconds variance is normal between "buildworld" runs. These results are for -NODEBUG kernels, yes?

This revision is now accepted and ready to land.Jul 18 2023, 5:33 PM
In D40936#935104, @alc wrote:

In my experience, up to 10 seconds variance is normal between "buildworld" runs. These results are for -NODEBUG kernels, yes?

Yes. GENERIC-NODEBUG.

This revision was automatically updated to reflect the committed changes.