Page MenuHomeFreeBSD

pctrie: Introduce batch pctrie insertion routines
Needs ReviewPublic

Authored by bnovkov on Oct 10 2024, 9:39 PM.
Tags
None
Referenced Files
Unknown Object (File)
Sat, Jan 18, 5:58 AM
Unknown Object (File)
Fri, Jan 17, 2:30 PM
Unknown Object (File)
Wed, Jan 8, 3:56 AM
Unknown Object (File)
Mon, Jan 6, 9:29 AM
Unknown Object (File)
Thu, Jan 2, 8:38 PM
Unknown Object (File)
Wed, Jan 1, 11:47 PM
Unknown Object (File)
Mon, Dec 30, 4:30 AM
Unknown Object (File)
Mon, Dec 30, 4:16 AM
Subscribers

Details

Reviewers
markj
alc
kib
dougm
Summary

This change introduces routines optimized for inserting multiple
items into the pctrie. The main goal is to provide a more performant
and cache-friendly way of inserting multiple items into the pctrie.

pctrie_batch_insert uses a special, iterator-based lookup function
to find or construct a 0-level pctrie node. The target node is then
filled up with as many items as possible and the process is repeated
until all items are inserted. The routine will stop inserting if
it was not able to allocate a pctrie node.

Test Plan

Aside from the performance evaluation in D47051, I've also evaluated this change by comparing the average number of cycles it takes to insert N items into the pctrie.
The cycle count for each batch size was averaged across 100 runs.

no. itemsforloop + iter_insert cycle countpctrie_batch_insert cycle countpercentage of cycles saved
1103106-2.83
229019251.04
425919036.32
837919495.36
16622205203.42
321258422198.10
642420678256.93
12846551216282.81
25692382224315.38
512184274547305.26
1024371139049310.13
20487488818477305.30
409615037337571300.24
819230158475992296.86
16384608159154314294.10

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Skipped
Unit
Tests Skipped

Event Timeline

This is more complex than initializing and iterator and using it to insert consecutive elements. How have you measured its performance and cache-friendliness in comparison to that simpler implementation?

sys/kern/subr_pctrie.c
725

What happens if there's already a value stored at node? What should happen?

726

It seems you can set *out to parent here and drop all the other assignments to *out. Can't the caller invoke _get_parent to get this value?

729

Is a special case necessary here? This function doesn't modify the tree, except in this case. Isn't this another special case for ALLOC_INSERT?

791

How do you know that there isn't already something stored here? What should you do in that case?

825

I think that your array of elements are constrained to have consecutive indices, but I don't see comments about that. And since you're reading index values from the data, it seems that if those indices aren't consecutive, this is going to wreck the tree. So perhaps this function should take an index argument for the first array element and do the assignment of consecutive index values to the array elements so that the tree cannot be wrecked.

This is more complex than initializing and iterator and using it to insert consecutive elements. How have you measured its performance and cache-friendliness in comparison to that simpler implementation?

The whole patch series was evaluated, but I haven't measured the performance gain for this change only.
I've updated the test plan with the results of a simple microbenchmark comparing iter_insert and pctrie_batch_insert.

Also, thank you for pointing out the consecutive indices constraint, this indeed is the case and I completely forgot to note that in the comments. I'll address this along with your other comments later today.

Address @dougm 's comments:

  • Removed a redundant pctrie_batch_lookup_result enum
  • pctrie_batch_insert now takes a starting key instead of relying on items having contiguous keys pre-filled
bnovkov added inline comments.
sys/kern/subr_pctrie.c
725

The test consumer of this interface performs an explicit lookup prior to insertion so this slipped by me.

My first reaction would be to check before insertions and panic if there's already a value stored.
This would slow things down a bit, but it's consistent with how we've been doing things so far.
We could also perform an explicit lookup before inserting anything and trim the insertion range accordingly

I'm not sure which approach would make more sense though.
The same goes for the other two similar comments out below.

729

Right, the case is redundant, I've removed it and handled the single-item insertion case in pctrie_batch_insert.

825

Thanks for pointing this out, this indeed is the case.
I've modified the function according to your suggestion, it'll now take a starting index and populate the elements instead of relying on them being properly prefilled.

sys/kern/subr_pctrie.c
727

If parent == NULL, and you've already assigned *out = parent, you don't need to assign *out = NULL.

732

Don't need the 'else'. Don't need so many braces.

759

You don't need to assert node != NULL. The system already checks that assertion for you.

766

"- n" doesn't contribute much here.

It seems that nitems and to_insert don't overlap. Why not replace to_insert with nitems?

767

It seems that you are error-handling by assertion. If I try to enter elements 100-199 in the pctrie and element 143 is there already, this assertion will fail. But if I've built the kernel GENERIC-NODEBUG, this assertion is disabled and I will lose element 143. So, is the user required to check that the least element >= 100 is > 199 to make sure they're all missing before entering 100-199 all at once?

I've wondered if this thing you're doing is for mass-insertion at an arbitrary point in the life of the pctrie, or is instead for initialization of the pctrie only. If this were an initializer only, these pesky questions I'm asking would just disappear.

787

I wonder if this needs to be locked.

I haven't figured out all your code yet, but I'm wondering if you're modifying the original pctrie a few pieces at a time, and using PCTRIE_LOCKED ever 16 writes or so, when you could do something else. Suppose you designed this to build an entire little pctrie of values 100-199, or whatever, and then attach it all at once to the big pctrie. The building of the little pctrie could be completely unsynchronized. Only when you inserted the little pctrie into the big pctrie would the last store have to be synchronized. So I expect that would improve performance.

If this is just an initializer, most of above still applies; you don't need PCTRIE_LOCKED until the pctrie is ready.

bnovkov marked an inline comment as done.

Address @dougm 's comments:

  • Removed redundant assignment and pruned code
  • Moved checks for existing elements outside of assertions
bnovkov added inline comments.
sys/kern/subr_pctrie.c
766

Ah, right, that was a bit silly - thanks for catching this!

767

So, is the user required to check that the least element >= 100 is > 199 to make sure they're all missing before entering 100-199 all at once?

I think its best if we performed the checks when inserting, the popmap is now always checked during batched insertion.

I've wondered if this thing you're doing is for mass-insertion at an arbitrary point in the life of the pctrie, or is instead for initialization of the pctrie only. If this were an initializer only, these pesky questions I'm asking would just disappear.

The former is true, this is meant to be used at any point of the pctrie's lifecycle.

787

I wonder if this needs to be locked.

I haven't figured out all your code yet, but I'm wondering if you're modifying the original pctrie a few pieces at a time, and using PCTRIE_LOCKED ever 16 writes or so, when you could do something else. Suppose you designed this to build an entire little pctrie of values 100-199, or whatever, and then attach it all at once to the big pctrie. The building of the little pctrie could be completely unsynchronized. Only when you inserted the little pctrie into the big pctrie would the last store have to be synchronized. So I expect that would improve performance.

I had considered this approach initially but ultimately settled for another approach since this idea seemed even more complex and trickier to get right. Would it make sense to do a similar thing with the current patch - i.e. synchronize only at the very last write?

sys/kern/subr_pctrie.c
540

A function so small, used only once, is best avoided.

717

nitems is not used in this function.

Is an enum type and this function really necessary, since it's only called once? Can't this just be logic in pctrie_batch_insert?

724

Reorder the tests so that you don't need so many nesting levels.

766

How about a for loop?

771

Instead of modifying slot with every iteration, how about slot + n here?

806

*val always has the same value assigned to it. Might as well do it one place instead of 3.

807

Some expressions below use 'index' and some use 'start + total'. Pick one.

816

I don't see the point of this assignment.

bnovkov marked 2 inline comments as done.

Address @dougm 's comments.

bnovkov added inline comments.
sys/kern/subr_pctrie.c
717

It could, but I deliberately split the lookup part out to keep pctrie_batch_insert smaller and easier to read.

If you're inserting batch [a,b,...,y.z], and if you began by inserting a and z with normal iter-insert, then your BATCH_SPLIT case would never happen and some of the complexity of batch_insert could be reduced. True or false?

sys/kern/subr_pctrie.c
725

I'm going to keep picking at this until this function and this enum goes away, but for now, can you write it to have only 2 if statements and 3 return statements?

801
	val = pctrie_iter_lookup_ge(it, start);
	if (val != NULL && *val <= start + nitems - 1)
		panic("%s: key '%lu' already present", __func__, *val);

Now you don't need any other panic tests in the code.

814

I observe that the rest of the code in this file does not wrap single lines after if, like this one, in braces. Don't adopt a unique style for this function.

851

Correct me if I'm wrong. This happens only when inserting the first item. If, before the loop begins, you check to see if (start + 1) % PCTRIE_COUNT == 0 and, if so, insert the first item with iter_insert and increase total to 1, then this case doesn't have to be checked here in the loop.

913
	} while (total < nitems - 1);

Now, there are three places where you can stop checking for nitems - total == 1 in the code, and instead, here you can just test for total != nitems and do an iter_insert in that case.

bnovkov marked an inline comment as done.

Address @dougm 's comments:

  • Removed separate lookup function and enum type
  • Explicit range lookup before insertion
  • Moved pctrie_iter_insert to subr_pctrie.c so it can be used for single-element insertions

If you're inserting batch [a,b,...,y.z], and if you began by inserting a and z with normal iter-insert, then your BATCH_SPLIT case would never happen and some of the complexity of batch_insert could be reduced. True or false?

I think you're right, values less than a and greater than z could be occupying their slots in an internal node, and there should be no other nodes between those if the pctrie_iter_lookup_ge check goes through.
However, removing the 'SPLIT' case and performing an out-of-order insertion of a and z leaves us with a tricky issue - what if any of the subsequent insertions fail?
That is, how would we signal the fact that we managed to insert only, e.g., a, b, and z?

sys/kern/subr_pctrie.c
725

After addressing a couple of your other remarks the lookup function slimmed down to a couple of lines, so I moved everything to pctrie_batch_insert.
The enum is still there since rewriting switch-case code would be cumbersome, but it's now part of the batch insert routine.

851

That is indeed the case, thanks for catching this!

sys/sys/pctrie.h
245

When I merged vm_radix.c with subr_pctrie.c so that there would be only one implementation of pctries, it was accepted, in part, because I made sure to avoid passing pointers to memory allocators to any functions. Otherwise, we'd still have a separate vm_radix implementation. So casually changing that by turning this function into one that passes a function pointer to a memory allocator is not acceptable.

bnovkov added inline comments.
sys/sys/pctrie.h
245

Right, I'll revert this part but I'll still need to access this from subr_pctrie.c somehow.
Would having a separate static version of pctrie_iter_insert in subr_pctrie.c be okay? Bouncing back and forth between pctrie.h and subr_pctrie.c to allocate nodes is something I want to avoid, if possible.

sys/sys/pctrie.h
245

Bouncing is bad.

The number of pctrie_nodes you have to allocate is bounded by something like nitems/(PCTRIE_COUNT-1) + 1. If you figure the quantity precisely, you can pre-allocate all the nodes at the top level as pass an array of pointers to them to the batch_insert function.

Address @dougm 's comments:

  • reverted pctrie_iter_insert change
  • removed allocfn from pctrie_batch_insert by passing an array of preallocated pctrie nodes
bnovkov added inline comments.
sys/sys/pctrie.h
245

I've converted the batch_insert function to follow this approach, although this might slow down cases where nitems == 1 since node allocation is pretty much guaranteed now.
We can avoid that by doing a regular pctrie_iter_insert, but I'm not sure whether this should be handled in _PCTRIE_INSERT_BATCH or in the pctrie consumers (e.g., vm_radix).

sys/sys/pctrie.h
197

I don't think this is enough nodes. If the pctrie is empty, start is 0 and nitems is 768, then you allocate 50 nodes, but you need 48 at the bottom level and 3 at the level above that, and one at the root, so you have too few. I suggested previously nitems / (PCTRIE_COUNT - 1) + 1 as a starting point. Dividing by PCTRIE_COUNT-1 was not a typo; it's how you account for nodes at all levels in an empty trie. 768/15 == 52 works in this case. I was hoping that you would calculate the exact number needed, which also depends on whether the first and/or last leaf will be added to an already allocated leaf. I didn't expect an approach of allocating too many and then freeing the ones you didn't turn out to need. You should write a function that uses the iterator, start and nitems to calculate exactly how many allocations you need. I could write one if you need me to.

sys/sys/pctrie.h
197

I was hoping that you would calculate the exact number needed, which also depends on whether the first and/or last leaf will be added to an already allocated leaf. I didn't expect an approach of allocating too many and then freeing the ones you didn't turn out to need. You should write a function that uses the iterator, start and nitems to calculate exactly how many allocations you need. I could write one if you need me to.

You're right, this expression underestimates the number of nodes. However, I spent some time analyzing the issue and I believe I found a way to calculate the exact number of nodes needed. I'm currently testing it out, so I'll put it up for review in a day or two if everything goes smoothly. I appreciate the offer though!

sys/sys/pctrie.h
197

@alc has expressed to me his belief that computing the number of nodes to allocate in advance leads to a too-complex solution. At the same time, he doesn't want to see an allocation function pointer passed to a function defined in subr_pctrie.c. What he suggests, instead, is that an inline function in pctrie.h, something like PCTRIE_INSERT_BASE, does the allocation and passes the allocated node along to a function - like pctrie_addnode_batch. But pctrie_batch_insert, or something like it, would reside in pctrie.h.

Refactor patch to avoid precalculating the number of nodes or passing allocfn to subr_pctrie.c:

  • the old pctrie_batch_insert logic was moved into pctrie.h
  • lookup logic was split out into a separate lookup function
  • pctrie.h will now use said lookup function to allocate nodes and pass them to pctrie_batch_insert using a "lookup result" structure
sys/sys/pctrie.h
197

@alc has expressed to me his belief that computing the number of nodes to allocate in advance leads to a too-complex solution.

I share the sentiment, especially after working on D47744.
I don't think we can thoroughly test such an approach, and debugging any eventual crashes caused by it would be nearly impossible.

What he suggests, instead, is that an inline function in pctrie.h, something like PCTRIE_INSERT_BASE, does the allocation and passes the allocated node along to a function - like pctrie_addnode_batch. But pctrie_batch_insert, or something like it, would reside in pctrie.h.

Thank you for the suggestion, I've used it to hoist most of the logic and the node allocation into pctrie.h.

sys/kern/subr_pctrie.c
701

pctrie_insert_lookup inserts, unless memory allocation is required; then, it just does a lookup. pctrie_batch_insert_lookup never inserts anything. Define pctrie_addnodes(node, index, arr, nitems, offs) and call it from here in the ALLOC_INSERT case, so that pctrie_batch_insert isn't called. When pctrie_batch_insert is called after allocation, it can also use pctrie_addnodes().

707

There's no need to store parent in a struct field. It's easy to get it from it->path and it->top.

713

If node has value 1 and index 2, you don't split. But node might be the root of the (1-items) tree, and you have to split.

bnovkov added inline comments.
sys/kern/subr_pctrie.c
713

While this technically is splitting, I treat this as a special case of ALLOC_INSERT since the leaf stored at the target slot will end up in the new clev 0 node.
Treating this as SPLIT will lead to one extra allocation, and I'm not sure that it's possible to detect this specific case (i.e., parent node clev == 0 after pctrie_node_init) when allocating nodes in pctrie.h.

This is a challenging patch to review, and every time I start to review it, I want to rewrite it to make it clearer. I'll hold off one more time. But I really don't think that enums or switches with fallthroughs really help much.

I've not been convinced that its complexity has a performance benefit. If this change is holding back other changes you want to commit, I suggest that you offer a simpler implementation that just iterates and inserts one at a time. Once you have that committed, and other code relies on it, then we'll be in a position to evaluate a more complex version like the one you're offering here. Right now, I expect that the benefits of this version are less synchronization and fewer bit-flips. That may be significant, but I'm not able to test the performance impact without all those other changes in place.

Finally, I notice that style(9) was updated recently regarding the use of predict_true and predict_false. You propose to use them a lot, and style(9) requires that their usage should meet certain standards. If you think you're usage meets those standards, you need to make that case.

sys/kern/subr_pctrie.c
787

Why even have a switch statement, since there are only two real cases?

790

How could control reach this line? If the new_parent allocation failed. So with 0 returned, total has 0 added to it, and we loop again, and allocfn fails again, so we have an infinite loop? So maybe the failed alloc of new_parent should be handled differently.

818

if PCTRIE_BATCH_ALLOC_INSERT, 1 is returned, but no node was added.

854

How would control reach this point?

sys/sys/pctrie.h
359

Why is this test after the PCTRIE_BATCH_ALLOC_INSERT case and not before it?

This is a challenging patch to review, and every time I start to review it, I want to rewrite it to make it clearer. I'll hold off one more time. But I really don't think that enums or switches with fallthroughs really help much.

I agree that it's a bit verbose and I really appreciate your feedback on this patch.
I'd really like to have some flavor of batch insertion in the tree, but I'm not hell-bent on landing the current iteration of the patch.
I'll suspend work on this until your memq removal patch lands as that'll make performance evaluation and development easier.
In the meantime, I'd appreciate if you could share any tips on how you'd like this patch to look like.

We could also consider alternative approaches. @markj recently pointed out to me that we could cache the last clev 0 node in the iterator. Other use cases aside, an approach like this would simplify batch insertion, turning it into a loop-based insert + pctrie_addnode_batch (which is where I'd expect most of the performance gain to come from).