Page MenuHomeFreeBSD

Introduce the ARB tree(3) macros
ClosedPublic

Authored by trasz on May 20 2019, 6:37 PM.
Tags
None
Referenced Files
Unknown Object (File)
Thu, Oct 23, 12:21 PM
Unknown Object (File)
Mon, Oct 20, 9:44 PM
Unknown Object (File)
Sun, Oct 19, 2:48 AM
Unknown Object (File)
Fri, Oct 17, 11:15 AM
Unknown Object (File)
Fri, Oct 17, 1:55 AM
Unknown Object (File)
Wed, Oct 15, 2:45 AM
Unknown Object (File)
Wed, Oct 15, 1:55 AM
Unknown Object (File)
Sun, Oct 12, 11:44 PM

Details

Summary

Introduce the Array-based Red-Black Tree macros (ARB): similar to
the traditional tree(3) RB trees, but using an array (preallocated,
linear chunk of memory) to store the tree.

This avoids allocation overhead, improves memory locality,
and makes it trivially easy to share/transfer/copy the entire tree
without the need for marshalling. The downside is that the size
is fixed at initialization time; there is no mechanism to resize
it.

This is one of the dependencies for the new stats(3) framework[1].

  1. https://reviews.freebsd.org/D20477

Sponsored By: Klara Inc, Netflix
Obtained from: Netflix

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

At the 2019 Waterloo Hackathon @markj offered to try to take a look

Make it easier to apply the patch. Note to self: tinderboxed.

bcr added a subscriber: bcr.

OK from manpages. Don't forget to bump the .Dd when committing.

This revision is now accepted and ready to land.May 31 2019, 7:18 AM

What is the benefit of this new data structure over existing data structures? Do you have empirical results?

How did you select an array-based RB tree over more classic designs like heaps, radix tree, or B-tree? (We have a 64-bit path-compressed radix trie in base already: sys/pctrie.)

It seems like tree storage is basically external, negating the benefit of an embedded entry data structure like existing queues and trees (RB included)? This is basically just an RB tree with layout contiguity. If we're allowed external or pre-allocated storage, we can probably do better (and less complicated) than this construction.

In D20324#443226, @cem wrote:

What is the benefit of this new data structure over existing data structures? Do you have empirical results?

How did you select an array-based RB tree over more classic designs like heaps, radix tree, or B-tree? (We have a 64-bit path-compressed radix trie in base already: sys/pctrie.)

It seems like tree storage is basically external, negating the benefit of an embedded entry data structure like existing queues and trees (RB included)? This is basically just an RB tree with layout contiguity. If we're allowed external or pre-allocated storage, we can probably do better (and less complicated) than this construction.

I'm interested in hearing more about the motivation too.

My assumption (based on the fact that index widths are configurable) was that this useful when there is some preexisting reason that all of your structures are in a contiguous array. In that case I don't see why this algorithm is any worse than the others you listed. Radix trees and B-trees would be less space-efficient, and plain heaps don't give you an efficient way to look up a given key.

I'm interested in hearing more about the motivation too.

Yeah, without some context we can only speculate.

My assumption (based on the fact that index widths are configurable) was that this useful when there is some preexisting reason that all of your structures are in a contiguous array.

So, doesn't this preexisting guarantee also imply that resizes never happen or are extremely infrequent? This isn't really a dynamically growing data structure? In which case, all of the balancing, giant templated code stuff is pretty heavy-weight compared to, say, qsort() and bsearch(). (That might not prove true for something like a fixed-size cache, for example. Need more information on that target use. Does a fixed-size cache really need ordered lookup, though?)

My other thought is: if you can rely on this property (fixed or mostly fixed number of contiguous elements), you could make better use of the large contiguous memory assumption and cache locality during lookup by having two separate contiguous regions; one for data irrelevent to cmp(), and one for ARB_ENTRY() + key data (anything relevent to cmp). The current memory layout wastes cache on values of any key we look at during the search process; we'd rather only cache values we actually need.

Anyway, without more detail, I suspect either the problem-space is either not actually constrained enough to make this scheme viable, or it is so constrained that it could use something better; ideally without adding ~758 lines of CPP-templated code in a common system header. (If you want a templated data structure, just use C++.)

In that case I don't see why this algorithm is any worse than the others you listed. Radix trees and B-trees would be less space-efficient, and plain heaps don't give you an efficient way to look up a given key.

Radix trees have much faster lookup, if keys are integers. You can avoid an uninlined function call for every comparison searching the tree. Our pctrie in particular has better cache locality/branching factor than binary trees like this. (No idea if the intended use of ARB is integer keys, but the example test code uses ints.)

Neither radix nor btrees are meaningfully less space-efficient, in any big-O sense, AFAIK? B-trees have the advantage of locality of keys, infrequent balancing, and little pointer indirection (over ARB), and are usable with ordered maps (over radix trees or hash tables). If ordering is not required at all, a hash table would be a much better fit than this construction.

In D20324#443418, @cem wrote:

So, doesn't this preexisting guarantee also imply that resizes never happen or are extremely infrequent? This isn't really a dynamically growing data structure? In which case, all of the balancing, giant templated code stuff is pretty heavy-weight compared to, say, qsort() and bsearch(). (That might not prove true for something like a fixed-size cache, for example. Need more information on that target use. Does a fixed-size cache really need ordered lookup, though?)

I think I understand this a bit better than I did yesterday. The idea is that you allocate your array of objects, and initialize the ARB with the array size, so there's an upper limit on the number of entries in the tree. Then you use ARB_GETFREE() to find an array entry that isn't already part of the tree, initialize it, and insert it. So ARB also implements a simple allocator with no extra space required, which you don't get with qsort() and bsearch().

As far as I can see there's no support for resizing the array, but with this implementation it should be fairly easy to grow the array since doing that would not invalidate any ARB state except maxnodes.

My other thought is: if you can rely on this property (fixed or mostly fixed number of contiguous elements), you could make better use of the large contiguous memory assumption and cache locality during lookup by having two separate contiguous regions; one for data irrelevent to cmp(), and one for ARB_ENTRY() + key data (anything relevent to cmp). The current memory layout wastes cache on values of any key we look at during the search process; we'd rather only cache values we actually need.

Yeah. IMO this design really only makes a lot of sense if you're trying hard to optimize for memory usage. Having a single contiguous array with its own allocator does have the property that it will likely be promoted to a superpage if it is large enough. If the distribution of lookups is uniform, maybe you might care as much or more about minimizing TLB utilization as d-cache utilization?

Anyway, without more detail, I suspect either the problem-space is either not actually constrained enough to make this scheme viable, or it is so constrained that it could use something better; ideally without adding ~758 lines of CPP-templated code in a common system header. (If you want a templated data structure, just use C++.)

In that case I don't see why this algorithm is any worse than the others you listed. Radix trees and B-trees would be less space-efficient, and plain heaps don't give you an efficient way to look up a given key.

Radix trees have much faster lookup, if keys are integers. You can avoid an uninlined function call for every comparison searching the tree. Our pctrie in particular has better cache locality/branching factor than binary trees like this. (No idea if the intended use of ARB is integer keys, but the example test code uses ints.)

It's worth pointing out that you can't really define your own comparator at all if you're going to use a radix tree.

Neither radix nor btrees are meaningfully less space-efficient, in any big-O sense, AFAIK? B-trees have the advantage of locality of keys, infrequent balancing, and little pointer indirection (over ARB), and are usable with ordered maps (over radix trees or hash tables). If ordering is not required at all, a hash table would be a much better fit than this construction.

Not in a big-O sense, but they certainly require more memory per key if you do all of your allocations up front, especially if you are using C pointers, and depending on the maximum number of keys. With a B-tree, in the worst case I think half of the space allocated for nodes could be unused; you could have a tree of degree d with all nodes having d children despite having space for 2d.

tests/sys/sys/arb_test.c
44 ↗(On Diff #58001)

Why is it global?

51 ↗(On Diff #58001)

Style is wrong.

70 ↗(On Diff #58001)

Why 128? The use of calloc() doesn't quite make sense since we're allocating both the ARB head and the nodes. I'd expect the allocation to look like:

root = malloc(sizeof(*root) + ITER * sizeof(struct node));

I think I understand this a bit better than I did yesterday. The idea is that you allocate your array of objects, and initialize the ARB with the array size, so there's an upper limit on the number of entries in the tree.

Ok, so it's a bounded ordered map of some kind; what's the policy on insert when full?

So ARB also implements a simple allocator with no extra space required, which you don't get with qsort() and bsearch().

I don't follow; a qsort+bsearch implementation of an ordered map wouldn't require any more space overhead than ARB does here. A LINEAR_MAP_HEAD would exist, like ARB_HEAD(), but would only need to know curnodes and maxnodes. Allocation is always from the curnodes'th index. And qsort (insert/delete/update) and bsearch (lookup) would both take nmemb parameters of curnodes. So I don't think this particular criticism has merit, unless I'm missing something. Maybe the comparison is that there isn't a nice LINEAR_MAP_{HEAD,INSERT,FIND,...} API wrapper around the concept? Sure.

As far as I can see there's no support for resizing the array, but with this implementation it should be fairly easy to grow the array since doing that would not invalidate any ARB state except maxnodes.

Sure, but potentially O(n) operation due to realloc copy. Plain sorted linear array has the same resize cost. (Both can amortize out to what, O(log N), with the right scaling factor? But the worst case is still quite bad.) Real trees typically have lower worst case insert penalty.

Yeah. IMO this design really only makes a lot of sense if you're trying hard to optimize for memory usage. Having a single contiguous array with its own allocator does have the property that it will likely be promoted to a superpage if it is large enough. If the distribution of lookups is uniform, maybe you might care as much or more about minimizing TLB utilization as d-cache utilization?

You could allocate both arrays in a single contiguous hunk for TLB optimization and keep the d-cache benefit.

It's worth pointing out that you can't really define your own comparator at all if you're going to use a radix tree.

Yes, that's the distinction I'm trying to draw. O(log N) lookup is the best you're going to do for lookup in datastructures with arbitrary comparators. If you can guarantee keys are integers, you can do better than a general-purpose ordered map. Need more context on use to determine what's appropriate.

Not in a big-O sense, but they certainly require more memory per key if you do all of your allocations up front, especially if you are using C pointers, and depending on the maximum number of keys. With a B-tree, in the worst case I think half of the space allocated for nodes could be unused; you could have a tree of degree d with all nodes having d children despite having space for 2d.

Up front, sure. If the need is small enough that that matters, I suspect sorted linear array wins out even with frequent inserts/deletes.

More memory per key? Not necessarily. You could store a leaf's values in a unsorted contiguous array and use small integer indexes as references embedded in the leaf, somewhat similar to this model. With a branching factor of d you would have 1/d C pointers and 1 index overhead per key; comparatively, ARB_ENTRY requires 3 indices and a color per key. sizeof(void*)/d ≤ 3 just indicates d must be ≥ 3 on LP64 and amusingly, ≥ 2 on ILP32, to have lower per-key memory overhead for full leaves. Double that to account for half-full leaves and the minimum branching factor is still smaller than you'd want to use anyway. Memory use by non-leaf nodes of a B-tree is de minimis, even with full C pointers.

But anyway, I still want to hear more about this specific use case, because this is definitely an unconventional data structure and it's unclear what the goals are.

tests/sys/sys/arb_test.c
70 ↗(On Diff #58001)

Given the math and how large ARB code is already, why not expose a macro to calculate the correct size for the type of ARB and some number of nodes?

I'll elaborate more soon, but basically, this is in the context of https://reviews.freebsd.org/D20477. In short: this is less about using it "when you need a tree", and more about having a tree that you can easily transfer between domains, in this case between the kernel and userspace.

I'll elaborate more soon, but basically, this is in the context of https://reviews.freebsd.org/D20477. In short: this is less about using it "when you need a tree", and more about having a tree that you can easily transfer between domains, in this case between the kernel and userspace.

That review raised more questions than it answered :-(.

tests/sys/sys/arb_test.c
44 ↗(On Diff #58001)

To avoid spurious differences with rb_test.c and splay_test.c, in the same directory, both of which came from OpenBSD.

51 ↗(On Diff #58001)

See above. On the other hand I'm not sure if we need to keep in sync so much?

70 ↗(On Diff #58001)

D'oh, you're both right. Now, there already is a macro for roughly that in the stats(3) code (ie a layer above), called TDGST_NCTRS2VSDSZ(), but I like the idea of moving it into ARB. How to name it? ARB_ALLOCSIZE(roottype, maxnodes, elementtype) perhaps?

Explain what the point of ARB is at the beginning of DESCRIPTION.

This revision now requires review to proceed.Jun 13 2019, 8:36 PM

So... is there anything else left to do here?

cem requested changes to this revision.Jun 26 2019, 6:15 AM

So... is there anything else left to do here?

A design justification is still missing. Pointing to another proposed patch with fundamentally the same unanswered design questions as a potential consumer isn't in itself a good justification to take this.

This revision now requires changes to proceed.Jun 26 2019, 6:15 AM
In D20324#449149, @cem wrote:

So... is there anything else left to do here?

A design justification is still missing. Pointing to another proposed patch with fundamentally the same unanswered design questions as a potential consumer isn't in itself a good justification to take this.

But I've added the justification (explanation of what's the point of ARB, from the consumer point of view) to the man page already. What else is needed?

But I've added the justification (explanation of what's the point of ARB, from the consumer point of view) to the man page already. What else is needed?

Are you referring to the single sentence, "[ARB trees] are useful, e.g., when the tree needs to be [transferred] between userspace and kernel", or am I missing the justification? If I am missing the complete justification, please point me at the part of the text that justifies the design. I am looking for considerably better justification than that single sentence.

trasz edited the summary of this revision. (Show Details)
In D20324#450030, @cem wrote:

But I've added the justification (explanation of what's the point of ARB, from the consumer point of view) to the man page already. What else is needed?

Are you referring to the single sentence, "[ARB trees] are useful, e.g., when the tree needs to be [transferred] between userspace and kernel", or am I missing the justification? If I am missing the complete justification, please point me at the part of the text that justifies the design. I am looking for considerably better justification than that single sentence.

Okay, so after backing away for a moment and rereading the whole conversation, and with additional feedback from Lawrence, who wrote this code in the first place, I think I can answer it a bit better.

First - red-black trees were chosen instead of eg radix trees because the original t-digest algorithm used AVL trees, and red-black trees are close enough.

Second - the main difference between ARB and RB is obviously the fact that the former are contiguous. This follows naturally from the design of stats(3) itself. However, based on feedback from the author, my understanding of priorities here was a bit off, and you were right - my assumption when reading this code was that the primary reason for writing ARB was the ease of transferring it to userspace and avoiding the allocation overhead, with compactness being secondary; according to Lawrence, the memory bandwidth and cache locality was more important, with the ease of transfer being just a "handy side-effect".

Okay, so after backing away for a moment and rereading the whole conversation, and with additional feedback from Lawrence, who wrote this code in the first place, I think I can answer it a bit better.

Thanks for the follow-up!

First - red-black trees were chosen … because the original t-digest algorithm used AVL trees, and red-black trees are close enough.

So any old "map" will do? Does it need to be ordered / support range queries, or are point queries plus arbitrary order traversal sufficient?

Second - the main difference between ARB and RB is obviously the fact that the former are contiguous. This follows naturally from the design of stats(3) itself. However, based on feedback from the author, my understanding of priorities here was a bit off, and you were right - my assumption when reading this code was that the primary reason for writing ARB was the ease of transferring it to userspace and avoiding the allocation overhead, with compactness being secondary; according to Lawrence, the memory bandwidth and cache locality was more important, with the ease of transfer being just a "handy side-effect".

So ARB trees are kind of a hack to make RB trees have better cache locality. But we already have better non-intrusive map-type data structures with arguably better cache locality; the ARB modification introduces similar limitations on use (vs standard RB trees) as any of these other kinds. So I really don't see much benefit to adding a weird one-off data structure here.

Move stuff to <sys/arb.h>

(Note to self: tinderboxed.)

I have no objection to this with a separate header and man page. It does seem convenient to have a map structure that can be copied with memcpy() or copyout() or so.

I'm still skeptical of the need to memcpy out a tree structure (are we doing queries on both sides? one side necessarily being stale?). This has never really been addressed?

I agree that the separate header is less objectionable than putting it in sys/tree.h, though.

This revision was not accepted when it landed; it landed in state Needs Review.Sep 14 2019, 7:23 PM
This revision was automatically updated to reflect the committed changes.
In D20324#471860, @cem wrote:

I'm still skeptical of the need to memcpy out a tree structure (are we doing queries on both sides? one side necessarily being stale?). This has never really been addressed?

I think the digest code in stats(3) shows this pretty well: it's not like you're modifying the tree on both sides, rather, it's that you have a data which is best expressed as a tree, and you need to return it to userspace, so it can be queried further.

I agree that the separate header is less objectionable than putting it in sys/tree.h, though.

Thanks!