Page MenuHomeFreeBSD

Introduce stats(3)
ClosedPublic

Authored by trasz on May 30 2019, 8:58 PM.
Tags
None
Referenced Files
Unknown Object (File)
Mon, Jan 20, 2:41 AM
Unknown Object (File)
Thu, Jan 16, 5:28 PM
Unknown Object (File)
Tue, Jan 14, 10:45 PM
Unknown Object (File)
Tue, Jan 14, 2:39 AM
Unknown Object (File)
Sun, Jan 5, 12:58 AM
Unknown Object (File)
Dec 8 2024, 1:41 PM
Unknown Object (File)
Dec 8 2024, 1:41 PM
Unknown Object (File)
Dec 8 2024, 1:41 PM

Details

Summary

Introduce stats(3), a flexible statistics gathering API.

This provides a framework to define a template describing
a set of "variables of interest" and the intended way for the
framework to maintain them (for example the maximum, sum,
t-digest, or a combination thereof). Afterwards the user
code feeds in the raw data, and the framework maintains
these variables inside a user-provided, opaque stats blobs.
The framework also provides a way to selectively extract the
stats from the blobs. The stats(3) framework can be used in
both userspace and the kernel.

See the stats(3) manual page for details.

This will be used by the upcoming TCP statistics gathering code,
https://reviews.freebsd.org/D20655.

This depends on both the ARB tree macros:
https://reviews.freebsd.org/D20324

and the Q math macros:
https://reviews.freebsd.org/D20116.

The stats(3) framework is disabled by default for now,
except in the NOTES kernel (for QA); it is expected
to be enabled in amd64 GENERIC after a cool down period.

Sponsored By: Klara Inc, Netflix
Obtained from: Netflix

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Lint Passed
Unit
No Test Coverage
Build Status
Buildable 26757
Build 25106: arc lint + arc unit

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

Still no examples, but linking to the consumer helps.

Update to the new Q_INI(3) syntax; no functional changes.

lstewart added inline comments.
share/man/man3/stats.3
716

Oops, should have been caps B for SI bytes suffix. Similarly true for the annotated memory layout further down.

share/man/man3/stats.3
763

I think the same '8b' vs '8B' or '8 bytes' change may be needed here.

Some drive by comments — I haven't had time to read the 3-4 thousand lines of code in this review fully.

I was pointed this way from the arb tree review because "the digest code in stats(3) [justifies the arb data structure]." So I'll try to find and read that bit, but it looks like that's thousands of lines of code, so... it's still not an especially compelling case for ARB.

First impressions are that this is not only huge, but dense (low on comments even just showing the relation between algorithms and the source text, tdunning's t-digest paper), and the bits related to randomness are sloppy or incorrect.

Endian definitions for voistatdata_numeric etc imply this is intended to be an on-wire or disk structure? Or that you want portions of wider types to align with non-pad parts of other types in the same union? I didn't see where such an assumption was used, but if it is, accessing a union member that isn't the same one you most recently initialized is technically undefined behavior.

sys/kern/subr_stats.c
48

This concern can be resolved with arb macros living in the highly sorted arb.h now, right?

(In general, headers should NOT include their dependencies, but as you point out, there is an inherent tension with alpha-sort inclusion. You can resolve it in a number of unsatisfying ways, one of which would be header inclusion. Other options are things like a smaller sys/_foo.h header inclusion, the ugly strategy of renaming headers to get the alphasort right, or the perhaps uglier strategy of just dumping your definitions in one of the always-sorts-first headers like cdefs.h, types.h, or params.h.)

214–290

C99 has array initializers as well (you use C99 struct initializers a few lines down); just:

{
    [VS_STYPE_VOISTATE] = "VOISTATE",
...

For ex.

300

Standard C requires at least one element to be explicitly initialized. Generally { 0 } will do.

1042–1053

This talks about "true random" a lot and then uses random(3), which is NOT a random number generator. One or the other needs to be fixed.

Additionally, the formula used to reduce the range of random(3), even if were a real RNG, introduces bias.

Given the ability to provide an explicit deterministic seed value, my guess is that the intent would be better served by replacing random() with arc4random_uniform(101);.

2931–2934

This seems like a dangerous assumption? Maybe there should just be a more explicit ARB API for this intended use?

2971

Using random(3) is almost always incorrect. In fact, I think it's always incorrect, even if you just want a fast PRNG. This is a red flag.

lib/libstats/Makefile
6–13

Is here any userspace-only code that can be extracted from the giant subr_stats file and made its own file here?

Likewise for the kernel-only portions?

Without having scanned in exhaustive detail, it seems like at least some significant portions are #ifdef kernel or vice versa, and IMO it would significantly help legibility to have more, smaller files with more explicit purpose.

This comment was removed by rpokala.
sys/kern/subr_stats.c
48

(I should clarify: in this context, I was talking about FreeBSD in particular. Headers not including their dependencies is at least in part a build optimization policy adopted by the FreeBSD project, as well as, e.g., Google, and not necessarily universal in C projects.)

jilles added inline comments.
sys/kern/subr_stats.c
48

Even within FreeBSD it is not consistent. POSIX specifies its headers to take care of their dependencies themselves, so headers used by userland applications may be inconsistent depending on whether an interface is defined by POSIX or not.

sys/kern/subr_stats.c
48

Yes, the "should" is aspirational. In practice it may be more true in the kernel than in userspace, in part due to POSIX. If a sys/ header is specified by POSIX (I don't know if this is the case), recursive inclusion could be conditional on #ifndef _KERNEL?

sys/kern/subr_stats.c
3269–3274

Discussed some offline with @trasz . As discussed offline, the sys/tree.h version of ARB added the RB_REBALANCE macro, but it was dropped when ARB was moved to its own header. That breaks this DIAGNOSTIC code. So the question was, do we fix it by adding RB_REBALANCE or removing the DIAGNOSTIC section? Separately, I took issue with the name of (A)RB_REBALANCE.

How it's used: the highlighted code (and similar below) is modifying the object's key (cnt) and then reinserting the node only if the key is newly out of order.
(A)RB_REBALANCE is an optimization that avoids invoking RB_REMOVE(); RB_INSERT(); on the modified ctd32 node unless it's wrongly ordered compared to its neighboring nodes in the tree.

It's a fine optimization, but I'd suggest changing the name. The optimization has nothing to do with overall tree balance, but instead is about correcting key ordering (i.e., maintaining the basic binary tree invariant) around a single updated-key node. Some possible names I've brainstormed are REINSERT, UPDATEKEY, and maybe REPLACEKEY. Better ideas welcome.

I'm on board with adding the same optimization macro/function to RB in sys/tree.h. It's used in this DIAGNOSTIC code for debugging, sure, but the same general optimization makes just as much sense for RB trees as ARB trees. But I'd like it to have a better name.

Finally, we discussed removing the unneeded field argument from the RB_GENERATE_REBALANCE generator. I guess it's kind of consistent with all of the other generators, but the RB_ version of the macro doesn't use it. I don't feel strongly about this.

sys/kern/subr_stats.c
3269–3274

@cem two minor nits with your comments, but otherwise all reasonable:

For posterity's sake, FYI the object's key here with respect to order in the tree is mu (updated a bit earlier at line 3261 Q_QADDQ(&ctd32->mu, x)), not cnt. mu essentially represents the "center of mass" of the histogram bucket, while count is the number of samples falling within the bucket. The digest must be ordered/traversed by ascending value of mu in this implementation of the algorithm.

I would also note that it is in fact possible that reinsertion may trigger a tree rebalance i.e. naming the macro REBALANCE wasn't strictly speaking a misnomer even though a rebalance isn't a guaranteed outcome of calling it if the centroid is still in the right spot. Having said that, I'm not opposed to any of the alternate suggestions either.

sys/kern/subr_stats.c
3269–3274

Ah, thanks @lstewart. I appreciate the correction on the key field — mea culpa :-). I assumed the tree movement would be adjacent to the key argument and have only skimmed the code; it's the next statement over. Cool.

Sure, rebalance is a possible outcome of any INSERT or REMOVE, but I don't think it makes much sense to call either of those operations REBALANCE either. IMO, this operation is in essentially the same boat — the tree balancing is a background property of RB trees in general; it isn't the logical operation we're trying to perform. So I prefer a more specific name. :-)

trasz marked 7 inline comments as done.

Fixes from cem@, except from randomness. Also adds WITH_STATS.

sys/kern/subr_stats.c
2931–2934

TBH I'd leave it as it is for now. It works, and shouldn't cause any problems until someone decides to replace ARB or modify so it no longer meets this assumption, which I'd say is unlikely.

sys/kern/subr_stats.c
47

"stats" alpha sorts after qmath and arb, so this comment probably isn't needed?

Must it come after one of stddef or stdint? If so, that might be the relevant comment. If not, "sta" sorts before "std".

222

Thanks.

257

These could also be C99 initialized, and drop the comment.

302

Seems like we could use the same self-documenting C99 initializer style here too, no? Shouldn't impact line length.

[VSD_DTYPE_VOISTATE] = {0},
[VSD_DTYPE_INT_S32] = {.int32 = {.s32 = ...}},
...
2931–2934

I'll I'm asking for here is shifting that comment and logic to the arb header so the behavior is well-specified at that API boundary.

/*
 * Reset the ARB tree. Embedded object values remain intact, and unlike
 * ARB_INIT(), this does not initialise the free list.
 */
#define	ARB_RESET_TREE(head, name, maxn)	\
	*(head) = ARB_INITIALIZER(name, maxn)
trasz marked 2 inline comments as done.

More fixes from cem@.

sys/kern/subr_stats.c
257

Sigh, no idea how did I managed to miss those.

Document ARB_RESET_TREE(3).

cem requested changes to this revision.Sep 30 2019, 5:14 PM

Thanks, the new changes look better. (To be clear, I still haven't approved the overall revision.) I still have some objections, see below.

share/man/man3/arb.3
495 ↗(On Diff #62746)

The grammar is a little awkward. It's fine in man pages to just use very short sentences:

"The ... macro discards the tree topology. It does not modify embedded object values or the .Nm free list."

(If nothing else, there is a duplicate period at the end of the current sentence.)

sys/kern/subr_stats.c
305

Thanks, I like that this way is both self-documenting and makes it impossible to accidentally get the order wrong.

1042–1053

Definitely still have strong reservations about this.

2971

This one isn't a blocker, especially if you intend to fix it later; just a design issue. That said, I'd suggest s/randomly/pseudorandomly/ in the comment.

2974

We're already seeding the srandom PRNG with a vaguely unpredictable value; might as well seed with one of the arc4random(9) family of functions instead of a timestamp.

This revision now requires changes to proceed.Sep 30 2019, 5:14 PM
In D20477#443980, @cem wrote:

So like, what's the motivation of this change? What are the goals? How was this design chosen? There's 6000 lines of diff present as a fait accompli without any kind of socialization on arch@ or anywhere else. Code is great, but the high level stuff is missing.

Comment density is "low" and copy-paste density is "high." Did you consider templating this, either with awk or C++?

"stats(3)" is a bit misleading; 3900 LoC are a new kernel file. What is the expected consumer? What is the general API surface? Where is the design doc?

Where is the stats(9) manual page? What do stats(3) consumers look like?

The above was never addressed, I think?

Will there ever be any in tree? Can this live in ports?

(Only) that part was addressed.

In D20477#472506, @cem wrote:

... low on comments even just showing the relation between algorithms and the source text, tdunning's t-digest paper ....

Endian definitions for voistatdata_numeric etc imply this is intended to be an on-wire or disk structure? Or that you want portions of wider types to align with non-pad parts of other types in the same union?

This was never answered either.

lib/libstats/Makefile
6–13

This is still an open concern

trasz marked an inline comment as done.

Fix comments to not claim random(3) is truly random (as a temporary
measure until it's replaced with arc4random(3)), and improve the man
page.

In D20477#477255, @cem wrote:
In D20477#443980, @cem wrote:

So like, what's the motivation of this change? What are the goals? How was this design chosen? There's 6000 lines of diff present as a fait accompli without any kind of socialization on arch@ or anywhere else. Code is great, but the high level stuff is missing.

Comment density is "low" and copy-paste density is "high." Did you consider templating this, either with awk or C++?

"stats(3)" is a bit misleading; 3900 LoC are a new kernel file. What is the expected consumer? What is the general API surface? Where is the design doc?

Where is the stats(9) manual page? What do stats(3) consumers look like?

The above was never addressed, I think?

The stats(3) API can be used in both kernel and userspace; think sbuf(9) or libnv(9). In this case the stats are maintained in the kernel and returned to userspace, but you could eg use stats(3) to collect statistics relevant to some userspace daemon, in userspace, and then have another piece of userspace retrieve them.

For a trivial example of a future in-tree consumer see https://reviews.freebsd.org/D21324.

Will there ever be any in tree? Can this live in ports?

(Only) that part was addressed.

In D20477#472506, @cem wrote:

... low on comments even just showing the relation between algorithms and the source text, tdunning's t-digest paper ....

Endian definitions for voistatdata_numeric etc imply this is intended to be an on-wire or disk structure? Or that you want portions of wider types to align with non-pad parts of other types in the same union?

This was never answered either.

I'm guessing it is future-proofing to make it possible to collect statblobs instead of having to dump them into text or JSON first.

sys/kern/subr_stats.c
2971

Using random(3) is almost always incorrect. In fact, I think it's always incorrect, even if you just want a fast PRNG. This is a red flag.

@cem What is the basis for this assertion?

sys/kern/subr_stats.c
2971

random(3) doesn't generate random numbers (that is: it's not a CSPRNG). So for that use, it's flat wrong. But I do not think you are trying to use it for that.

As a PRNG for simulations, random(3) isn't particularly good (statistical properties) or fast. See, e.g., http://www.pcg-random.org/ . (Yes, it's from the PCG authors, so there may be some bias there; but see also http://www.pcg-random.org/posts/other-good-options.html .)

I'll admit, as far as I know we don't have any of the better PRNGs in usable parts of base. (There is some xorshift variant buried deep in CDDL dtrace code, but that's not an API.) random(3) may be the best one readily available. From that perspective, random(3) is fine for a PRNG.

sys/kern/subr_stats.c
1045

s/seeded random number/seeded number/

trasz marked an inline comment as done.

Fix comment.

Thanks for doing this work. It looks like a great addition to FreeBSD.

[Trying to remove my "request changes" flag without marking explicit approval (i.e., don't consider it blocked on me, but also don't put me on reviewed-by.)]

I think there may still be some room to cleanly separate out some the userspace-specific code from the shared common code, but don't have cycles to do so myself and don't see anyone else doing it.

Adding a better PRNG interface is totally outside the scope of this change, but would be a nice thing to do at some point; if it happens, it would make sense to adapt this code to use it. (We can't change random(3)'s algorithm.)

I'm still confused why some of the structures are defined in endian-specific ways — it would be good if the author could answer that since guesses are just guesses. But, whatever.

This revision was not accepted when it landed; it landed in state Needs Review.Oct 7 2019, 7:05 PM
This revision was automatically updated to reflect the committed changes.
In D20477#477820, @cem wrote:

I'm still confused why some of the structures are defined in endian-specific ways — it would be good if the author could answer that since guesses are just guesses.

@cem I was thinking it would eventually be nice to facilitate storing and passing statsblobs around in binary format without having to always convert blobs to/from a potentially foreign endianness if they were produced on a machine of same endianness as the consumer. I made a few token gestures in this regard e.g. the endian hint in the statsblob header, and endian-specific layout tweaks for voistatdata_numeric where it explicitly unionises types of different length in the same embedded chunk of statsblob memory. These are by no means the exhaustive set of changes required to support such functionality though, hence the "token gesture" designation :)

Thanks Lawrence!

In D20477#477820, @cem wrote:

I'm still confused why some of the structures are defined in endian-specific ways — it would be good if the author could answer that since guesses are just guesses.

@cem I was thinking it would eventually be nice to facilitate storing and passing statsblobs around in binary format without having to always convert blobs to/from a potentially foreign endianness if they were produced on a machine of same endianness as the consumer.

Wouldn't that ("same endianness as the consumer") be accomplished with a single, non-conditional struct/union definition? (The persisted endian value is great; I love that it is explicitly defined.)

I made a few token gestures in this regard e.g. the endian hint in the statsblob header, and endian-specific layout tweaks for voistatdata_numeric where it explicitly unionises types of different length in the same embedded chunk of statsblob memory.

I guess the part I'm confused about is the endian-specific layout tweaks. I am totally for designing portable binary data structures, but I'd think endian alone, with converter routines, would be sufficient.

What does the conditional definition achieve? It looks like you're trying to collocate the memory of the various fields at the same structure offsets. I guess that'd make it slightly easier to write an in-place converter, right?

But it depends on various assumptions about alignment and packing that are not verified (and would be difficult to verify at compile time on a single machine). None of these structs are marked __packed, for example.

Explicit offset definitions and CTASSERT()s/_Static_assert()s would be necessary for a robust version of this scheme, but also facilitate just having the more complex converter routine that translated offsets as well as flipping bytes around.

These are by no means the exhaustive set of changes required to support such functionality though, hence the "token gesture" designation :)

Putting aside the conditional padding, I'm confused why what we have now is so verbose and discards the existing structure types. I'll put a line comment on the particular area with my thoughts for better context. Just food for thought, I'm not demanding anything here.

Best regards,
Conrad

head/sys/sys/stats.h
35–38 ↗(On Diff #63000)

For userspace, it would be fantastic if there was some way to register interest in some stats with an fd, hooking into the existing kevent/select/poll eventloop pattern. You could imagine something as simple as a named pipe userspace writes some simple protocol to to add/remove keys of interest, and then poll() or whatever just blocks until the fd is readable (something like what the kernel and devd use today).

You could even deliver stats blobs directly over the socket. Or use it as a more simplistic wakeup mechanism, sending a single byte (value doesn't matter).

Obviously this doesn't make much sense for firehose statistics, but for a general framework one of the problems we've had in the past is a lack of a way for userspace to monitor stat events; repeated polling has latency and power use downsides.

172–210 ↗(On Diff #63000)

I think you could push the 32-bit structures inside the padding. You could use your existing structure names, which seem to line of up with the unnamed structs, here. E.g.,

struct voistatdata_numeric {
  union {
    struct {
#if BYTE_ORDER == BIG_ENDIAN
      uint32_t pad;
#endif
      union {
        struct voistatdata_int32 int32;
        struct voistatdata_q32 q32;
#if LONG_BIT == 32
        /* Tangent, I'm not sure I see the value of 'long' type, but I'm sure you had some reason for it, right? */
        struct voistatdata_intlong intlong;
#endif
      }; /* anonymous union of 32-bit types */
#if BYTE_ORDER == LITTLE_ENDIAN
      uint32_t pad;
#endif
    }; /* anonymous struct for aligning 32-bit sub-struct offsets */
#if LONG_BIT == 64
    struct voistatdata_intlong intlong;
#endif
    struct voistatdata_int64 int64;
    struct voistatdata_q64 q64;
  };
};

(This reduces the sheer quantity of ifdef'd padding byte lines and uses the correct structure types for the inner 32-bit values, which allows you to generate correctly typed pointers without casting, for example.)

This shouldn't change the layout in a way that impacts the ABI of statsv1 (although independently, it would be great if there were compile-time tests for checking that ABI).

255–258 ↗(On Diff #63000)

(This is the sort of layout I think is more problematic from an ABI perspective. You might have implicit padding after oob or between val and cnt, and it might vary between architectures with the same endianness and sizeof(LONG). But maybe this is an in-memory only structure.)

dougm added inline comments.
head/sys/kern/subr_stats.c
2449 ↗(On Diff #63000)

Expanding ARB_INIT and ARB_MAXNODES, you see that this expands to something that includes:

(head)->arb_maxnodes = (head)->arb_maxnodes;

which is either an error, or something that deserves an explanatory comment somewhere.