Page MenuHomeFreeBSD

Make use of stats(3) in the TCP stack

Authored by trasz on Jun 15 2019, 2:48 PM.



Make use of the stats(3) framework in the TCP stack.

This makes it possible to retrieve per-connection statistical
information such as the receive window size, RTT, or goodput,
using a newly added TCP_STATS getsockopt(3) option, and extract them
using the stats_voistat_fetch(3) API.

See the net/tcprtt port for an example consumer of this API.

Compared to the existing TCP_INFO system, the main differences are that
this mechanism is easy to extend without breaking ABI, and provides
statistical information instead of raw "snapshots" of values at a given
point in time. stats(3) is more generic and can be used in both
userland and the kernel.

This work depends on:

the stats(3) framework:

The qmath(3) macros:

and the Array-base Red Black Tree implementation:

An example consumer tool for using these TCP stats is here:

Sponsored By: Klara Inc, Netflix
Obtained from: Netflix

Diff Detail

rS FreeBSD src repository
Lint OK
No Unit Test Coverage
Build Status
Buildable 27128
Build 25404: arc lint + arc unit

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
trasz edited the summary of this revision. (Show Details)Jun 15 2019, 2:50 PM
trasz edited the summary of this revision. (Show Details)Jun 15 2019, 3:47 PM
trasz updated this revision to Diff 58673.Jun 15 2019, 3:58 PM

Drop unneeded diff.

allanjude added inline comments.

should the fixup of missing TF_FASTOPEN bits be a separate commit?

thj added a subscriber: thj.Jun 19 2019, 10:18 AM
thj added inline comments.

tcpsyninfo looks to be 16 bytes to me, shouldn't this update to


trasz added inline comments.Jun 28 2019, 8:52 PM

But __tcpi_pad[] is u_int32_t, not a char?

allanjude added inline comments.Jun 29 2019, 3:21 PM

indeed, tcpi_pad[] is a u_int32_t (so 4 bytes), so dropping it from 26 to 22 accounts for the 16 bytes. Unless I am missing something.

thj added inline comments.Jul 3 2019, 8:19 AM

Ah okay, I missed that.


Is the goodput measurement time (i.e. number of ms for the measurement window) reported by stats somewhere?

Does the calculation break with an rtt <1ms?

I think a comment here might be helpful for future readers of the code, something along the lines of

report goodput in bits per [time]
trasz edited the summary of this revision. (Show Details)Jul 15 2019, 4:30 PM
trasz edited the summary of this revision. (Show Details)Jul 15 2019, 6:43 PM
trasz added inline comments.Jul 17 2019, 8:06 PM

I don't think the raw measurement time is reported - but there's the digest computed for goodput, which I'm guessing is what you'd want the times for?

Regarding rtt < 1ms - well, it's just 'capped from the bottom' with the max(), but I don't know how realistic times below 1ms are, given that the resolution of tcp_ts_getticks() seems to be 1ms?

As for the units - that's a good suggestion; let me do a little experiment to make sure I get this right and I'll add it to the VOI declarations in tcp.h.

trasz edited the summary of this revision. (Show Details)Jul 18 2019, 9:36 PM
bz added a subscriber: bz.

Adding TCP folks explicitly as reviewers.


Please don't forget to update on commit.

bz added inline comments.Aug 9 2019, 3:21 PM

The comment sounds wrong to me. One variable looks like a sysctl and with that should not need to be user land visible.
For the other, I don't know the new stats framework but I'd hope it'd use a proper "copyin/out" mechanism and not directly read kernel memory using libkvm with vnet variables.

jtl added a subscriber: jtl.Aug 9 2019, 6:48 PM
jtl added inline comments.

I agree. I don't think any of these three lines need to be visible to userland. I think they can safely move within the #ifdef _KERNEL block. I suspect that this is a little cruft which managed to sneak in during the 3+ years of development on this project.

lstewart added inline comments.

Actually, they do need to be visible to userland in the current way things work, but the comment does possibly need to be expanded to explain why (or my arguably hacky way of doing things needs to be improved).

The underlying issue here is the need for statsblob templates defined in the kernel or userland to be shared across both domains, such that a statsblob generated in one domain based on a particular template can be manipulated in the other domain. By shared, I don't mean in a direct memory access sense, but the statsblob template needs to exist or be discoverable in both domains (or in future, perhaps across machines).

The only direction that matters in the current code being upstreamed is for templates defined in kernel, and then statsblobs based on a kernel-defined template being exported to userspace, and manipulated in userspace e.g. for rendering to JSON for logging purposes.

The "right" solution would be to have an API for exporting/importing statsblob templates. I punted on doing it the "right" way and cheated.

Regarding the TCP-related stats code, tcp_stats_init() in tcp_stats.c defines and registers the "TCP_DEFAULT" statsblob template together with its associated variables of interest. tcp_stats.c is compiled into the kernel, and tcp_stats_init() is called from tcp_init() so that the kernel statsblob template registry contains the "TCP_DEFAULT" template for use by the TCP stack.

kern/subr_stats.c is both compiled into the kernel and into a userland library (similar to the way sbuf(9) is built for both domains), and in order to have the stats templates match between the kernel and userland, I compile tcp_stats.c into the userland library too with tcp_stats_init() tagged as a library constructor so that the "TCP_DEFAULT" template is defined the same as what's in the kernel, and is registered on userland library load. As a consequence, tcp_stats.c has a bit of #ifdef goo to make compilation in userland work correctly, and these tcp_var.h bits need to be visible to make the userland library compile.

This scheme does mean that the kernel and userland can get out of sync if a change is made to the template but only one of the kernel or userland library are recompiled against the new code. In such a situation, statsblobs generated e.g. in kernel, would have a template hash id which differs from the templates known in userland, and the userland code would fail any statsblob manipulation operations that required the template to be known (return an errno).

In practice, this scheme works well and has not caused problems in the Netflix use case. It was always the plan to eventually implement the API for template export/import and clean this up when that work is done, but I don't think the API is a requirement for upstreaming.

bz added inline comments.Aug 12 2019, 7:05 AM

Can you point me at the userland code which "reads" these things? Given one is a sysctl already I'd think it'd be easy to adjust that. I am only wondering if the code was originally written with VIMAGE in mind?

lstewart added inline comments.Aug 12 2019, 7:54 AM

@bz Oops, sorry for being unclear. The userland library does not read or interact with any of these tcp_var.h variables in kernel. If you look in tcp_stats.c you'll see where this all comes together.

tcp_stats_init() is executed in both kernel, and separately in the userland library as a library constructor. V_tcp_perconn_stats_dflt_tpl is used in tcp_stats_init() and therefore needs to be resolvable for compilation of stats code in both kernel and userland. In the userland library it just resolves to a library-scope global integer - no relation/access to the kernel variable of same name.

V_tcp_perconn_stats_enable does not strictly speaking need to be visible to userland as it's only referenced from kernel code, but it doesn't hurt having it here and keeping these bits together (though I would also not object to it moving elsewhere).

allanjude edited the summary of this revision. (Show Details)Aug 13 2019, 6:05 PM
jtl added inline comments.Aug 19 2019, 1:39 PM

FWIW, I think this dual-use (especially of the V_tcp_perconn_stats_dflt_tpl variable) makes things unclear. I think it would be nice to eliminate it, if possible, without adding code duplication. However, I understand that may not be possible.

I *suspect* you may be able to make the constructor be a static function in the user-land library. This would avoid the need to expose it via headers.

Also, as best I can tell, V_tcp_perconn_stats_dflt_tpl is not used in the userland library outside of tcp_stats_init. Is that true? If so, it might be possible to do a trivial rewrite of this function to use a local variable, and only update V_tcp_perconn_stats_dflt_tpl in the kernel.

trasz updated this revision to Diff 62720.Sep 29 2019, 8:01 PM
trasz marked 2 inline comments as done.

Better comment, build fixes and more cleanup, including removing some unrelated
code changes I've somehow missed before.

trasz updated this revision to Diff 62721.Sep 29 2019, 8:11 PM

Move the previously userspace-accessible V_ stuff into tcp_stats.c.

Note to self: needs tinderboxing.

trasz marked 6 inline comments as done.Sep 29 2019, 8:12 PM
trasz added inline comments.

I've moved it from a header into tcp_stats.c. Does that look ok?

allanjude added inline comments.Oct 2 2019, 4:14 PM

Does this address your feebback @thj?

julian added a subscriber: julian.Oct 7 2019, 8:47 PM
julian added inline comments.

you could move this section to the top so that there are not intermixed user/kernel parts..


Not sure if this is relevant but at $PRIORJOB I implemented some similar facilities. It was used for tracking RTT for sessions but we were uninterested in 'local' session so anything with an RTT of less than 1mSec (from memory) was called 0 and ignored.

trasz marked an inline comment as done.Oct 14 2019, 10:57 AM
trasz added inline comments.

But then we lose the grouping of those constants. Not sure which way is better, tbh.

trasz updated this revision to Diff 63497.Oct 20 2019, 6:07 PM

Regenerate after upstreaming stats(3).

trasz updated this revision to Diff 63550.Oct 22 2019, 5:55 PM

Upload the right version. :-/

trasz updated this revision to Diff 63605.Oct 23 2019, 8:00 PM

Add another chunk that got lost somewhere along the way; fixes build with DIAGNOSTIC.

trasz updated this revision to Diff 63608.Oct 23 2019, 8:39 PM

Drop local change to GENERIC.

trasz added a comment.Oct 23 2019, 8:39 PM

(Tinderboxed.) added inline comments.

Is it possible to make the ABI requirement explicit?

static_assert(sizeof(struct(tcp_info)) == xxx, "ABI constraint")

This would prevent unwanted mistakes by later changes done by other people.

trasz updated this revision to Diff 63872.Nov 1 2019, 10:03 PM

Fix tinderbox on i386.

trasz added inline comments.Nov 6 2019, 8:57 PM

The current version of the patch doesn't modify that code, but it's not a bad idea. I've spun it off to a separate review:

lstewart added inline comments.Nov 18 2019, 3:30 AM

This block she be within #ifdef STATS seeing as the code which consumes the goodput measurement is as well.


The INP_WUNLOCK can be shifted outside the #ifdef


Should all these be under #ifdef STATS? May also want to consider moving somewhere else e.g. down where #ifdef TCPPCAP is...


Wrap in #ifdef STATS?


Wrap in #ifdef STATS?

trasz updated this revision to Diff 64548.Nov 18 2019, 10:58 PM
trasz marked 4 inline comments as done.

Fixes from lstewart@.

trasz added inline comments.Nov 18 2019, 10:59 PM

Making them conditional could break ABI for userspace. Quick grepping for 'tcpcb' shows some stuff in DTrace scripts, ipfilter, systat(1), and trpt(1).

lstewart added inline comments.Nov 18 2019, 11:20 PM

Yes... there's no clear cut choice, but default off options (e.g. TCPPCAP a few lines below) tend to put their bits under #ifdef to avoid bloat for the majority who don't explicitly enable the feature. I don't have a strong opinion on this, but thought I'd flag it in case other opinions existed.

trasz updated this revision to Diff 64858.Nov 25 2019, 5:32 PM

Fix problem reported by Tom: when computing RTT, take the HZ into account; the 'rtt' argument is in ticks, not milliseconds. Thanks to Allan for tracking that down.

trasz updated this revision to Diff 64859.Nov 25 2019, 5:33 PM

Sigh; don't change GENERIC.

thj accepted this revision.Nov 30 2019, 1:13 PM

When kern.hz is 100 timer resolution below 10ms is disgarded. You can see clearly in
the RTT measurement, on a path with 15ms delay TCP stats will report an rtt of 20ms.
It also seems to be the case that all sub 10ms delay paths report a minimum delay of
10ms. This does not present if I boot with kern.hz set to 1000.

I think this deserves a note in the man page.

Please add Tested by: thj when you commit

This revision is now accepted and ready to land.Nov 30 2019, 1:13 PM
This revision was automatically updated to reflect the committed changes.