Page MenuHomeFreeBSD

Make use of stats(3) in the TCP stack
ClosedPublic

Authored by trasz on Jun 15 2019, 2:48 PM.
Tags
None
Referenced Files
Unknown Object (File)
Sat, Jan 18, 12:03 PM
Unknown Object (File)
Thu, Jan 16, 9:05 PM
Unknown Object (File)
Thu, Jan 9, 7:51 AM
Unknown Object (File)
Sun, Jan 5, 10:55 AM
Unknown Object (File)
Fri, Jan 3, 11:58 AM
Unknown Object (File)
Fri, Jan 3, 11:57 AM
Unknown Object (File)
Fri, Jan 3, 11:06 AM
Unknown Object (File)
Fri, Jan 3, 10:30 AM

Details

Summary

Make use of the stats(3) framework in the TCP stack.

This makes it possible to retrieve per-connection statistical
information such as the receive window size, RTT, or goodput,
using a newly added TCP_STATS getsockopt(3) option, and extract them
using the stats_voistat_fetch(3) API.

See the net/tcprtt port for an example consumer of this API.

Compared to the existing TCP_INFO system, the main differences are that
this mechanism is easy to extend without breaking ABI, and provides
statistical information instead of raw "snapshots" of values at a given
point in time. stats(3) is more generic and can be used in both
userland and the kernel.

This work depends on:

the stats(3) framework:
https://reviews.freebsd.org/D20477

The qmath(3) macros:
https://reviews.freebsd.org/D20116

and the Array-base Red Black Tree implementation:
https://reviews.freebsd.org/D20324

An example consumer tool for using these TCP stats is here:
https://reviews.freebsd.org/D20656.

Sponsored By: Klara Inc, Netflix
Obtained from: Netflix

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
allanjude added inline comments.
sys/netinet/tcp_usrreq.c
1533 ↗(On Diff #58673)

should the fixup of missing TF_FASTOPEN bits be a separate commit?

thj added inline comments.
sys/netinet/tcp.h
335 ↗(On Diff #58673)

tcpsyninfo looks to be 16 bytes to me, shouldn't this update to

__tcpi_pad[10]

sys/netinet/tcp.h
335 ↗(On Diff #58673)

But __tcpi_pad[] is u_int32_t, not a char?

sys/netinet/tcp.h
335 ↗(On Diff #58673)

indeed, tcpi_pad[] is a u_int32_t (so 4 bytes), so dropping it from 26 to 22 accounts for the 16 bytes. Unless I am missing something.

sys/netinet/tcp.h
335 ↗(On Diff #58673)

Ah okay, I missed that.

sys/netinet/tcp_input.c
321 ↗(On Diff #58673)

Is the goodput measurement time (i.e. number of ms for the measurement window) reported by stats somewhere?

Does the calculation break with an rtt <1ms?

I think a comment here might be helpful for future readers of the code, something along the lines of

report goodput in bits per [time]
sys/netinet/tcp_input.c
321 ↗(On Diff #58673)

I don't think the raw measurement time is reported - but there's the digest computed for goodput, which I'm guessing is what you'd want the times for?

Regarding rtt < 1ms - well, it's just 'capped from the bottom' with the max(), but I don't know how realistic times below 1ms are, given that the resolution of tcp_ts_getticks() seems to be 1ms?

As for the units - that's a good suggestion; let me do a little experiment to make sure I get this right and I'll add it to the VOI declarations in tcp.h.

bz added a subscriber: bz.

Adding TCP folks explicitly as reviewers.

share/man/man4/tcp.4
37 ↗(On Diff #58673)

Please don't forget to update on commit.

sys/netinet/tcp_var.h
753 ↗(On Diff #58673)

The comment sounds wrong to me. One variable looks like a sysctl and with that should not need to be user land visible.
For the other, I don't know the new stats framework but I'd hope it'd use a proper "copyin/out" mechanism and not directly read kernel memory using libkvm with vnet variables.

jtl added inline comments.
sys/netinet/tcp_var.h
753 ↗(On Diff #58673)

I agree. I don't think any of these three lines need to be visible to userland. I think they can safely move within the #ifdef _KERNEL block. I suspect that this is a little cruft which managed to sneak in during the 3+ years of development on this project.

lstewart added inline comments.
sys/netinet/tcp_var.h
753 ↗(On Diff #58673)

Actually, they do need to be visible to userland in the current way things work, but the comment does possibly need to be expanded to explain why (or my arguably hacky way of doing things needs to be improved).

The underlying issue here is the need for statsblob templates defined in the kernel or userland to be shared across both domains, such that a statsblob generated in one domain based on a particular template can be manipulated in the other domain. By shared, I don't mean in a direct memory access sense, but the statsblob template needs to exist or be discoverable in both domains (or in future, perhaps across machines).

The only direction that matters in the current code being upstreamed is for templates defined in kernel, and then statsblobs based on a kernel-defined template being exported to userspace, and manipulated in userspace e.g. for rendering to JSON for logging purposes.

The "right" solution would be to have an API for exporting/importing statsblob templates. I punted on doing it the "right" way and cheated.

Regarding the TCP-related stats code, tcp_stats_init() in tcp_stats.c defines and registers the "TCP_DEFAULT" statsblob template together with its associated variables of interest. tcp_stats.c is compiled into the kernel, and tcp_stats_init() is called from tcp_init() so that the kernel statsblob template registry contains the "TCP_DEFAULT" template for use by the TCP stack.

kern/subr_stats.c is both compiled into the kernel and into a userland library (similar to the way sbuf(9) is built for both domains), and in order to have the stats templates match between the kernel and userland, I compile tcp_stats.c into the userland library too with tcp_stats_init() tagged as a library constructor so that the "TCP_DEFAULT" template is defined the same as what's in the kernel, and is registered on userland library load. As a consequence, tcp_stats.c has a bit of #ifdef goo to make compilation in userland work correctly, and these tcp_var.h bits need to be visible to make the userland library compile.

This scheme does mean that the kernel and userland can get out of sync if a change is made to the template but only one of the kernel or userland library are recompiled against the new code. In such a situation, statsblobs generated e.g. in kernel, would have a template hash id which differs from the templates known in userland, and the userland code would fail any statsblob manipulation operations that required the template to be known (return an errno).

In practice, this scheme works well and has not caused problems in the Netflix use case. It was always the plan to eventually implement the API for template export/import and clean this up when that work is done, but I don't think the API is a requirement for upstreaming.

sys/netinet/tcp_var.h
753 ↗(On Diff #58673)

Can you point me at the userland code which "reads" these things? Given one is a sysctl already I'd think it'd be easy to adjust that. I am only wondering if the code was originally written with VIMAGE in mind?

sys/netinet/tcp_var.h
753 ↗(On Diff #58673)

@bz Oops, sorry for being unclear. The userland library does not read or interact with any of these tcp_var.h variables in kernel. If you look in tcp_stats.c you'll see where this all comes together.

tcp_stats_init() is executed in both kernel, and separately in the userland library as a library constructor. V_tcp_perconn_stats_dflt_tpl is used in tcp_stats_init() and therefore needs to be resolvable for compilation of stats code in both kernel and userland. In the userland library it just resolves to a library-scope global integer - no relation/access to the kernel variable of same name.

V_tcp_perconn_stats_enable does not strictly speaking need to be visible to userland as it's only referenced from kernel code, but it doesn't hurt having it here and keeping these bits together (though I would also not object to it moving elsewhere).

sys/netinet/tcp_var.h
753 ↗(On Diff #58673)

FWIW, I think this dual-use (especially of the V_tcp_perconn_stats_dflt_tpl variable) makes things unclear. I think it would be nice to eliminate it, if possible, without adding code duplication. However, I understand that may not be possible.

I *suspect* you may be able to make the constructor be a static function in the user-land library. This would avoid the need to expose it via headers.

Also, as best I can tell, V_tcp_perconn_stats_dflt_tpl is not used in the userland library outside of tcp_stats_init. Is that true? If so, it might be possible to do a trivial rewrite of this function to use a local variable, and only update V_tcp_perconn_stats_dflt_tpl in the kernel.

trasz marked 2 inline comments as done.

Better comment, build fixes and more cleanup, including removing some unrelated
code changes I've somehow missed before.

Move the previously userspace-accessible V_ stuff into tcp_stats.c.

Note to self: needs tinderboxing.

trasz added inline comments.
sys/netinet/tcp_var.h
753 ↗(On Diff #58673)

I've moved it from a header into tcp_stats.c. Does that look ok?

sys/netinet/tcp_input.c
321 ↗(On Diff #58673)

Does this address your feebback @thj?

julian added inline comments.
sys/netinet/cc/cc.h
111 ↗(On Diff #62721)

you could move this section to the top so that there are not intermixed user/kernel parts..

sys/netinet/tcp_input.c
321 ↗(On Diff #58673)

Not sure if this is relevant but at $PRIORJOB I implemented some similar facilities. It was used for tracking RTT for sessions but we were uninterested in 'local' session so anything with an RTT of less than 1mSec (from memory) was called 0 and ignored.

trasz added inline comments.
sys/netinet/cc/cc.h
111 ↗(On Diff #62721)

But then we lose the grouping of those constants. Not sure which way is better, tbh.

Regenerate after upstreaming stats(3).

Upload the right version. :-/

Add another chunk that got lost somewhere along the way; fixes build with DIAGNOSTIC.

Drop local change to GENERIC.

donner added inline comments.
sys/netinet/tcp.h
337 ↗(On Diff #63608)

Is it possible to make the ABI requirement explicit?

static_assert(sizeof(struct(tcp_info)) == xxx, "ABI constraint")

This would prevent unwanted mistakes by later changes done by other people.

sys/netinet/tcp.h
337 ↗(On Diff #63608)

The current version of the patch doesn't modify that code, but it's not a bad idea. I've spun it off to a separate review: https://reviews.freebsd.org/D22267.

sys/netinet/tcp_output.c
1494 ↗(On Diff #63872)

This block she be within #ifdef STATS seeing as the code which consumes the goodput measurement is as well.

sys/netinet/tcp_usrreq.c
1951 ↗(On Diff #63872)

The INP_WUNLOCK can be shifted outside the #ifdef

sys/netinet/tcp_var.h
218 ↗(On Diff #63872)

Should all these be under #ifdef STATS? May also want to consider moving somewhere else e.g. down where #ifdef TCPPCAP is...

790 ↗(On Diff #63872)

Wrap in #ifdef STATS?

827 ↗(On Diff #63872)

Wrap in #ifdef STATS?

trasz marked 4 inline comments as done.

Fixes from lstewart@.

sys/netinet/tcp_var.h
218 ↗(On Diff #63872)

Making them conditional could break ABI for userspace. Quick grepping for 'tcpcb' shows some stuff in DTrace scripts, ipfilter, systat(1), and trpt(1).

sys/netinet/tcp_var.h
218 ↗(On Diff #63872)

Yes... there's no clear cut choice, but default off options (e.g. TCPPCAP a few lines below) tend to put their bits under #ifdef to avoid bloat for the majority who don't explicitly enable the feature. I don't have a strong opinion on this, but thought I'd flag it in case other opinions existed.

Fix problem reported by Tom: when computing RTT, take the HZ into account; the 'rtt' argument is in ticks, not milliseconds. Thanks to Allan for tracking that down.

Sigh; don't change GENERIC.

When kern.hz is 100 timer resolution below 10ms is disgarded. You can see clearly in
the RTT measurement, on a path with 15ms delay TCP stats will report an rtt of 20ms.
It also seems to be the case that all sub 10ms delay paths report a minimum delay of
10ms. This does not present if I boot with kern.hz set to 1000.

I think this deserves a note in the man page.

Please add Tested by: thj when you commit

This revision is now accepted and ready to land.Nov 30 2019, 1:13 PM
This revision was automatically updated to reflect the committed changes.