Page MenuHomeFreeBSD

tcp: add eventlog(9) provider for TCP
Needs ReviewPublic

Authored by nickbanks_netflix.com on Thu, May 28, 6:36 PM.
Tags
None
Referenced Files
F159994636: D57309.id178862.diff
Sat, Jun 20, 8:43 AM
F159971506: D57309.id179809.diff
Sat, Jun 20, 2:32 AM
F159969308: D57309.id178843.diff
Sat, Jun 20, 1:59 AM
F159966419: D57309.id178861.diff
Sat, Jun 20, 1:03 AM
F159961185: D57309.id178843.diff
Fri, Jun 19, 11:22 PM
Unknown Object (File)
Fri, Jun 19, 9:02 AM
Unknown Object (File)
Wed, Jun 17, 11:10 PM
Unknown Object (File)
Tue, Jun 16, 2:19 PM

Details

Summary

Register a "tcp" eventlog(9) provider and emit per-connection events
from the in-tree TCP code paths. Each tcpcb gets a per-connection
eventlog session identified by inp_gencnt. When the user enables
eventlog on a socket via the new TCP_EVENTLOG sockopt, the session
is flagged for capture and dumped against any subscribers. Sessions
follow stack swaps via tcp_ensure_eventlog_session_on_switch(), so a
connection moving between eventlog-capable stacks (e.g. base ↔ rack)
stays observable across the transition.

Schema (include/eventlog/tcp_eventlog_schema.src):

  • SESSION/CONN - SESSION_CREATE, CONN_SET_IP_V[46], LOG_ID, CONN_PARAMS, MSS, CONN_OWNER, TCP_STACK_CHANGE, CC_ALGO_CHANGE
  • DATA - IN, OUT, ACK, SACK, USER_SEND, SENDFILE
  • CC / WINDOW - CWND, CC_PARAMS, ENTER/EXIT_RECOVERY, PRR_STATE, WINDOW_COLLAPSE, HYSTART, GOODPUT
  • RTT - RTT, RTT_SAMPLE
  • LOSS - RTO, RETRANSMIT_REASON, DSACK
  • TIMER - TIMER_START / _CANCEL / _FIRED, SB_WAKE
  • PACING - PACING_CALC, PACE_STATE, HW_PACE_RATE
  • APP / OPT - SOCKET_OPT_UINT32, SOCKET_OPT_ERR, OUTPUT_DECISION

The RTT keyword is split out from CC because EVENT RTT (300) fires on
every SRTT update; consumers that want CC visibility (cwnd / PRR /
hystart / goodput) without that firehose can subscribe to CC alone.
STRUCT GOODPUT carries the per-goodput-window SRTT and the all-time-min
RTT alongside gp_bw / raw_bw so consumers can correlate throughput
against RTT without interleaving a separate RTT event stream.

Emission points: tcp_input.c (RTT update), tcp_subr.c (provider
registration, dump callback, dump_session, conn_params,
sendfile/log-id helpers, stack switch), tcp_syncache.c (CONN_SET_IP),
tcp_timer.c (RTO), tcp_usrreq.c (TCP_EVENTLOG sockopt, CONN_OWNER on
attach, CC_ALGO_CHANGE in tcp_set_cc_mod), tcp_output.c, the rack/bbr
stacks, the cubic/newreno CC modules, and kern_sendfile.c.

The new sockopt TCP_EVENTLOG (79) and the new t_flags2 bit
TF2_EVENTLOG_ENABLED gate per-connection capture. CONN_OWNER tracks
the owning PID on active opens (socket(2) caller) and on passive
accepts (listener owner is inherited).

The build system in sys/conf/{kern.pre.mk,kmod.mk} (introduced with
eventlog(9)) auto-generates <eventlog/tcp_eventlog.h> from the schema
on every relevant rebuild, so producers always see consistent
emission macros.

tcp(4):

  • Document the new TCP_EVENTLOG sockopt and its dump-on-enable behaviour; cross-reference the kern.eventlog.tcp.default sysctl in eventlog(9) for the system-wide default.
  • Add elog(1) and eventlog(9) to SEE ALSO.

elog(1):

  • New flag -f / --format <type> (only "tcp" is currently accepted). With "-o dir=<path>" capture, files are post-renamed using the first LOG_ID and CONN_SET_IP_V[46] events seen in the session, producing <log_id>_<lport>_<rip>_<rport>.elog. Missing fields decode as "unknown".
  • struct session_file gains the per-session metadata fields used by the renamer.
  • extract_tcp_metadata() captures the metadata; rename_session_file() runs once at session-end (and at exit cleanup).

elog(1) man page (elog.1):

  • Document -f / --format with the TCP file-naming convention.
  • TCP-flavored examples now name "tcp" explicitly instead of the generic "<provider>" placeholder.

tests/sys/kern/elog_test.py:

  • Replaces the smoke-only test file with end-to-end coverage. Each test pins net.inet.tcp.functions_default to "freebsd" and sets kern.eventlog.tcp.default=1 around a loopback exchange so that captures contain stable, predictable events.
  • Coverage:
    • basic capture from the tcp provider with no traffic and

with a loopback exchange (per stack, per address family);

  • provider-tag and event-name presence in human-readable

output;

  • dump-state replay of an existing connection's parameters;
  • keyword filtering (RX/TX);
  • level filtering and SESSION-keyword bypass;
  • binary capture round-trip via -r;
  • -o dir=<path> per-session split + -f tcp rename.

Sponsored by: Netflix

Signed-off-by: Nick Banks <nickbanks@netflix.com>

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Skipped
Unit
Tests Skipped
Build Status
Buildable 73888
Build 70771: arc lint + arc unit

Event Timeline

Build fixes after running buildkernel: 1) Replace legacy 5-arg eventlog_provider_create() calls in tcp_subr.c and tcp_stacks/rack.c with upstream 2-arg form using a struct eventlog_provider_config. 2) Replace Netflix-internal INP_DROPPED check in the sendfile eventlog hook with upstream tp->t_flags & TF_DISCONNECTED. 3) Declare 'appended' in tcp_usr_send and capture sbavail() deltas around sbappendstream*. 4) Drop the SACK_LOG branch in rack_ack_received() since num_sack_blks/sack_blocks are not in scope; re-enable once rack_ack_received() is extended to receive SACK info from its callers.

Build fixes after running buildkernel - see commit body for details

I have plans to add Delayed ACK profiling eventually, does this collide with anything, or is someone going to beat me to it?

This is with a view to merging Minshall from xnu, which already does D-ACK profiling (NOT DSACK).

In D57309#1313341, @bms wrote:

I have plans to add Delayed ACK profiling eventually, does this collide with anything, or is someone going to beat me to it?

This is with a view to merging Minshall from xnu, which already does D-ACK profiling (NOT DSACK).

@bms This adds some helpful events that would help you do some profiling (though you might ultimately need a bit more):

  • ACK transmission is visible via EVENT OUT 101 (filter flags & TH_ACK, len=0 for pure ACKs).
  • DELACK timer life-cycle is visible via EVENT TIMER_START 500 (reason=DELAYACK) and EVENT TIMER_FIRED 502 (timer_type=DELACK).

You can collect these events via elog -c tcp VERBOSE TX|TIMER|CONN or if you want to correlate to received packets, do elog -c tcp VERBOSE RX|TX|TIMER|CONN.

Hopefully this would help any analyze you do if you add the improved delayed ACK logic.

Ah. This more or less removes an item from my TODO list. Thank you. The intent was always to profile first, add Minshall later.

I see that a new event logging channel is proposed in this change, rather than using the existing BBLog, as I had planned to do.

Whilst netstat -T reports the retransmit, out-of-order and send-with-zero-window counters, this relies on static counters, and is not the right approach. Adding a new field to tcpcb, as xnu does, potentially has unwanted cache-line effects. Although it does implement its own macOS-specific logging scheme.

sys/netinet/tcp_var.h
447

I'm concerned about the potential for additional cache misses due to checking this pointer when this pointer is likely to be on a cold cache line in the tcb. It looks like you've also added a flag to flags2... that might be more efficient to check, as it is near the top of the tcb and might be more likely to be in cache.

This does not compile for me on a GENERIC kernel. I noted the issues inline.

sys/netinet/tcp.h
222

Please use 80 as discussed. 79 is already used by D55338.

sys/netinet/tcp_input.c
109–111

Don't you need #include <netinet/tcp_hpts.h> here? This file does not compile for me without it?

110

Why is line here?

sys/netinet/tcp_stacks/bbr.c
94–95

Don't you need #include <netinet/tcp_hpts.h> here? This file does not compile for me without it?

2279

This function is not defined. Do you mean tcp_tv_to_usec()?

sys/netinet/tcp_stacks/rack.c
5434

This function is not defined. Do you mean tcp_tv_to_usec()?

sys/netinet/tcp_syncache.c
1033

Is it intended that the TCP_EVENTLOG socket option is not inherited from the listener? I would expect that. You can get that if you add here:

	if ((sototcpcb(lso)->t_flags2 & TF2_EVENTLOG_ENABLED) != 0)
		tp->t_flags2 |= TF2_EVENTLOG_ENABLED;
sys/netinet/tcp_var.h
862

Could you use

#define	TF2_EVENTLOG_ENABLED	0x01000000 /* User requested eventlog enabled */

since the above conflicts with

#define	TF2_RST_ON_CLOSE	0x00800000 /* Send a RST when the socket is closed */

from D55338.

I run elog -o dir=. -f tcp -c tcp in my home directory and a client, which uses the IPPROTO_TCP-level socket option TP_EVENTLOG to enable logging. The resulting file has the name:

tuexen@freebsd16:~ % file *.elog
\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245_42405_\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245_42405.elog: data

Is that intended?

include/eventlog/tcp_eventlog_schema.src
385

Is this intended to be the value which is on the wire or the one after the scaling operation?Right now it is the one after the scaling, not the one on the wire. This is inconsistent with the OUT event.

396

Is this intended to be the value which is on the wire or the one before the scaling operation? Right now it is the one on the wire. This is inconsistent with the IN event.
Maybe provide both?

sys/netinet/tcp_input.c
1519

Using ntohl() here is wrong, since the header fields are already in host byte oder.

1520

Using ntohl() here is wrong, since the header fields are already in host byte oder.

include/eventlog/tcp_eventlog_schema.src
327

Why is this a flag an not an enum? Aren't you missing a 2MSL?

512

What is the time unit? tick, ms, us?

513

What is the reason?

include/eventlog/tcp_eventlog_schema.src
103

Isn't the 2MSL timer missing?

394

Why does this event not contain the state information like the IN event?

sys/netinet/tcp.h
222

Actually: please use 81. 79 and 80 will be used for

#define        TCP_RST_REASON_CODE     79      /* Get and set Diagnostic Payload in RST segments */
#define        TCP_RST_REASON_ENABLE   80      /* Enable Diagnostic Payload in RST segments */
sys/netinet/tcp_var.h
862

Actually please use

#define	TF2_EVENTLOG_ENABLED	0x02000000 /* User requested eventlog enabled */

as I use in the implementation of draft-ietf-tcpm-rst-diagnostic-payload

#define        TF2_RST_ON_CLOSE        0x00800000 /* Send a RST when the socket is closed */
#define        TF2_RST_DIAG_PAYLOAD    0x01000000 /* Include diagnostic payload in RSTs */
rrs requested changes to this revision.Tue, Jun 9, 8:46 PM
rrs added a subscriber: jtl.

I have a serious problem with this proposed commit. Near as I can see the Eventlog created is just a knock-off recreation
of TCP BBlogging. I have heard (while I was traveling) from Michael that this has been somewhat discussed and the reasons
given were things like:

Wanting better performance (turning on BBlogs takes cpu)
Wanting it for other sub-systems
Wanting it to be easier to integrate with AI.

Now as to using it in other sub-systems, when @jtl designed the BBlogging system he explicitly did it in such a
way so that other sub-systems could use it. You might want to chat with him about that since he did design
it with that in mind. We have used it extensively with TCP but in theory there is no reason that it cannot be
adapted to other uses.

Performance wise this could have very easily been done with a tracepoint. A tracepoint (which I should not have to
be explaining if you spent time understanding the code) was designed to have a small set of events that can be
triggered to just log those things.. it was explicitly not designed to be able to turn on multiple trace points. That was
based on Michael and My experience with SCTP... But even if you wanted to do that i.e. have several layered tracepoints
that could be developed.. or with some clever macros you could have each "event" be setup so that it could respond to
two trace points.

There is no way we should be adding in a complete duplication to what already exists in the BBlogging system... if what you
were doing here was unique and could not be done with BBlogging then that would be a different story.. but this is IMO needless
duplication .

What I would like seen here is the removal of this whole event log system and change your eventlog points in the
tcp code ot use BB tracepoints. I am sure if you think about it you could do it and still get your data feed for AI without
adding a whole other system pretty much duplicated from the BBlogging systme.

R

This revision now requires changes to proceed.Tue, Jun 9, 8:46 PM
In D57309#1317999, @rrs wrote:

I have a serious problem with this proposed commit. Near as I can see the Eventlog created is just a knock-off recreation
of TCP BBlogging.

Thanks Randall, and thanks for taking the time while traveling. BBLog has been enormously valuable and I'm not trying to diminish it or the work you and jtl put into it — eventlog is meant to be the long-term consolidation of this kind of observability, not a swipe at black box logging.

TL;DR: eventlog isn't a re-implementation of BBLog with a different trigger. Both the data model (schema-generated, purpose-built structs vs. one fixed flex struct) and the collection model (per-CPU, per-subscriber buffers with system-wide/multi-stack subscription and non-connection events vs. a per-connection ring in the tcpcb) are fundamentally different.

I want to address the "needless duplication / just use a tracepoint" framing directly, because I think it rests on eventlog being a re-implementation of BBLog's data model. It isn't — the collection model and the data model are both fundamentally different, and that's the whole point.

Data model. A BBLog event is one fixed struct tcp_log_buffer (version 9, ~320 bytes, 42 members) copied wholesale per event, including a union tcp_log_stackspecific whose fields are flex1..flex8 / u32_flex[14], reinterpreted per event id. The header is candid about this: "tcp_log_stackspecific is currently being used as 'event specific' log info by all stacks ... Until this is cleaned up more generically and throughout." eventlog instead generates a compact, purpose-built struct per event from a .src schema, and the same schema generates the user-space formatters and enum/flag decoders. There are no flex fields and consumers don't hand-parse a fixed blob. So it's not a recreation of BBLog — it's specifically designed to not carry that fixed-struct/flex-field model forward.

On using points to control which events fire (instead of eventlog's keyword/level filter). I want to make sure I'm engaging the right mechanism, because BBLog has two. A *tracepoint* is a capture trigger: when a code site is hit and a single global setting matches, it arms BBLog on that one connection, after which all of that connection's events record. The facility that actually selects *which* events fire is a *BBpoint*: you tag events with a point and enable that point. That's the real analog to eventlog's keyword filtering, so that's how I'm reading the suggestion.

The difficulty is that BBpoints are single-select and per-connection — the header says it directly: "you can have multiple points only one of which is active at a time." eventlog instead gives each subscriber an independent severity level and a 32-bit keyword bitmask, so you can OR together arbitrary categories and two consumers can run different filters at the same time. The multi-select extension you sketched — layered tracepoints, or macros so each event answers to two points — is essentially that keyword bitmask; building it on the BBpoint side means re-creating eventlog's filtering, but bound to per-connection state rather than per-subscriber.

And filtering is only one axis. Whatever selects the events, BBLog still writes the fixed struct into a per-connection ring that consumers hand-decode. A point system doesn't change the per-connection memory model, the fixed/flex struct, the schema-driven consumers, the non-connection/global events, or system-wide subscription. So selecting events via points addresses the smallest piece of what eventlog does and leaves the structural differences in place.

The cases BBLog structurally can't cover (your "if it's unique, different story" test):

  • Non-connection / global events. BBLog's buffer lives in the connection control block and every API takes a TCP connection. HPTS (already an eventlog provider) has no connection to attach a buffer to, and stack-wide sampled histograms aren't tied to one connection. There's nothing to hang a per-connection ring on.
  • System-wide / multi-stack subscription with zero per-connection memory. To be clear, eventlog can also be enabled per-connection — in fact it reuses the same log-id path BBLog does, so setting a connection's BBLog log id enables an eventlog session on it too. Per-connection opt-in isn't the difference. The difference is two-fold: (1) eventlog additionally supports "show me all TCP events across default+RACK+BBR" from one elog invocation with no per-connection setup, and (2) in either mode the records land in a per-CPU, per-subscriber buffer and allocate nothing on the connection. BBLog only captures into a per-connection ring living in the connection control block: there's no system-wide collection, and every captured connection costs ring memory.

Performance when on. Both are ~free when off. When on, BBLog allocates a record and copies the full struct (a connection snapshot plus the TCP header and options) per event; eventlog does a lockless enter and a single small copy into a per-CPU buffer, with no allocation. Happy to share microbenchmarks.

Other subsystems. I agree jtl designed BBLog with extensibility in mind and I'm glad to talk with him. But the implementation in tree is tightly bound to TCP — the ring lives on the connection, records embed the TCP header, locking assumes the TCP lock, and IDs key off TCP log-id buckets. Adapting it to a non-TCP subsystem means building the buffer/lifecycle/ID layer separately anyway; eventlog did that generalization once.

I'm explicitly not proposing to remove BBLog. The plan is coexistence now, migrate tooling next, and retire BBLog only once eventlog has full coverage and community adoption. I've written up the design and the BBLog comparison in detail (happy to share the doc / D56979), and I'd welcome a synchronous design discussion with you, jtl, tuexen, and gallatin to walk through the architecture and the benchmarks.

Nick:

In respect to tracepoints... I was referring to what went in with tracepoints called BBPOINT.. in tcp_log_buf.h you see:

static inline int
tcp_bblogging_point_on(struct tcpcb *tp, uint8_t bbpoint)
{

if (tp->_t_logstate <= TCP_LOG_STATE_OFF)
        return (0);
if ((tp->_t_logstate == TCP_LOG_VIA_BBPOINTS) &&
    (tp->_t_logpoint == bbpoint))
        return (1);
else if (tp->_t_logstate == TCP_LOG_VIA_BBPOINTS)
        return (0);
return (1);

}

Note the fact that you can set in logging via BBPOINT.. so that if you set to have BBPOINT '5' on then only the
BBPOINT 5 places will come out. Not all of the BBlogging.

This feature is not used much but it allows you to set up a "point" and only have that BBpoint trigger. There is
currently only <1> BBpoint defined.. aka TCP_BBPOINT_REQ_LEVEL_LOGGING. And that is obviously not been
used much. But it *is* a feature that allows you to turn on a very narrow set of BBlogs that you define.

Now as to the variable length of logging, Do you seriously want to log out events larger than 328 bytes? I
think if you review what you are logging you *should* be able to fit in the log structure.

Note that Michael and I considered using a mask.. but our mask in SCTP we found was not used. It was one point or all.. but in theory there
is no reason why you could not add a mask and have multi-layers in BBPOINTs... far better to enhance the existing system then it is
to create a new one IMO.

Thanks Randall — that clears up the BBPOINT piece. You're right on the mechanics (single-select today, one point defined, a mask could be added), and I agree that enhancing beats rewriting all else equal. But a mask only changes which events fire, which is the least of the differences. The structural ones it doesn't touch:

  • Filtering is per-connection, not per-subscriber. We rely on filtering heavily and agree it's valuable — the question is only where the filter state lives. _t_logpoint lives in the tcpcb, so the active selection is shared by everyone watching that connection and two consumers can't run different filters at once. eventlog's level+keyword filter is per-subscriber, so independent consumers coexist.
  • One fixed record is the wrong size in both directions. A 328-byte struct can't serve either end, and "fit it into the existing structure" only holds while every provider's events are TCP-shaped and small.
    • Down: counting the full record (a 32-byte header + payload), a TCP IN is about 64 bytes vs the 328-byte tcp_log_buffer, so the common case is over-served severalfold on memory and copy cost.
    • Up: event_length is a uint16_t (up to 64 KB), so this is a general framework that can carry full-packet capture or large records like the AMD IBS work.
  • Everything keys off a tcpcb. The ring lives in the connection, records embed the TCP header, locking assumes the TCP lock. HPTS (already an eventlog provider) and stack-wide histograms have no connection to attach to. Generalizing BBLog means doing the same lift eventlog already did, but in-place on a hot subsystem instead of a parallel framework you can adopt incrementally.

None of these are features that bolt onto BBLog — they're a different structure for how events are described, stored, filtered, and consumed. And the points above are only a few examples: there's also schema-defined events with autogenerated producer structures, automatic binary-to-text decoding generated from that same schema, and more. The things that make eventlog general are exactly the things BBLog's design hardcodes, so enhancing BBLog to match would mean undoing those assumptions one by one — which converges on rebuilding it as eventlog anyway. It's cleaner to build the general framework once and migrate to it.

Address tuexen & gallatin review feedback: GENERIC build fixes (TCP_EVENTLOG 79->81, TF2_EVENTLOG_ENABLED bit move, add <netinet/tcp_hpts.h>, replace internal timeval_to_usec64() with tcp_tv_to_lusec()); log IN/OUT fields in host byte order and use tcp_get_flags() to capture the AE flag; schema adds win_raw+win to IN/OUT and state to OUT, renames timeout->timeout_us, adds timer TWOMSL; elog(1) uses calloc() for per-session state; tcp_syncache.c dynamically enables the eventlog session and dumps state on connections accepted from an eventlog-enabled listener.

Thanks for the very thorough review, Michael — this caught several real bugs, not just nits. I've uploaded a new diff that addresses every point; replies are inline. A few overall notes:

GENERIC now builds. The four causes were the sockopt number, the TF2_* flag bit, the missing #include <netinet/tcp_hpts.h> in tcp_input.c, and the timeval_to_usec64() leak in bbr.c/rack.c. Verified on an amd64 buildworld + GENERIC buildkernel (including the in-tree tcp_rack and tcp_bbr modules) — clean.

Byte order / window semantics. All IN/OUT emit sites now log in host byte order, and IN and OUT both carry win_raw (on-wire 16-bit) and win (effective 32-bit after << snd_scale).

The .elog garbage filename (a run of \245/0xA5 bytes) was a real bug: the per-session bookkeeping struct in elog(1) was allocated with malloc(), leaving has_log_id/has_conn_info/log_id[]/remote_ip[] uninitialised; on a junk-filled (0xA5) heap that surfaced as the garbage name whenever a session hadn't yet seen TCP metadata. Switched to calloc() — the file is now named from the log-id / 4-tuple instead.

The inline replies below are attached to the diff you reviewed so they line up with your comments.

Thanks for the very thorough review, Michael — this caught several real bugs, not just nits. I've uploaded a new diff that addresses every point; replies are inline. A few overall notes:

GENERIC now builds. The four causes were the sockopt number, the TF2_* flag bit, the missing #include <netinet/tcp_hpts.h> in tcp_input.c, and the timeval_to_usec64() leak in bbr.c/rack.c. Verified on an amd64 buildworld + GENERIC buildkernel (including the in-tree tcp_rack and tcp_bbr modules) — clean.

Byte order / window semantics. All IN/OUT emit sites now log in host byte order, and IN and OUT both carry win_raw (on-wire 16-bit) and win (effective 32-bit after << snd_scale).

The .elog garbage filename (a run of \245/0xA5 bytes) was a real bug: the per-session bookkeeping struct in elog(1) was allocated with malloc(), leaving has_log_id/has_conn_info/log_id[]/remote_ip[] uninitialised; on a junk-filled (0xA5) heap that surfaced as the garbage name whenever a session hadn't yet seen TCP metadata. Switched to calloc() — the file is now named from the log-id / 4-tuple instead.

The inline replies below are attached to the diff you reviewed so they line up with your comments.

Thanks for the fixes. I will continue testing. What I already observed is that timer for the base stack are not logged. Let me propose a fix for that. And I also think that some logging for syscalls are missing. I am testing this with two simple programs using the base stack and comparing what information I get via BBLog and via eventlog. Give me a couple of days.

Collected point-by-point below, keyed to your inline comments by file:line:

sys/netinet/tcp.h:222 — Done; TCP_EVENTLOG is now 81 (79 = TCP_RST_REASON_CODE, 80 reserved for TCP_RST_REASON_ENABLE).

sys/netinet/tcp_var.h:862 — Done; TF2_EVENTLOG_ENABLED is now 0x02000000, clearing TF2_RST_ON_CLOSE (0x00800000) and the reserved TF2_RST_DIAG_PAYLOAD (0x01000000); TF2_BITS index updated to \32.

sys/netinet/tcp_input.c:109 — Yes; added #include <netinet/tcp_hpts.h> (needed for tcp_get_u64_usecs()). Its absence was the main GENERIC build break.

sys/netinet/tcp_input.c:110 — That early tiwin = th->th_win << tp->snd_scale; was a redundant assignment from the port (upstream already computes tiwin later in tcp_do_segment()); removed, and the scaled window is now computed inline only at the emit site.

sys/netinet/tcp_input.c:1520-1521 — Correct; dropped the ntohl(). tcp_fields_to_host() runs before the per-stack tfb_tcp_do_segment dispatch, so th_seq/th_ack/th_win/th_urp are already in host order at the emit site; the calls were byte-swapping on little-endian. All IN/OUT sites now log host order. I also added ntohs(th->th_sum) at IN, since tcp_fields_to_host() does not swap the checksum and OUT was already logging the host value.

sys/netinet/tcp_stacks/bbr.c:94bbr.c already includes <netinet/tcp_hpts.h> (it's there in upstream main); the breakage was the timeval_to_usec64() call just below.

sys/netinet/tcp_stacks/bbr.c:2279 & sys/netinet/tcp_stacks/rack.c:5434 — Fixed; timeval_to_usec64() was a Netflix-internal helper that leaked into the patch. I used tcp_tv_to_lusec() rather than tcp_tv_to_usec(): the schema's rcv_time_us is 64-bit microseconds and tcp_tv_to_usec() returns uint32_t (it would truncate/wrap); tcp_tv_to_lusec() is the uint64_t variant in the same header.

sys/netinet/tcp_syncache.c:1022 — Fixed — thanks for the exact snippet. I added the inheritance, and went slightly further: when a connection is accepted on an eventlog-enabled listener I also enable its per-connection session and dump its current state, so a subscriber gets a complete view from the first event rather than only catching events after the next state change.

include/eventlog/tcp_eventlog_schema.src:384 & :395 — Fixed; IN and OUT now both carry win_raw (on-wire 16-bit) and win (effective 32-bit after << snd_scale), so the two events are consistent.

include/eventlog/tcp_eventlog_schema.src:393 — Fixed; STRUCT OUT now carries state:uint8_t:enum_tcp_state, mirroring IN.

include/eventlog/tcp_eventlog_schema.src:102 — Added 7:TWOMSL to ENUM timer_reason.

include/eventlog/tcp_eventlog_schema.src:326 — Added 0x40:TWOMSL to FLAG timer_type. On flag-vs-enum: timer_reason (the enum) records the single cause that triggered the event; timer_type (the flag) records the set of timers in play, which is not always one — a single TIMER_FIRED can reflect more than one armed callout. Happy to switch it to an enum if you'd prefer, now that there's a clean TWOMSL value.

include/eventlog/tcp_eventlog_schema.src:511 — Microseconds; renamed timeouttimeout_us.

include/eventlog/tcp_eventlog_schema.src:512 — It's the cause the timer was started (enum_timer_reason: RETRANSMIT, PERSIST, KEEPALIVE, DELACK, …, and now TWOMSL); the decoder prints the symbolic name. I can add a short schema comment if you'd prefer it spelled out at the field.

@rrs - I just wanted to provide a bit more context here.

Eventlog is working to address needs from multiple sides here, providing support for multiple event producers and consumers, in a way that scales well. We already have several providers beyond TCP and HPTS working, including CPU sampling, process and thread information, interface status and datapath sampling, and have plans to integrate more (user space support for stuff like nginx, AMD IBS, etc.). And on the consumer side, we need to support independent background collectors (low volume sampled data, A/B test detailed data, all for upload) and tooling and development capture. Additionally, we've built rich tooling to take all this data (in a single file) and provide time synchronized views (screenshot below). I am working on open sourcing the project in parallel. To be able to scale the number of eventlog providers and the tooling integration, all of it had to be structured, schema-based for automated processing. So, a lot of different things are driving the design. Ultimately, the goal is to satisfy the needs far beyond just what TCP needed in the past.

image.png (1×2 px, 624 KB)

I observed that the default file name .elog files for TCP is something like 7.elog, if 7 is the inp_gencnt. Would it make sense to have some tcp specific part in the name like tcp_7.elog or so?

sys/netinet/tcp_subr.c
661

I think it would be good to check here that inp->inp_lport != 0 and/or for inp->inp_fport != 0. The reason is that if a program creates the socket, enables event logging and then calls connect(), enabling event logging inserts a CONN_SET_IP_V4 event with all entries being 0 and the later connect() call inserts another CONN_SET_IP_V4 event with the appropriate information. However, since the elog tool (right now) only looks for the first CONN_SET_IP_V4 or CONN_SET_IP_V6 entry, the file end up with the name unknown_0_0.0.0.0_0.elog.
To avoid this, either only insert only valid CONN_SET_IP_V4 events or change elog tool to look for the first valid one.
Whether inp->inp_fport != 0 should be checked how the listening sockets should be handled.

668

Same as above, just for IPv6 instead of IPv4.

sys/netinet/tcp_timer.c
870

I changed the above function to:

static uint8_t const tcp_eventlog_timer_flag[TT_N] = {
    [TT_REXMT] = TCP_EVENTLOG_FLAG_RXT,
    [TT_PERSIST] = TCP_EVENTLOG_FLAG_PERSIST,
    [TT_KEEP] = TCP_EVENTLOG_FLAG_KEEP,
    [TT_2MSL] = TCP_EVENTLOG_FLAG_TWOMSL,
    [TT_DELACK] = TCP_EVENTLOG_FLAG_DELACK,
};

static uint8_t const tcp_eventlog_timer_reason[TT_N] = {
    [TT_REXMT] = TCP_EVENTLOG_TIMER_REASON_RXT,
    [TT_PERSIST] = TCP_EVENTLOG_TIMER_REASON_PERSIST,
    [TT_KEEP] = TCP_EVENTLOG_TIMER_REASON_KEEPALIVE,
    [TT_2MSL] = TCP_EVENTLOG_TIMER_REASON_TWOMSL,
    [TT_DELACK] = TCP_EVENTLOG_TIMER_REASON_DELAYACK,
};

static void
tcp_log_timer(struct tcpcb *tp, tt_which which, tt_what what, uint32_t ticks)
{
	struct tcp_log_buffer *lgb;
	uint64_t ms, us;

	INP_WLOCK_ASSERT(tptoinpcb(tp));
	if (tcp_bblogging_on(tp))
		lgb = tcp_log_event(tp, NULL, NULL, NULL, TCP_LOG_RTO, 0, 0,
		    NULL, false, NULL, NULL, 0, NULL);
	else
		lgb = NULL;
	if (lgb != NULL) {
		lgb->tlb_flex1 = (what << 8) | which;
		if (what == TT_STARTING) {
			/* Convert ticks to ms and store it in tlb_flex2. */
			if (hz == 1000)
				lgb->tlb_flex2 = ticks;
			else {
				ms = (((uint64_t)ticks * 1000) + (hz - 1)) / hz;
				if (ms > UINT32_MAX)
					lgb->tlb_flex2 = UINT32_MAX;
				else
					lgb->tlb_flex2 = (uint32_t)ms;
			}
		}
	}
	switch (what) {
	case TT_PROCESSING:
		TCP_EVENTLOG_TIMER_FIRED_LOG(tp->t_eventlog_session,
		    tcp_eventlog_timer_flag[which], /* why */1);
		break;
	case TT_PROCESSED:
		/* This is missing. */
		break;
	case TT_STARTING:
		if (hz == 1000)
			us = 1000 * (uint64_t)ticks;
		else
			us = (((uint64_t)ticks * 1000000) + (hz - 1)) / hz;
		TCP_EVENTLOG_TIMER_START_LOG(tp->t_eventlog_session,
		    us > UINT32_MAX ? UINT32_MAX : (uint32_t)us, /* overflow */
		    tcp_eventlog_timer_reason[which]);
		break;
	case TT_STOPPING:
		TCP_EVENTLOG_TIMER_CANCEL_LOG(tp->t_eventlog_session); /* no timer id */
		break;
	}
}

and then changed all allocations of tcp_bblog_timer() to tcp_log_timer(). This gets timer logging for the default stack somewhat working:

5        [ 2]186cb::22:46:04.183794 [tcp][9][TIMER_START] Started KEEPALIVE timer for 75000000 us
6        [ 2]186cb::22:46:04.183829 [tcp][9][IN] Received 0 bytes (seq 4102050744, ack 3158266436, win 81920, flags ACK) in state SYN_RECEIVED rcv_time 113111338 us
7        [ 2]186cb::22:46:04.183831 [tcp][9][TIMER_START] Started KEEPALIVE timer for 4294967295 us
8        [ 2]186cb::22:46:04.183833 [tcp][9][RTT] RTT sample 1000 us applied to srtt 0 us (path 0 us, all-time low 0 us, GP min 0 us, gp_srtt 0 us)
9        [ 2]186cb::22:46:04.183834 [tcp][9][TIMER_CANCEL] Timer canceled
10       [ 3]186cb::22:46:04.184368 [tcp][9][IN] Received 8192 bytes (seq 4102050744, ack 3158266436, win 81920, flags PUSH|ACK) in state ESTABLISHED rcv_time 113111876 us
11       [ 3]186cb::22:46:04.184374 [tcp][9][TIMER_START] Started DELAYACK timer for 40000 us
12       [ 0]186c7::22:46:04.226913 [tcp][9][TIMER_FIRED] Timer expired (type DELACK, result 1)

This brings up a couple of comments/questions:

  • Can we change timeout_us:uint32_t in STRUCT TIMER_START to timeout_us:uint64_t to avoid overflows, which already happen above?
  • Can we add a enum_timer_reason or flag_timer_type to EVENT TIMER_CANCEL? I think it would be good to know, which timer was canceled?
  • What is the semantic of result in STRUCT TIMER_FIRED? Does it mean the event is logged when the timer processing is done (TT_PROCESSED) instead of started (TT_PROCESSING)?
  • Does it makes sense to add a event which corresponds to TT_PROCESSED?

Address review feedback: emit timer lifecycle events (START/CANCEL/FIRED/PROCESSED) from the base-stack timer callers; widen TIMER_START.timeout_us to uint64_t; add timer_reason payloads; guard CONN_SET_IP on a bound local port; elog(1) provider-prefixed per-session filename

Address review feedback: emit timer lifecycle events (START/CANCEL/FIRED/PROCESSED) from the base-stack timer callers; widen TIMER_START.timeout_us to uint64_t; add timer_reason payloads; guard CONN_SET_IP on a bound local port; elog(1) provider-prefixed per-session filename

Thank you very much quickly addressing the issues. Will continue testing...