Details

Reviewers

tuexen
glebius
gallatin
rrs
pouria
ali_mashtizadeh.com

Group Reviewers

transport

Summary

Register a "tcp" eventlog(9) provider and emit per-connection events
from the in-tree TCP code paths. Each tcpcb gets a per-connection
eventlog session identified by inp_gencnt. When the user enables
eventlog on a socket via the new TCP_EVENTLOG sockopt, the session
is flagged for capture and dumped against any subscribers. Sessions
follow stack swaps via tcp_ensure_eventlog_session_on_switch(), so a
connection moving between eventlog-capable stacks (e.g. base ↔ rack)
stays observable across the transition.

Schema (include/eventlog/tcp_eventlog_schema.src):

SESSION/CONN - SESSION_CREATE, CONN_SET_IP_V[46], LOG_ID, CONN_PARAMS, MSS, CONN_OWNER, TCP_STACK_CHANGE, CC_ALGO_CHANGE
DATA - IN, OUT, ACK, SACK, USER_SEND, SENDFILE
CC / WINDOW - CWND, CC_PARAMS, ENTER/EXIT_RECOVERY, PRR_STATE, WINDOW_COLLAPSE, HYSTART, GOODPUT
RTT - RTT, RTT_SAMPLE
LOSS - RTO, RETRANSMIT_REASON, DSACK
TIMER - TIMER_START / _CANCEL / _FIRED, SB_WAKE
PACING - PACING_CALC, PACE_STATE, HW_PACE_RATE
APP / OPT - SOCKET_OPT_UINT32, SOCKET_OPT_ERR, OUTPUT_DECISION

The RTT keyword is split out from CC because EVENT RTT (300) fires on
every SRTT update; consumers that want CC visibility (cwnd / PRR /
hystart / goodput) without that firehose can subscribe to CC alone.
STRUCT GOODPUT carries the per-goodput-window SRTT and the all-time-min
RTT alongside gp_bw / raw_bw so consumers can correlate throughput
against RTT without interleaving a separate RTT event stream.

Emission points: tcp_input.c (RTT update), tcp_subr.c (provider
registration, dump callback, dump_session, conn_params,
sendfile/log-id helpers, stack switch), tcp_syncache.c (CONN_SET_IP),
tcp_timer.c (RTO), tcp_usrreq.c (TCP_EVENTLOG sockopt, CONN_OWNER on
attach, CC_ALGO_CHANGE in tcp_set_cc_mod), tcp_output.c, the rack/bbr
stacks, the cubic/newreno CC modules, and kern_sendfile.c.

The new sockopt TCP_EVENTLOG (79) and the new t_flags2 bit
TF2_EVENTLOG_ENABLED gate per-connection capture. CONN_OWNER tracks
the owning PID on active opens (socket(2) caller) and on passive
accepts (listener owner is inherited).

The build system in sys/conf/{kern.pre.mk,kmod.mk} (introduced with
eventlog(9)) auto-generates <eventlog/tcp_eventlog.h> from the schema
on every relevant rebuild, so producers always see consistent
emission macros.

tcp(4):

Document the new TCP_EVENTLOG sockopt and its dump-on-enable behaviour; cross-reference the kern.eventlog.tcp.default sysctl in eventlog(9) for the system-wide default.
Add elog(1) and eventlog(9) to SEE ALSO.

elog(1):

New flag -f / --format <type> (only "tcp" is currently accepted). With "-o dir=<path>" capture, files are post-renamed using the first LOG_ID and CONN_SET_IP_V[46] events seen in the session, producing <log_id>_<lport>_<rip>_<rport>.elog. Missing fields decode as "unknown".
struct session_file gains the per-session metadata fields used by the renamer.
extract_tcp_metadata() captures the metadata; rename_session_file() runs once at session-end (and at exit cleanup).

elog(1) man page (elog.1):

Document -f / --format with the TCP file-naming convention.
TCP-flavored examples now name "tcp" explicitly instead of the generic "<provider>" placeholder.

tests/sys/kern/elog_test.py:

Replaces the smoke-only test file with end-to-end coverage. Each test pins net.inet.tcp.functions_default to "freebsd" and sets kern.eventlog.tcp.default=1 around a loopback exchange so that captures contain stable, predictable events.
Coverage:
- basic capture from the tcp provider with no traffic and

with a loopback exchange (per stack, per address family);

provider-tag and event-name presence in human-readable

output;

dump-state replay of an existing connection's parameters;
keyword filtering (RX/TX);
level filtering and SESSION-keyword bypass;
binary capture round-trip via -r;
-o dir=<path> per-session split + -f tcp rename.

Diff Detail

Repository

rG FreeBSD src repository

Lint

Lint Skipped

Unit

Tests Skipped

Build Status

Buildable 74622
Build 71505: arc lint + arc unit

Event Timeline

nickbanks_netflix.com created this revision.May 28 2026, 6:36 PM

Herald added a reviewer: transport. · View Herald TranscriptMay 28 2026, 6:36 PM

Herald added subscribers: ziaee, olce, melifaro, imp. · View Herald Transcript

nickbanks_netflix.com requested review of this revision.May 28 2026, 6:36 PM

Harbormaster completed remote builds in B73516: Diff 178843.May 28 2026, 6:36 PM

nickbanks_netflix.com added a parent revision: D56979: kern: import eventlog(9), a kernel event logging framework.May 28 2026, 6:36 PM

Build fixes after running buildkernel: 1) Replace legacy 5-arg eventlog_provider_create() calls in tcp_subr.c and tcp_stacks/rack.c with upstream 2-arg form using a struct eventlog_provider_config. 2) Replace Netflix-internal INP_DROPPED check in the sendfile eventlog hook with upstream tp->t_flags & TF_DISCONNECTED. 3) Declare 'appended' in tcp_usr_send and capture sbavail() deltas around sbappendstream*. 4) Drop the SACK_LOG branch in rack_ack_received() since num_sack_blks/sack_blocks are not in scope; re-enable once rack_ack_received() is extended to receive SACK info from its callers.

Harbormaster completed remote builds in B73525: Diff 178861.May 28 2026, 10:22 PM

Build fixes after running buildkernel - see commit body for details

Harbormaster completed remote builds in B73526: Diff 178862.May 28 2026, 10:22 PM

I have plans to add Delayed ACK profiling eventually, does this collide with anything, or is someone going to beat me to it?

This is with a view to merging Minshall from xnu, which already does D-ACK profiling (NOT DSACK).

In D57309#1313341, @bms wrote:

I have plans to add Delayed ACK profiling eventually, does this collide with anything, or is someone going to beat me to it?

This is with a view to merging Minshall from xnu, which already does D-ACK profiling (NOT DSACK).

@bms This adds some helpful events that would help you do some profiling (though you might ultimately need a bit more):

ACK transmission is visible via EVENT OUT 101 (filter flags & TH_ACK, len=0 for pure ACKs).
DELACK timer life-cycle is visible via EVENT TIMER_START 500 (reason=DELAYACK) and EVENT TIMER_FIRED 502 (timer_type=DELACK).

Hopefully this would help any analyze you do if you add the improved delayed ACK logic.

Ah. This more or less removes an item from my TODO list. Thank you. The intent was always to profile first, add Minshall later.

I see that a new event logging channel is proposed in this change, rather than using the existing BBLog, as I had planned to do.

Whilst netstat -T reports the retransmit, out-of-order and send-with-zero-window counters, this relies on static counters, and is not the right approach. Adding a new field to tcpcb, as xnu does, potentially has unwanted cache-line effects. Although it does implement its own macOS-specific logging scheme.

gallatin added inline comments.Jun 3 2026, 10:45 PM

sys/netinet/tcp_var.h
447	I'm concerned about the potential for additional cache misses due to checking this pointer when this pointer is likely to be on a cold cache line in the tcb. It looks like you've also added a flag to flags2... that might be more efficient to check, as it is near the top of the tcb and might be more likely to be in cache.

This does not compile for me on a GENERIC kernel. I noted the issues inline.

sys/netinet/tcp.h
222	Please use `80` as discussed. `79` is already used by D55338.
sys/netinet/tcp_input.c
109–111	Don't you need `#include <netinet/tcp_hpts.h>` here? This file does not compile for me without it?
110	Why is line here?
sys/netinet/tcp_stacks/bbr.c
94–95	Don't you need `#include <netinet/tcp_hpts.h>` here? This file does not compile for me without it?
2279	This function is not defined. Do you mean `tcp_tv_to_usec()`?
sys/netinet/tcp_stacks/rack.c
5434	This function is not defined. Do you mean `tcp_tv_to_usec()`?

tuexen added inline comments.Jun 6 2026, 7:15 PM

sys/netinet/tcp_syncache.c
1033	Is it intended that the `TCP_EVENTLOG` socket option is not inherited from the listener? I would expect that. You can get that if you add here: if ((sototcpcb(lso)->t_flags2 & TF2_EVENTLOG_ENABLED) != 0) tp->t_flags2 \|= TF2_EVENTLOG_ENABLED;
sys/netinet/tcp_var.h
862	Could you use #define TF2_EVENTLOG_ENABLED 0x01000000 /* User requested eventlog enabled / since the above conflicts with #define TF2_RST_ON_CLOSE 0x00800000 / Send a RST when the socket is closed */ from D55338.

I run elog -o dir=. -f tcp -c tcp in my home directory and a client, which uses the IPPROTO_TCP-level socket option TP_EVENTLOG to enable logging. The resulting file has the name:

tuexen@freebsd16:~ % file *.elog
\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245_42405_\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245\245_42405.elog: data

Is that intended?

tuexen added inline comments.Jun 6 2026, 7:38 PM

include/eventlog/tcp_eventlog_schema.src
385	Is this intended to be the value which is on the wire or the one after the scaling operation?Right now it is the one after the scaling, not the one on the wire. This is inconsistent with the OUT event.
396	Is this intended to be the value which is on the wire or the one before the scaling operation? Right now it is the one on the wire. This is inconsistent with the IN event. Maybe provide both?

tuexen added inline comments.Jun 6 2026, 8:22 PM

sys/netinet/tcp_input.c
1519	Using `ntohl()` here is wrong, since the header fields are already in host byte oder.
1520	Using `ntohl()` here is wrong, since the header fields are already in host byte oder.

tuexen added inline comments.Jun 6 2026, 9:17 PM

include/eventlog/tcp_eventlog_schema.src
327	Why is this a flag an not an enum? Aren't you missing a 2MSL?
512	What is the time unit? tick, ms, us?
513	What is the reason?

tuexen added inline comments.Jun 6 2026, 9:26 PM

include/eventlog/tcp_eventlog_schema.src
103	Isn't the 2MSL timer missing?
394	Why does this event not contain the state information like the IN event?

tuexen added inline comments.Jun 7 2026, 12:52 PM

sys/netinet/tcp.h

222

Actually: please use 81. 79 and 80 will be used for

#define        TCP_RST_REASON_CODE     79      /* Get and set Diagnostic Payload in RST segments */
#define        TCP_RST_REASON_ENABLE   80      /* Enable Diagnostic Payload in RST segments */

sys/netinet/tcp_var.h

862

Actually please use

#define	TF2_EVENTLOG_ENABLED	0x02000000 /* User requested eventlog enabled */

as I use in the implementation of draft-ietf-tcpm-rst-diagnostic-payload

#define        TF2_RST_ON_CLOSE        0x00800000 /* Send a RST when the socket is closed */
#define        TF2_RST_DIAG_PAYLOAD    0x01000000 /* Include diagnostic payload in RSTs */

I have a serious problem with this proposed commit. Near as I can see the Eventlog created is just a knock-off recreation
of TCP BBlogging. I have heard (while I was traveling) from Michael that this has been somewhat discussed and the reasons
given were things like:

Wanting better performance (turning on BBlogs takes cpu)
Wanting it for other sub-systems
Wanting it to be easier to integrate with AI.

Now as to using it in other sub-systems, when @jtl designed the BBlogging system he explicitly did it in such a
way so that other sub-systems could use it. You might want to chat with him about that since he did design
it with that in mind. We have used it extensively with TCP but in theory there is no reason that it cannot be
adapted to other uses.

Performance wise this could have very easily been done with a tracepoint. A tracepoint (which I should not have to
be explaining if you spent time understanding the code) was designed to have a small set of events that can be
triggered to just log those things.. it was explicitly not designed to be able to turn on multiple trace points. That was
based on Michael and My experience with SCTP... But even if you wanted to do that i.e. have several layered tracepoints
that could be developed.. or with some clever macros you could have each "event" be setup so that it could respond to
two trace points.

There is no way we should be adding in a complete duplication to what already exists in the BBlogging system... if what you
were doing here was unique and could not be done with BBlogging then that would be a different story.. but this is IMO needless
duplication .

What I would like seen here is the removal of this whole event log system and change your eventlog points in the
tcp code ot use BB tracepoints. I am sure if you think about it you could do it and still get your data feed for AI without
adding a whole other system pretty much duplicated from the BBlogging systme.

This revision now requires changes to proceed.Jun 9 2026, 8:46 PM

In D57309#1317999, @rrs wrote:

I have a serious problem with this proposed commit. Near as I can see the Eventlog created is just a knock-off recreation
of TCP BBlogging.

Thanks Randall, and thanks for taking the time while traveling. BBLog has been enormously valuable and I'm not trying to diminish it or the work you and jtl put into it — eventlog is meant to be the long-term consolidation of this kind of observability, not a swipe at black box logging.

TL;DR: eventlog isn't a re-implementation of BBLog with a different trigger. Both the data model (schema-generated, purpose-built structs vs. one fixed flex struct) and the collection model (per-CPU, per-subscriber buffers with system-wide/multi-stack subscription and non-connection events vs. a per-connection ring in the tcpcb) are fundamentally different.

I want to address the "needless duplication / just use a tracepoint" framing directly, because I think it rests on eventlog being a re-implementation of BBLog's data model. It isn't — the collection model and the data model are both fundamentally different, and that's the whole point.

Data model. A BBLog event is one fixed struct tcp_log_buffer (version 9, ~320 bytes, 42 members) copied wholesale per event, including a union tcp_log_stackspecific whose fields are flex1..flex8 / u32_flex[14], reinterpreted per event id. The header is candid about this: "tcp_log_stackspecific is currently being used as 'event specific' log info by all stacks ... Until this is cleaned up more generically and throughout." eventlog instead generates a compact, purpose-built struct per event from a .src schema, and the same schema generates the user-space formatters and enum/flag decoders. There are no flex fields and consumers don't hand-parse a fixed blob. So it's not a recreation of BBLog — it's specifically designed to not carry that fixed-struct/flex-field model forward.

On using points to control which events fire (instead of eventlog's keyword/level filter). I want to make sure I'm engaging the right mechanism, because BBLog has two. A *tracepoint* is a capture trigger: when a code site is hit and a single global setting matches, it arms BBLog on that one connection, after which all of that connection's events record. The facility that actually selects *which* events fire is a *BBpoint*: you tag events with a point and enable that point. That's the real analog to eventlog's keyword filtering, so that's how I'm reading the suggestion.

The difficulty is that BBpoints are single-select and per-connection — the header says it directly: "you can have multiple points only one of which is active at a time." eventlog instead gives each subscriber an independent severity level and a 32-bit keyword bitmask, so you can OR together arbitrary categories and two consumers can run different filters at the same time. The multi-select extension you sketched — layered tracepoints, or macros so each event answers to two points — is essentially that keyword bitmask; building it on the BBpoint side means re-creating eventlog's filtering, but bound to per-connection state rather than per-subscriber.

And filtering is only one axis. Whatever selects the events, BBLog still writes the fixed struct into a per-connection ring that consumers hand-decode. A point system doesn't change the per-connection memory model, the fixed/flex struct, the schema-driven consumers, the non-connection/global events, or system-wide subscription. So selecting events via points addresses the smallest piece of what eventlog does and leaves the structural differences in place.

The cases BBLog structurally can't cover (your "if it's unique, different story" test):

Non-connection / global events. BBLog's buffer lives in the connection control block and every API takes a TCP connection. HPTS (already an eventlog provider) has no connection to attach a buffer to, and stack-wide sampled histograms aren't tied to one connection. There's nothing to hang a per-connection ring on.
System-wide / multi-stack subscription with zero per-connection memory. To be clear, eventlog can also be enabled per-connection — in fact it reuses the same log-id path BBLog does, so setting a connection's BBLog log id enables an eventlog session on it too. Per-connection opt-in isn't the difference. The difference is two-fold: (1) eventlog additionally supports "show me all TCP events across default+RACK+BBR" from one elog invocation with no per-connection setup, and (2) in either mode the records land in a per-CPU, per-subscriber buffer and allocate nothing on the connection. BBLog only captures into a per-connection ring living in the connection control block: there's no system-wide collection, and every captured connection costs ring memory.

Performance when on. Both are ~free when off. When on, BBLog allocates a record and copies the full struct (a connection snapshot plus the TCP header and options) per event; eventlog does a lockless enter and a single small copy into a per-CPU buffer, with no allocation. Happy to share microbenchmarks.

Other subsystems. I agree jtl designed BBLog with extensibility in mind and I'm glad to talk with him. But the implementation in tree is tightly bound to TCP — the ring lives on the connection, records embed the TCP header, locking assumes the TCP lock, and IDs key off TCP log-id buckets. Adapting it to a non-TCP subsystem means building the buffer/lifecycle/ID layer separately anyway; eventlog did that generalization once.

I'm explicitly not proposing to remove BBLog. The plan is coexistence now, migrate tooling next, and retire BBLog only once eventlog has full coverage and community adoption. I've written up the design and the BBLog comparison in detail (happy to share the doc / D56979), and I'd welcome a synchronous design discussion with you, jtl, tuexen, and gallatin to walk through the architecture and the benchmarks.

Nick:

In respect to tracepoints... I was referring to what went in with tracepoints called BBPOINT.. in tcp_log_buf.h you see:

static inline int
tcp_bblogging_point_on(struct tcpcb *tp, uint8_t bbpoint)
{

if (tp->_t_logstate <= TCP_LOG_STATE_OFF)
        return (0);
if ((tp->_t_logstate == TCP_LOG_VIA_BBPOINTS) &&
    (tp->_t_logpoint == bbpoint))
        return (1);
else if (tp->_t_logstate == TCP_LOG_VIA_BBPOINTS)
        return (0);
return (1);

}

Note the fact that you can set in logging via BBPOINT.. so that if you set to have BBPOINT '5' on then only the
BBPOINT 5 places will come out. Not all of the BBlogging.

This feature is not used much but it allows you to set up a "point" and only have that BBpoint trigger. There is
currently only <1> BBpoint defined.. aka TCP_BBPOINT_REQ_LEVEL_LOGGING. And that is obviously not been
used much. But it *is* a feature that allows you to turn on a very narrow set of BBlogs that you define.

Now as to the variable length of logging, Do you seriously want to log out events larger than 328 bytes? I
think if you review what you are logging you *should* be able to fit in the log structure.

Note that Michael and I considered using a mask.. but our mask in SCTP we found was not used. It was one point or all.. but in theory there
is no reason why you could not add a mask and have multi-layers in BBPOINTs... far better to enhance the existing system then it is
to create a new one IMO.

Thanks Randall — that clears up the BBPOINT piece. You're right on the mechanics (single-select today, one point defined, a mask could be added), and I agree that enhancing beats rewriting all else equal. But a mask only changes which events fire, which is the least of the differences. The structural ones it doesn't touch:

Filtering is per-connection, not per-subscriber. We rely on filtering heavily and agree it's valuable — the question is only where the filter state lives. _t_logpoint lives in the tcpcb, so the active selection is shared by everyone watching that connection and two consumers can't run different filters at once. eventlog's level+keyword filter is per-subscriber, so independent consumers coexist.
One fixed record is the wrong size in both directions. A 328-byte struct can't serve either end, and "fit it into the existing structure" only holds while every provider's events are TCP-shaped and small.
- Down: counting the full record (a 32-byte header + payload), a TCP IN is about 64 bytes vs the 328-byte tcp_log_buffer, so the common case is over-served severalfold on memory and copy cost.
- Up: event_length is a uint16_t (up to 64 KB), so this is a general framework that can carry full-packet capture or large records like the AMD IBS work.
Everything keys off a tcpcb. The ring lives in the connection, records embed the TCP header, locking assumes the TCP lock. HPTS (already an eventlog provider) and stack-wide histograms have no connection to attach to. Generalizing BBLog means doing the same lift eventlog already did, but in-place on a hot subsystem instead of a parallel framework you can adopt incrementally.

None of these are features that bolt onto BBLog — they're a different structure for how events are described, stored, filtered, and consumed. And the points above are only a few examples: there's also schema-defined events with autogenerated producer structures, automatic binary-to-text decoding generated from that same schema, and more. The things that make eventlog general are exactly the things BBLog's design hardcodes, so enhancing BBLog to match would mean undoing those assumptions one by one — which converges on rebuilding it as eventlog anyway. It's cleaner to build the general framework once and migrate to it.

Address tuexen & gallatin review feedback: GENERIC build fixes (TCP_EVENTLOG 79->81, TF2_EVENTLOG_ENABLED bit move, add <netinet/tcp_hpts.h>, replace internal timeval_to_usec64() with tcp_tv_to_lusec()); log IN/OUT fields in host byte order and use tcp_get_flags() to capture the AE flag; schema adds win_raw+win to IN/OUT and state to OUT, renames timeout->timeout_us, adds timer TWOMSL; elog(1) uses calloc() for per-session state; tcp_syncache.c dynamically enables the eventlog session and dumps state on connections accepted from an eventlog-enabled listener.

Harbormaster completed remote builds in B73812: Diff 179584.Jun 10 2026, 8:02 PM

Thanks for the very thorough review, Michael — this caught several real bugs, not just nits. I've uploaded a new diff that addresses every point; replies are inline. A few overall notes:

GENERIC now builds. The four causes were the sockopt number, the TF2_* flag bit, the missing #include <netinet/tcp_hpts.h> in tcp_input.c, and the timeval_to_usec64() leak in bbr.c/rack.c. Verified on an amd64 buildworld + GENERIC buildkernel (including the in-tree tcp_rack and tcp_bbr modules) — clean.

Byte order / window semantics. All IN/OUT emit sites now log in host byte order, and IN and OUT both carry win_raw (on-wire 16-bit) and win (effective 32-bit after << snd_scale).

The .elog garbage filename (a run of \245/0xA5 bytes) was a real bug: the per-session bookkeeping struct in elog(1) was allocated with malloc(), leaving has_log_id/has_conn_info/log_id[]/remote_ip[] uninitialised; on a junk-filled (0xA5) heap that surfaced as the garbage name whenever a session hadn't yet seen TCP metadata. Switched to calloc() — the file is now named from the log-id / 4-tuple instead.

The inline replies below are attached to the diff you reviewed so they line up with your comments.

Point-by-point replies inline.

In D57309#1318468, @nickbanks_netflix.com wrote:

Thanks for the very thorough review, Michael — this caught several real bugs, not just nits. I've uploaded a new diff that addresses every point; replies are inline. A few overall notes:

GENERIC now builds. The four causes were the sockopt number, the TF2_* flag bit, the missing #include <netinet/tcp_hpts.h> in tcp_input.c, and the timeval_to_usec64() leak in bbr.c/rack.c. Verified on an amd64 buildworld + GENERIC buildkernel (including the in-tree tcp_rack and tcp_bbr modules) — clean.

Byte order / window semantics. All IN/OUT emit sites now log in host byte order, and IN and OUT both carry win_raw (on-wire 16-bit) and win (effective 32-bit after << snd_scale).

The .elog garbage filename (a run of \245/0xA5 bytes) was a real bug: the per-session bookkeeping struct in elog(1) was allocated with malloc(), leaving has_log_id/has_conn_info/log_id[]/remote_ip[] uninitialised; on a junk-filled (0xA5) heap that surfaced as the garbage name whenever a session hadn't yet seen TCP metadata. Switched to calloc() — the file is now named from the log-id / 4-tuple instead.

The inline replies below are attached to the diff you reviewed so they line up with your comments.

Thanks for the fixes. I will continue testing. What I already observed is that timer for the base stack are not logged. Let me propose a fix for that. And I also think that some logging for syscalls are missing. I am testing this with two simple programs using the base stack and comparing what information I get via BBLog and via eventlog. Give me a couple of days.

Collected point-by-point below, keyed to your inline comments by file:line:

sys/netinet/tcp.h:222 — Done; TCP_EVENTLOG is now 81 (79 = TCP_RST_REASON_CODE, 80 reserved for TCP_RST_REASON_ENABLE).

sys/netinet/tcp_var.h:862 — Done; TF2_EVENTLOG_ENABLED is now 0x02000000, clearing TF2_RST_ON_CLOSE (0x00800000) and the reserved TF2_RST_DIAG_PAYLOAD (0x01000000); TF2_BITS index updated to \32.

sys/netinet/tcp_input.c:109 — Yes; added #include <netinet/tcp_hpts.h> (needed for tcp_get_u64_usecs()). Its absence was the main GENERIC build break.

sys/netinet/tcp_input.c:110 — That early tiwin = th->th_win << tp->snd_scale; was a redundant assignment from the port (upstream already computes tiwin later in tcp_do_segment()); removed, and the scaled window is now computed inline only at the emit site.

sys/netinet/tcp_input.c:1520-1521 — Correct; dropped the ntohl(). tcp_fields_to_host() runs before the per-stack tfb_tcp_do_segment dispatch, so th_seq/th_ack/th_win/th_urp are already in host order at the emit site; the calls were byte-swapping on little-endian. All IN/OUT sites now log host order. I also added ntohs(th->th_sum) at IN, since tcp_fields_to_host() does not swap the checksum and OUT was already logging the host value.

sys/netinet/tcp_stacks/bbr.c:94 — bbr.c already includes <netinet/tcp_hpts.h> (it's there in upstream main); the breakage was the timeval_to_usec64() call just below.

sys/netinet/tcp_stacks/bbr.c:2279 & sys/netinet/tcp_stacks/rack.c:5434 — Fixed; timeval_to_usec64() was a Netflix-internal helper that leaked into the patch. I used tcp_tv_to_lusec() rather than tcp_tv_to_usec(): the schema's rcv_time_us is 64-bit microseconds and tcp_tv_to_usec() returns uint32_t (it would truncate/wrap); tcp_tv_to_lusec() is the uint64_t variant in the same header.

sys/netinet/tcp_syncache.c:1022 — Fixed — thanks for the exact snippet. I added the inheritance, and went slightly further: when a connection is accepted on an eventlog-enabled listener I also enable its per-connection session and dump its current state, so a subscriber gets a complete view from the first event rather than only catching events after the next state change.

include/eventlog/tcp_eventlog_schema.src:384 & :395 — Fixed; IN and OUT now both carry win_raw (on-wire 16-bit) and win (effective 32-bit after << snd_scale), so the two events are consistent.

include/eventlog/tcp_eventlog_schema.src:393 — Fixed; STRUCT OUT now carries state:uint8_t:enum_tcp_state, mirroring IN.

include/eventlog/tcp_eventlog_schema.src:102 — Added 7:TWOMSL to ENUM timer_reason.

include/eventlog/tcp_eventlog_schema.src:326 — Added 0x40:TWOMSL to FLAG timer_type. On flag-vs-enum: timer_reason (the enum) records the single cause that triggered the event; timer_type (the flag) records the set of timers in play, which is not always one — a single TIMER_FIRED can reflect more than one armed callout. Happy to switch it to an enum if you'd prefer, now that there's a clean TWOMSL value.

include/eventlog/tcp_eventlog_schema.src:511 — Microseconds; renamed timeout → timeout_us.

include/eventlog/tcp_eventlog_schema.src:512 — It's the cause the timer was started (enum_timer_reason: RETRANSMIT, PERSIST, KEEPALIVE, DELACK, …, and now TWOMSL); the decoder prints the symbolic name. I can add a short schema comment if you'd prefer it spelled out at the field.

@rrs - I just wanted to provide a bit more context here.

Eventlog is working to address needs from multiple sides here, providing support for multiple event producers and consumers, in a way that scales well. We already have several providers beyond TCP and HPTS working, including CPU sampling, process and thread information, interface status and datapath sampling, and have plans to integrate more (user space support for stuff like nginx, AMD IBS, etc.). And on the consumer side, we need to support independent background collectors (low volume sampled data, A/B test detailed data, all for upload) and tooling and development capture. Additionally, we've built rich tooling to take all this data (in a single file) and provide time synchronized views (screenshot below). I am working on open sourcing the project in parallel. To be able to scale the number of eventlog providers and the tooling integration, all of it had to be structured, schema-based for automated processing. So, a lot of different things are driving the design. Ultimately, the goal is to satisfy the needs far beyond just what TCP needed in the past.

I observed that the default file name .elog files for TCP is something like 7.elog, if 7 is the inp_gencnt. Would it make sense to have some tcp specific part in the name like tcp_7.elog or so?

sys/netinet/tcp_subr.c
661	I think it would be good to check here that `inp->inp_lport != 0` and/or for `inp->inp_fport != 0`. The reason is that if a program creates the socket, enables event logging and then calls `connect()`, enabling event logging inserts a `CONN_SET_IP_V4` event with all entries being 0 and the later `connect()` call inserts another `CONN_SET_IP_V4` event with the appropriate information. However, since the `elog` tool (right now) only looks for the first `CONN_SET_IP_V4` or `CONN_SET_IP_V6` entry, the file end up with the name `unknown_0_0.0.0.0_0.elog`. To avoid this, either only insert only valid `CONN_SET_IP_V4` events or change `elog` tool to look for the first valid one. Whether `inp->inp_fport != 0` should be checked how the listening sockets should be handled.
668	Same as above, just for IPv6 instead of IPv4.

tuexen added inline comments.Jun 14 2026, 6:28 PM

sys/netinet/tcp_timer.c

870

I changed the above function to:

static uint8_t const tcp_eventlog_timer_flag[TT_N] = {
    [TT_REXMT] = TCP_EVENTLOG_FLAG_RXT,
    [TT_PERSIST] = TCP_EVENTLOG_FLAG_PERSIST,
    [TT_KEEP] = TCP_EVENTLOG_FLAG_KEEP,
    [TT_2MSL] = TCP_EVENTLOG_FLAG_TWOMSL,
    [TT_DELACK] = TCP_EVENTLOG_FLAG_DELACK,
};

static uint8_t const tcp_eventlog_timer_reason[TT_N] = {
    [TT_REXMT] = TCP_EVENTLOG_TIMER_REASON_RXT,
    [TT_PERSIST] = TCP_EVENTLOG_TIMER_REASON_PERSIST,
    [TT_KEEP] = TCP_EVENTLOG_TIMER_REASON_KEEPALIVE,
    [TT_2MSL] = TCP_EVENTLOG_TIMER_REASON_TWOMSL,
    [TT_DELACK] = TCP_EVENTLOG_TIMER_REASON_DELAYACK,
};

static void
tcp_log_timer(struct tcpcb *tp, tt_which which, tt_what what, uint32_t ticks)
{
	struct tcp_log_buffer *lgb;
	uint64_t ms, us;

	INP_WLOCK_ASSERT(tptoinpcb(tp));
	if (tcp_bblogging_on(tp))
		lgb = tcp_log_event(tp, NULL, NULL, NULL, TCP_LOG_RTO, 0, 0,
		    NULL, false, NULL, NULL, 0, NULL);
	else
		lgb = NULL;
	if (lgb != NULL) {
		lgb->tlb_flex1 = (what << 8) | which;
		if (what == TT_STARTING) {
			/* Convert ticks to ms and store it in tlb_flex2. */
			if (hz == 1000)
				lgb->tlb_flex2 = ticks;
			else {
				ms = (((uint64_t)ticks * 1000) + (hz - 1)) / hz;
				if (ms > UINT32_MAX)
					lgb->tlb_flex2 = UINT32_MAX;
				else
					lgb->tlb_flex2 = (uint32_t)ms;
			}
		}
	}
	switch (what) {
	case TT_PROCESSING:
		TCP_EVENTLOG_TIMER_FIRED_LOG(tp->t_eventlog_session,
		    tcp_eventlog_timer_flag[which], /* why */1);
		break;
	case TT_PROCESSED:
		/* This is missing. */
		break;
	case TT_STARTING:
		if (hz == 1000)
			us = 1000 * (uint64_t)ticks;
		else
			us = (((uint64_t)ticks * 1000000) + (hz - 1)) / hz;
		TCP_EVENTLOG_TIMER_START_LOG(tp->t_eventlog_session,
		    us > UINT32_MAX ? UINT32_MAX : (uint32_t)us, /* overflow */
		    tcp_eventlog_timer_reason[which]);
		break;
	case TT_STOPPING:
		TCP_EVENTLOG_TIMER_CANCEL_LOG(tp->t_eventlog_session); /* no timer id */
		break;
	}
}

and then changed all allocations of tcp_bblog_timer() to tcp_log_timer(). This gets timer logging for the default stack somewhat working:

5        [ 2]186cb::22:46:04.183794 [tcp][9][TIMER_START] Started KEEPALIVE timer for 75000000 us
6        [ 2]186cb::22:46:04.183829 [tcp][9][IN] Received 0 bytes (seq 4102050744, ack 3158266436, win 81920, flags ACK) in state SYN_RECEIVED rcv_time 113111338 us
7        [ 2]186cb::22:46:04.183831 [tcp][9][TIMER_START] Started KEEPALIVE timer for 4294967295 us
8        [ 2]186cb::22:46:04.183833 [tcp][9][RTT] RTT sample 1000 us applied to srtt 0 us (path 0 us, all-time low 0 us, GP min 0 us, gp_srtt 0 us)
9        [ 2]186cb::22:46:04.183834 [tcp][9][TIMER_CANCEL] Timer canceled
10       [ 3]186cb::22:46:04.184368 [tcp][9][IN] Received 8192 bytes (seq 4102050744, ack 3158266436, win 81920, flags PUSH|ACK) in state ESTABLISHED rcv_time 113111876 us
11       [ 3]186cb::22:46:04.184374 [tcp][9][TIMER_START] Started DELAYACK timer for 40000 us
12       [ 0]186c7::22:46:04.226913 [tcp][9][TIMER_FIRED] Timer expired (type DELACK, result 1)

This brings up a couple of comments/questions:

Can we change timeout_us:uint32_t in STRUCT TIMER_START to timeout_us:uint64_t to avoid overflows, which already happen above?
Can we add a enum_timer_reason or flag_timer_type to EVENT TIMER_CANCEL? I think it would be good to know, which timer was canceled?
What is the semantic of result in STRUCT TIMER_FIRED? Does it mean the event is logged when the timer processing is done (TT_PROCESSED) instead of started (TT_PROCESSING)?
Does it makes sense to add a event which corresponds to TT_PROCESSED?

Address review feedback: emit timer lifecycle events (START/CANCEL/FIRED/PROCESSED) from the base-stack timer callers; widen TIMER_START.timeout_us to uint64_t; add timer_reason payloads; guard CONN_SET_IP on a bound local port; elog(1) provider-prefixed per-session filename

Harbormaster completed remote builds in B73888: Diff 179809.Jun 15 2026, 4:10 PM

In D57309#1320333, @nickbanks_netflix.com wrote:

Address review feedback: emit timer lifecycle events (START/CANCEL/FIRED/PROCESSED) from the base-stack timer callers; widen TIMER_START.timeout_us to uint64_t; add timer_reason payloads; guard CONN_SET_IP on a bound local port; elog(1) provider-prefixed per-session filename

Thank you very much quickly addressing the issues. Will continue testing...

tuexen added inline comments.Jun 21 2026, 9:38 AM

include/eventlog/tcp_eventlog_schema.src
120	Is it possible to use `TCP_NODELAY` from `tcp.h` here instead of duplicating the numbers here?
377	It would be great to also have the information about: TOS byte / Traffic Class TCP options The last thing brings up the question if we can have something like an open array as the last element of an event? That could be used here for the options, where we can just have the bytes, or when we want to handle socket options in an event, where this could be used for the `option_value`.
389	Comments from `IN` apply here too.
398	Do you really want to compute the checksum here?
sys/netinet/tcp.h
222	Please use #define TCP_EVENTLOG 81 /* int: enable/disable eventlog */ because now the type is indicated in the comment.

tuexen added inline comments.Jun 21 2026, 6:11 PM

include/eventlog/tcp_eventlog_schema.src
377	It would be great to also have the information about: TOS byte / Traffic Class TCP options The last thing brings up the question if we can have something like an open array as the last element of an event? That could be used here for the options, where we can just have the bytes, or when we want to handle socket options in an event, where this could be used for the `option_value`. Maybe at the end: data_offset:uint8_t option_len:uint8_t option:uint8_t[option_len:8] keeping the `option_len` separate from the `data_offset` field would allow to cover recent proposals for extending the option space. Or putting the `data_offset` (4 bits) in the `flags`.

nickbanks_netflix.com marked 23 inline comments as done.Thu, Jun 25, 1:03 PM

nickbanks_netflix.com added inline comments.

include/eventlog/tcp_eventlog_schema.src
120	Short answer: No. Longer answer: It would add a lot of complexity to require parsing C files, either at build time in the AWK script or generating a header dependency dynamically. Also, it would not work for out-of-tree consumers of the schema (i.e. EventExplorer or some cloud backend processor) that simply don't have those headers. Additionally, even if we could define something like `tcp.h:TCP_NODELAY:TCP_NODELAY` that's not really fixing my main gripe, which is that you have to list all of these individually anyways (i.e. you must duplicate). But since TCP socket options are just standalone `#defines` and not an enum, there's no reasonable way to define a parser to even collect the set of them (the AI is the best solution here IMHO).
377	Yeah, we can do that.
398	I don't want to do any unnecessary computation. a) I thought this was just passing a value that already was there and b) I thought bblog had this too. If we can remove this as unnecessary, I'd be happy to do so. What do you think?

Drop IN/OUT checksum; log on-wire TCP options; add SHUTDOWN/CLOSE events; add SOCKET_OPT_BLOB for variable-length socket options

Harbormaster completed remote builds in B74597: Diff 181472.Tue, Jul 7, 6:26 PM

SOCKET_OPT_BLOB: drop varlen field ref from event format string

Harbormaster completed remote builds in B74622: Diff 181541.Wed, Jul 8, 3:10 PM

tcp: add eventlog(9) provider for TCP
Needs ReviewPublic
Actions

Details

Diff Detail

Event Timeline

Revision Contents
Changeset List

Diff 181541

include/eventlog/tcp_eventlog_schema.src

share/man/man4/tcp.4

sys/kern/kern_sendfile.c

sys/modules/cc/cc_cubic/Makefile

sys/modules/cc/cc_newreno/Makefile

sys/modules/tcp/bbr/Makefile

sys/modules/tcp/rack/Makefile

sys/netinet/cc/cc_cubic.c

sys/netinet/cc/cc_newreno.c

sys/netinet/tcp.h

sys/netinet/tcp_input.c

sys/netinet/tcp_log_buf.h

sys/netinet/tcp_log_buf.c

sys/netinet/tcp_output.c

sys/netinet/tcp_stacks/bbr.c

sys/netinet/tcp_stacks/rack.c

sys/netinet/tcp_subr.c

sys/netinet/tcp_syncache.c

sys/netinet/tcp_timer.c

sys/netinet/tcp_usrreq.c

sys/netinet/tcp_var.h

tests/sys/kern/elog_test.py

usr.bin/elog/elog.1

usr.bin/elog/elog.c

tcp: add eventlog(9) provider for TCPNeeds ReviewPublicActions

Details

Diff Detail

Event Timeline

Revision ContentsChangeset List

Diff 181541

include/eventlog/tcp_eventlog_schema.src

share/man/man4/tcp.4

sys/kern/kern_sendfile.c

sys/modules/cc/cc_cubic/Makefile

sys/modules/cc/cc_newreno/Makefile

sys/modules/tcp/bbr/Makefile

sys/modules/tcp/rack/Makefile

sys/netinet/cc/cc_cubic.c

sys/netinet/cc/cc_newreno.c

sys/netinet/tcp.h

sys/netinet/tcp_input.c

sys/netinet/tcp_log_buf.h

sys/netinet/tcp_log_buf.c

sys/netinet/tcp_output.c

sys/netinet/tcp_stacks/bbr.c

sys/netinet/tcp_stacks/rack.c

sys/netinet/tcp_subr.c

sys/netinet/tcp_syncache.c

sys/netinet/tcp_timer.c

sys/netinet/tcp_usrreq.c

sys/netinet/tcp_var.h

tests/sys/kern/elog_test.py

usr.bin/elog/elog.1

usr.bin/elog/elog.c

tcp: add eventlog(9) provider for TCP
Needs ReviewPublic
Actions

Revision Contents
Changeset List