Page MenuHomeFreeBSD

kern: import eventlog(9), a kernel event logging framework
Needs ReviewPublic

Authored by nickbanks_netflix.com on May 12 2026, 9:25 PM.
Tags
None
Referenced Files
Unknown Object (File)
Fri, Jun 26, 12:34 AM
Unknown Object (File)
Thu, Jun 25, 2:34 PM
Unknown Object (File)
Wed, Jun 24, 9:54 AM
Unknown Object (File)
Wed, Jun 24, 8:04 AM
Unknown Object (File)
Wed, Jun 24, 8:00 AM
Unknown Object (File)
Wed, Jun 24, 7:29 AM
Unknown Object (File)
Wed, Jun 24, 1:00 AM
Unknown Object (File)
Tue, Jun 23, 8:20 AM

Details

Summary

Add eventlog(9), a subscription-based framework for emitting
structured per-session events from kernel subsystems to userspace
or in-kernel consumers. It has three concepts:

  • Provider: a kernel subsystem that emits events. Multiple providers may share a name (e.g. the default and RACK TCP stacks both register as "tcp"); each gets a unique 16-bit id.
  • Session: an observed entity, e.g. one TCP connection, identified by a provider-defined uint64_t.
  • Subscriber: a consumer of events, either via a per-CPU double-buffered ring exposed through the host-global /dev/eventlog device, or via a synchronous in-kernel callback.

The hot write path enters an smr(9) section, walks subscribers
without locks, and commits with a single atomic_fcmpset_64 on
64-bit targets or a per-pcpu MTX_SPIN on 32-bit targets; NMI
re-entrancy is detected via mtx_owned() to avoid deadlock.

A small DSL in <provider>_eventlog_schema.src files describes
event ids and payload layouts; eventlog_gen.awk processes a
schema into producer or consumer headers.

usr.bin/elog/ ships elog(1), a reference consumer that prints
events as text on stdout or writes a binary .elog stream (-o).
sys/kern/kern_eventlog_test.c implements ktest_eventlog(4) for
ring correctness under concurrent producers and the 32-bit
fallback path; tests/sys/kern/elog_test.py covers elog(1) CLI
smoke tests. Man pages: eventlog(9), elog(1), elog(5).

No in-tree providers ship with this import; downstream consumers
register their schemas and call the emit API directly.

Sponsored by: Netflix

Signed-off-by: Nick Banks <nickbanks@netflix.com>

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Skipped
Unit
Tests Skipped
Build Status
Buildable 73914
Build 70797: arc lint + arc unit

Event Timeline

Hi! We simplified the preferred license template now that SPDX notation has become an ISO standard. Thanks for including nice docs!

Add bin/elog to targets/pseudo/userland/Makefile.depend so DIRDEPS_BUILD also picks up the new userland binary.

Switch newly-introduced files to the simplified BSD-2-Clause template (Copyright + SPDX, no inlined boilerplate) per the licensing-policy doc. Existing files we only modify keep their existing headers untouched.

Add SPDX header to tests/sys/kern/kern_eventlog_test.py (it was previously unmarked).

rrs is currently traveling and has no access to phabricator Therefore he asked me to forward some questions:

  1. What is this for (statistics, debugging, ?)?
  2. If it’s for debugging why was BBlogging not used?
  3. If it’s for stats, why were trace points in BBLogging used?

rrs is currently traveling and has no access to phabricator Therefore he asked me to forward some questions:

  1. What is this for (statistics, debugging, ?)?
  2. If it’s for debugging why was BBlogging not used?
  3. If it’s for stats, why were trace points in BBLogging used?

This is primarily for debugging, but is designed to be general purpose, so it can be used for statistics as well. EventLog attempts to tackle a few issues, and therefore bblogging wasn't viable:

  • It's not TCP connection specific. It's meant to support any kernel (and hopefully user in the future) component. Even for TCP, it's designed to support global, non-connection specific events (such as stack-wide goodput histograms, at a specific sampling rate).
  • It uses a schema to create well-defined events that can be used to auto-generate both event fire macros and binary-to-text code. It also makes it much easier to write/generate tooling that consumes the binary events to do analysis/visualizations
  • It fundamentally changes the collection and memory model. Subscribers set which events they want to collect (via keywords and flags) and supply the (per-CPU) buffers that events are written into. Obviously the per-TCP connection buffering model doesn't work if you don't have a TCP connection.

So, TL;DR, eventlog is different enough from how bblog is architected that it must be a new system.

So, TL;DR, eventlog is different enough from how bblog is architected that it must be a new system.

My understanding is that eventlog is designed to be a much more generic and efficient kernel event logging system that will eventually replace bblog, and things like KTR. Is that fair?

As a user, I can speak to its efficiency. I added ev logging to lagg / lacp and select driver code to log LACP traffic and state changes so as to point fingers at Arista where we are having some .. issues... maintaining stable LACP with 400g NICs. So I have this thing running always-on snooping 1.2Tb/s of lacp traffic on a 3x400g prototype box.

usr.bin/elog/elog.c
446

Isn't this a usecase for STAILQ_FOREACH_SAFE()?

615

Would it make sense to describe this format in elog.1? Why are you using a width of two for the CPU? Why a minimum of 4 hex digits for the thread?

Address review feedback from tuexen on usr.bin/elog/elog.c:

  1. Replace the open-coded STAILQ_FIRST/STAILQ_NEXT iterator in close_session_file() with STAILQ_FOREACH_SAFE. The function does a STAILQ_REMOVE inside the loop, which is exactly what _SAFE exists for; even though we return immediately after the remove (so the unsafe variant is not actually buggy here), the macro is the idiomatic match and matches the rest of the file.
  1. Document the per-event output line format in elog.1 (CPU, thread id, timestamp, provider, session id, event name, payload), and explain the field widths. CPU stays at "%2u" minimum-width and TID stays at "%04x" minimum-width; both grow when the value needs more digits.

No functional change to capture or output content.

Address review feedback from tuexen on usr.bin/elog/elog.c: STAILQ_FOREACH_SAFE in close_session_file(), and document the per-event output line format in elog.1. No functional change.

Address tuexen review on usr.bin/elog/elog.c: STAILQ_FOREACH_SAFE in close_session_file() and document the per-event output line format in elog.1.

usr.bin/elog/elog.c
959

Why not use getopt() or getopt_long() if you want support options with names? That way something like elog -enr name.elogwould work (as I expected).

usr.bin/elog/elog.1
182

Not sure it makes sense to mention oca.py upstream...

Address review feedback: parse elog(1) options with getopt_long; drop internal man-page xref

Address review feedback: parse elog(1) options with getopt_long; drop internal man-page xref

Thank you. Much easier to use now.

elog: emit 'unknown argument' for unrecognized options

usr.bin/elog/elog.c
121

I have a question regarding endianness. I think you wanted that *.elog files are written in host byte order. And this is what happens. But how should a reader of the file detect the byte order? I was assuming this can be done by checking the byte order of the magic number, but we have a magic string. I tested this on a little endian and big endian machine. Here is what I get (one .elog-file was generated on a little endian machine, one was generated on a big endian machine):

tuexen@blackbird:~ % hexdump -n 40 -C unknown_17463_127.0.0.1_1234.elog 
00000000  45 4c 4f 47 00 00 00 01  00 00 00 00 25 ce f9 26  |ELOG........%..&|
00000010  00 06 54 c1 d6 51 af 50  00 00 00 00 00 00 00 97  |..T..Q.P........|
00000020  00 00 00 00 00 00 00 00                           |........|
00000028
tuexen@blackbird:~ % hexdump -n 40 -C unknown_55792_127.0.0.1_1234.elog 
00000000  45 4c 4f 47 01 00 00 00  33 d5 c9 08 00 00 00 00  |ELOG....3.......|
00000010  d7 e3 2d fa 50 54 06 00  97 00 00 00 00 00 00 00  |..-.PT..........|
00000020  00 00 00 00 00 00 00 00                           |........|
00000028
tuexen@blackbird:~ %

So how do you envision the reader detects the byte ordering? By looking at the version number?

nickbanks_netflix.com added inline comments.
usr.bin/elog/elog.c
121

I think the best answer is to actually make elog always use little endian, which will require updating both elog and eventlog (the AWK script that writes event payloads). I will do that in a follow up commit instead of shoving more into this one.

What is the rationale for using an AWK file to generate source? I find that it makes it hard to intuitively reason about what this is doing. (And, FWIW, my past experience is that things like this almost never stand the test of time--you almost always find a new capability you need which requires refactoring.) Can this be reworked in a way which makes the code more obvious (e.g. code which is included, or a more standard macro expansion)?

I think past history (e.g. with black box logging) has shown that it is easy to get mismatches between the schema a kernel is using and the schema user space is using, particularly when in a development environment (like the main branch) where you may recompile the kernel much more than userspace. How does your code avoid/address these mismatches?

Assume this happens:

  • Active buffer is 0
  • Active buffer swaps to 1
  • Reader begins reading
  • Active buffer swaps to 0 while reader is still reading

What will result?

In D56979#1329232, @jtl wrote:

What is the rationale for using an AWK file to generate source? I find that it makes it hard to intuitively reason about what this is doing.

My (limited) understanding was that AWK files were the standard way to have build-time generated headers.

(And, FWIW, my past experience is that things like this almost never stand the test of time--you almost always find a new capability you need which requires refactoring.) Can this be reworked in a way which makes the code more obvious (e.g. code which is included, or a more standard macro expansion)?

I have gone through many iterations of this process already, adding/modifying features and it hasn't required any full refactor (so far). I'm happy to use a different model, though I'm unclear on what exactly you're looking for. So, I'd appreciate some suggestions.

I think past history (e.g. with black box logging) has shown that it is easy to get mismatches between the schema a kernel is using and the schema user space is using, particularly when in a development environment (like the main branch) where you may recompile the kernel much more than userspace. How does your code avoid/address these mismatches?

Today, the solution to this is that you only append to the schema (either new fields to an existing event, or new events). User space will only decode what it knows about.

In an internal PR, I am leaning towards a model where providers give a binary representation of the schema itself, that elog (or other subscribers, i.e. a cloud backend) use to dynamically decode events. The above, static model would also continue to work if desired, but would require (generally, manual) updates when the provider changes. Practically, this is Ok (IMO), because real consumers of these events need to understand the meaning behind them, for instance if you want to plot a TCP window over time, you need more than a list of struct formats and sizes. You need to know which event includes the CWND, and understand when/why it's logged.

Assume this happens:

  • Active buffer is 0
  • Active buffer swaps to 1
  • Reader begins reading
  • Active buffer swaps to 0 while reader is still reading

What will result?

A swap can only occur if the reader's buffer is empty. The writer's buffer may or may not be full (but it will have something, otherwise there was no point to swap two empty buffers). So, in your example above, if the reader was still reading then the buffer swap wouldn't/couldn't have occurred. If the reader had completed reading everything, then the swap would occur and then the next read from the reader would start consuming whatever the writer had put in the buffer before the swap.

In D56979#1329232, @jtl wrote:

What is the rationale for using an AWK file to generate source? I find that it makes it hard to intuitively reason about what this is doing.

My (limited) understanding was that AWK files were the standard way to have build-time generated headers.

They are a standard way to do this. I just don't recall another example that is so long and detailed. I generally don't like non-trivial auto-generated code because it makes debugging very difficult. With GenAI, this may become less of an issue. And I will admit that it is a matter of preference, not correctness. But my history in trying to track down obscure bugs tells me that auto-generated code obfuscates enough that it complicates debugging.

I really will have to take the time to grok the code better before I have a substantive suggestion for an alternative. The generic answers are macros and decomposing things that can be de-duplicated into a shared include file. I don't know whether either one is appropriate here.

I think past history (e.g. with black box logging) has shown that it is easy to get mismatches between the schema a kernel is using and the schema user space is using, particularly when in a development environment (like the main branch) where you may recompile the kernel much more than userspace. How does your code avoid/address these mismatches?

Today, the solution to this is that you only append to the schema (either new fields to an existing event, or new events). User space will only decode what it knows about.

In an internal PR, I am leaning towards a model where providers give a binary representation of the schema itself, that elog (or other subscribers, i.e. a cloud backend) use to dynamically decode events. The above, static model would also continue to work if desired, but would require (generally, manual) updates when the provider changes. Practically, this is Ok (IMO), because real consumers of these events need to understand the meaning behind them, for instance if you want to plot a TCP window over time, you need more than a list of struct formats and sizes. You need to know which event includes the CWND, and understand when/why it's logged.

I think that makes sense and could be a nice addition. Alternatively, versioning could help detect (and, possibly, resolve) mismatches. If you don't (yet) have versioning, you may want to add it as a stop gap until something better is ready. IIRC, those of us doing dev work with early versions of black box logs wasted lots of time chasing bad info because we hadn't realized the versions in the kernel and user space were mismatched.

Assume this happens:

  • Active buffer is 0
  • Active buffer swaps to 1
  • Reader begins reading
  • Active buffer swaps to 0 while reader is still reading

What will result?

A swap can only occur if the reader's buffer is empty. The writer's buffer may or may not be full (but it will have something, otherwise there was no point to swap two empty buffers). So, in your example above, if the reader was still reading then the buffer swap wouldn't/couldn't have occurred. If the reader had completed reading everything, then the swap would occur and then the next read from the reader would start consuming whatever the writer had put in the buffer before the swap.

And, I assume, we would drop new events until a swap can occur? If so, that is a reasonable outcome.

TBH, it will take me quite a while to grok this review and have an intelligent opinion. I think we've always had a hard time getting effective reviews on 10K-line (actually, 1K+ line) commits. That's a function of relying on human volunteers to review. If there is any way you can decompose the commit, it might help. But it doesn't look like it might be easily decomposable.