Page MenuHomeFreeBSD

kern: import eventlog(9), a kernel event logging framework
Needs ReviewPublic

Authored by nickbanks_netflix.com on Tue, May 12, 9:25 PM.
Tags
None
Referenced Files
Unknown Object (File)
Thu, Jun 4, 5:37 PM
Unknown Object (File)
Thu, Jun 4, 2:20 PM
Unknown Object (File)
Thu, Jun 4, 11:51 AM
Unknown Object (File)
Thu, Jun 4, 4:48 AM
Unknown Object (File)
Wed, Jun 3, 3:30 PM
Unknown Object (File)
Tue, Jun 2, 11:08 AM
Unknown Object (File)
Tue, Jun 2, 10:29 AM
Unknown Object (File)
Sun, May 31, 6:04 AM
Subscribers

Details

Summary

Add eventlog(9), a subscription-based framework for emitting
structured per-session events from kernel subsystems to userspace
or in-kernel consumers. It has three concepts:

  • Provider: a kernel subsystem that emits events. Multiple providers may share a name (e.g. the default and RACK TCP stacks both register as "tcp"); each gets a unique 16-bit id.
  • Session: an observed entity, e.g. one TCP connection, identified by a provider-defined uint64_t.
  • Subscriber: a consumer of events, either via a per-CPU double-buffered ring exposed through the host-global /dev/eventlog device, or via a synchronous in-kernel callback.

The hot write path enters an smr(9) section, walks subscribers
without locks, and commits with a single atomic_fcmpset_64 on
64-bit targets or a per-pcpu MTX_SPIN on 32-bit targets; NMI
re-entrancy is detected via mtx_owned() to avoid deadlock.

A small DSL in <provider>_eventlog_schema.src files describes
event ids and payload layouts; eventlog_gen.awk processes a
schema into producer or consumer headers.

usr.bin/elog/ ships elog(1), a reference consumer that prints
events as text on stdout or writes a binary .elog stream (-o).
sys/kern/kern_eventlog_test.c implements ktest_eventlog(4) for
ring correctness under concurrent producers and the 32-bit
fallback path; tests/sys/kern/elog_test.py covers elog(1) CLI
smoke tests. Man pages: eventlog(9), elog(1), elog(5).

No in-tree providers ship with this import; downstream consumers
register their schemas and call the emit API directly.

Sponsored by: Netflix

Signed-off-by: Nick Banks <nickbanks@netflix.com>

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Skipped
Unit
Tests Skipped
Build Status
Buildable 73746
Build 70629: arc lint + arc unit

Event Timeline

Hi! We simplified the preferred license template now that SPDX notation has become an ISO standard. Thanks for including nice docs!

Add bin/elog to targets/pseudo/userland/Makefile.depend so DIRDEPS_BUILD also picks up the new userland binary.

Switch newly-introduced files to the simplified BSD-2-Clause template (Copyright + SPDX, no inlined boilerplate) per the licensing-policy doc. Existing files we only modify keep their existing headers untouched.

Add SPDX header to tests/sys/kern/kern_eventlog_test.py (it was previously unmarked).

rrs is currently traveling and has no access to phabricator Therefore he asked me to forward some questions:

  1. What is this for (statistics, debugging, ?)?
  2. If it’s for debugging why was BBlogging not used?
  3. If it’s for stats, why were trace points in BBLogging used?

rrs is currently traveling and has no access to phabricator Therefore he asked me to forward some questions:

  1. What is this for (statistics, debugging, ?)?
  2. If it’s for debugging why was BBlogging not used?
  3. If it’s for stats, why were trace points in BBLogging used?

This is primarily for debugging, but is designed to be general purpose, so it can be used for statistics as well. EventLog attempts to tackle a few issues, and therefore bblogging wasn't viable:

  • It's not TCP connection specific. It's meant to support any kernel (and hopefully user in the future) component. Even for TCP, it's designed to support global, non-connection specific events (such as stack-wide goodput histograms, at a specific sampling rate).
  • It uses a schema to create well-defined events that can be used to auto-generate both event fire macros and binary-to-text code. It also makes it much easier to write/generate tooling that consumes the binary events to do analysis/visualizations
  • It fundamentally changes the collection and memory model. Subscribers set which events they want to collect (via keywords and flags) and supply the (per-CPU) buffers that events are written into. Obviously the per-TCP connection buffering model doesn't work if you don't have a TCP connection.

So, TL;DR, eventlog is different enough from how bblog is architected that it must be a new system.

So, TL;DR, eventlog is different enough from how bblog is architected that it must be a new system.

My understanding is that eventlog is designed to be a much more generic and efficient kernel event logging system that will eventually replace bblog, and things like KTR. Is that fair?

As a user, I can speak to its efficiency. I added ev logging to lagg / lacp and select driver code to log LACP traffic and state changes so as to point fingers at Arista where we are having some .. issues... maintaining stable LACP with 400g NICs. So I have this thing running always-on snooping 1.2Tb/s of lacp traffic on a 3x400g prototype box.

usr.bin/elog/elog.c
446

Isn't this a usecase for STAILQ_FOREACH_SAFE()?

615

Would it make sense to describe this format in elog.1? Why are you using a width of two for the CPU? Why a minimum of 4 hex digits for the thread?

Address review feedback from tuexen on usr.bin/elog/elog.c:

  1. Replace the open-coded STAILQ_FIRST/STAILQ_NEXT iterator in close_session_file() with STAILQ_FOREACH_SAFE. The function does a STAILQ_REMOVE inside the loop, which is exactly what _SAFE exists for; even though we return immediately after the remove (so the unsafe variant is not actually buggy here), the macro is the idiomatic match and matches the rest of the file.
  1. Document the per-event output line format in elog.1 (CPU, thread id, timestamp, provider, session id, event name, payload), and explain the field widths. CPU stays at "%2u" minimum-width and TID stays at "%04x" minimum-width; both grow when the value needs more digits.

No functional change to capture or output content.

Address review feedback from tuexen on usr.bin/elog/elog.c: STAILQ_FOREACH_SAFE in close_session_file(), and document the per-event output line format in elog.1. No functional change.

Address tuexen review on usr.bin/elog/elog.c: STAILQ_FOREACH_SAFE in close_session_file() and document the per-event output line format in elog.1.