Improve dumping a kernel using the vtnet interface
Closed, Public

Authored by tuexen on Apr 5 2022, 9:53 AM.

Details

Summary

When the code that triggers the panic has not yet entered the network epoch, dumping the kernel fails, because the software LRO code assumes that the network epoch has been entered. Therefore, disable software LRO during dumping.

Test Plan

Use sudo sysctl debug.kdb.panic=1 to trigger a panic and then use the dump command to save a core to a remote server.

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

tuexen requested review of this revision. Apr 5 2022, 9:53 AM

Which assertion fails?

db> dump
debugnet: overwriting mbuf zone pointers
debugnet_connect: searching for gateway MAC...
panic: Assertion in_epoch(net_epoch_preempt) failed at /root/freebsd-src/sys/netinet/tcp_lro.c:1502
cpuid = 0
time = 1649045641
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0xc7/frame 0xfffffe00897da010
kdb_backtrace() at kdb_backtrace+0xd3/frame 0xfffffe00897da170
vpanic() at vpanic+0x2b8/frame 0xfffffe00897da250
panic() at panic+0xb5/frame 0xfffffe00897da310
tcp_lro_flush_all() at tcp_lro_flush_all+0x48f/frame 0xfffffe00897da390
vtnet_rxq_eof() at vtnet_rxq_eof+0x17c0/frame 0xfffffe00897da570
vtnet_debugnet_poll() at vtnet_debugnet_poll+0xa3/frame 0xfffffe00897da5b0
debugnet_arp_gw() at debugnet_arp_gw+0x53c/frame 0xfffffe00897da6f0
debugnet_connect() at debugnet_connect+0x904/frame 0xfffffe00897da870
netdump_start() at netdump_start+0x2c5/frame 0xfffffe00897da9b0
dump_start() at dump_start+0x2ac/frame 0xfffffe00897dab30
cpu_minidumpsys() at cpu_minidumpsys+0x10ff/frame 0xfffffe00897dacb0
dumpsys_generic() at dumpsys_generic+0x160/frame 0xfffffe00897daea0
doadump() at doadump+0xe8/frame 0xfffffe00897daed0
db_dump() at db_dump+0x4a/frame 0xfffffe00897daef0
db_command() at db_command+0x441/frame 0xfffffe00897db090
db_command_loop() at db_command_loop+0x82/frame 0xfffffe00897db0b0
db_trap() at db_trap+0x27f/frame 0xfffffe00897db1f0
kdb_trap() at kdb_trap+0x2c3/frame 0xfffffe00897db2f0
trap() at trap+0x506/frame 0xfffffe00897db4e0
calltrap() at calltrap+0x8/frame 0xfffffe00897db4e0
--- trap 0x3, rip = 0xffffffff817737db, rsp = 0xfffffe00897db5b0, rbp = 0xfffffe00897db5d0 ---
kdb_enter() at kdb_enter+0x6b/frame 0xfffffe00897db5d0
vpanic() at vpanic+0x324/frame 0xfffffe00897db6b0
panic() at panic+0xb5/frame 0xfffffe00897db770
__rw_wlock_hard() at __rw_wlock_hard+0x1179/frame 0xfffffe00897db8d0
_rw_wlock_cookie() at _rw_wlock_cookie+0x1d7/frame 0xfffffe00897db9a0
cc_deregister_algo() at cc_deregister_algo+0x2e/frame 0xfffffe00897db9e0
cc_modevent() at cc_modevent+0x16e/frame 0xfffffe00897dba10
module_unload() at module_unload+0x4e/frame 0xfffffe00897dba30
linker_file_unload() at linker_file_unload+0x46b/frame 0xfffffe00897dbb40
kern_kldunload() at kern_kldunload+0x340/frame 0xfffffe00897dbb90
vfs_byname_kld() at vfs_byname_kld+0x151/frame 0xfffffe00897dbc50
sys_mount() at sys_mount+0x1de/frame 0xfffffe00897dbd30
amd64_syscall() at amd64_syscall+0x40c/frame 0xfffffe00897dbf30
fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe00897dbf30
--- syscall (198, FreeBSD ELF64, nosys), rip = 0x2ad12a, rsp = 0x825db4f08, rbp = 0x825db4f80 ---
Uptime: 1m20s

The problem only shows up if the panic happens without being in the network epoch. For panics where you are already in the network epoch, the dump works.
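
To illustrate the failure mode described above, here is a minimal user-space sketch in C. It is not kernel code: `thread_model`, `model_epoch_enter()`, `model_epoch_exit()`, `model_lro_flush_ok()` and `model_rx_poll()` are hypothetical names standing in for the per-thread epoch state, the check that `tcp_lro_flush_all()` asserts, and the polling receive path. The flush check returns false where the real kernel would panic.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-thread network epoch state. */
struct thread_model {
	int epoch_nesting;	/* > 0 while inside the network epoch */
};

static void
model_epoch_enter(struct thread_model *td)
{
	td->epoch_nesting++;
}

static void
model_epoch_exit(struct thread_model *td)
{
	assert(td->epoch_nesting > 0);
	td->epoch_nesting--;
}

/*
 * Models the condition asserted on the LRO flush path.
 * Returns false where the real assertion would fire.
 */
static bool
model_lro_flush_ok(const struct thread_model *td)
{
	return (td->epoch_nesting > 0);
}

/*
 * Models the polling receive path used while dumping: the flush
 * (and therefore the epoch requirement) is only reached when
 * software LRO is enabled.
 */
static bool
model_rx_poll(const struct thread_model *td, bool sw_lro)
{
	if (sw_lro)
		return (model_lro_flush_ok(td));
	return (true);
}
```

In this model, a thread that panicked outside the epoch fails the check only when software LRO is enabled, which matches the observation that dumps already work for panics raised inside the network epoch.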

I suspect the right solution is to somehow ensure that LRO is not in use when dumping. Either by checking dumping in vtnet_software_lro() or (probably better) clearing the software LRO flag in the vtnet softc in vtnet_debugnet_event().

BTW, I'm a little confused by vtnet_rxq_input(): doesn't it pass all input packets to vtnet_lro_rx(), not just TCP packets?

Hmm. The proposed fix is similar to what is done in iflib.c.

BTW, I'm a little confused by vtnet_rxq_input(): doesn't it pass all input packets to vtnet_lro_rx(), not just TCP packets?

Not sure. Would need to look at the code...

Hmm. The proposed fix is similar to what is done in iflib.c.

Well, the commit that added that just slapped net_epoch sections around all iflib_rxeof() calls. For debugnet that doesn't make sense, since we won't call ether_input() when dumping: debugnet swaps out the if_input pointer.
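
The pointer swap mentioned above can be sketched in user-space C as follows. This is only a model under stated assumptions: `ifnet_model`, `model_ether_input()`, `model_debugnet_input()`, `model_swap_input()` and `model_driver_rx()` are hypothetical names, and the counters exist only to make the effect observable.

```c
#include <stddef.h>

struct ifnet_model;
typedef void (*input_fn)(struct ifnet_model *, const char *);

/* Hypothetical stand-in for an interface with an input hook. */
struct ifnet_model {
	input_fn if_input;	/* where received packets are delivered */
	int stack_pkts;		/* packets handed to the normal stack */
	int dump_pkts;		/* packets handed to the dump handler */
};

/* Models the normal input path into the network stack. */
static void
model_ether_input(struct ifnet_model *ifp, const char *pkt)
{
	(void)pkt;
	ifp->stack_pkts++;
}

/* Models the minimal input handler used while dumping. */
static void
model_debugnet_input(struct ifnet_model *ifp, const char *pkt)
{
	(void)pkt;
	ifp->dump_pkts++;
}

/* Install a new input hook and return the old one for later restore. */
static input_fn
model_swap_input(struct ifnet_model *ifp, input_fn new_input)
{
	input_fn old = ifp->if_input;

	ifp->if_input = new_input;
	return (old);
}

/* Models the driver receive path: it always calls through the hook. */
static void
model_driver_rx(struct ifnet_model *ifp, const char *pkt)
{
	ifp->if_input(ifp, pkt);
}
```

Once the hook is swapped, the driver's receive path no longer reaches the normal stack input at all, which is why wrapping it in an epoch section buys nothing in this model.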

We talked about this at the FreeBSD transport call. glebius@ suggested following markj@'s suggestion to disable TCP LRO when dumping the kernel. tuexen@ will look at it.

To sum up what we discussed on the call:

  • If we really want to enter epoch for network dumping, that should be done in the netdumper code, not copy-pasted to every driver.
  • Another option is to disable assertions while we are in the dumper.
  • Disabling LRO is a good idea. In general for a dumper we want to execute as little code as possible and prefer simple code over high performance.

As suggested by Mark, disable software LRO for the vtnet interface during dumping.
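
A rough user-space sketch of this approach, clearing a software LRO flag in the softc when dumping starts, might look like the following. It is not the committed change: `softc_model`, `MODEL_FLAG_SW_LRO`, the `model_dump_event` enum, `model_on_dump_event()` and `model_software_lro()` are hypothetical names, and restoring the flag when the dump ends is an assumption of this sketch.

```c
#include <stdbool.h>
#include <stdint.h>

#define	MODEL_FLAG_SW_LRO	0x0001u	/* hypothetical flag bit */

/* Hypothetical dump lifecycle events. */
enum model_dump_event {
	MODEL_DUMP_START,
	MODEL_DUMP_END
};

/* Hypothetical stand-in for the driver softc. */
struct softc_model {
	uint32_t flags;
	bool lro_saved;		/* was software LRO on before the dump? */
};

/* Models the check the receive path uses to decide whether to do LRO. */
static bool
model_software_lro(const struct softc_model *sc)
{
	return ((sc->flags & MODEL_FLAG_SW_LRO) != 0);
}

/*
 * On dump start, remember and clear the software LRO flag so the
 * receive path never reaches the LRO code. On dump end, restore it
 * (an assumption of this sketch). Only a flag is touched; no driver
 * reconfiguration is performed.
 */
static void
model_on_dump_event(struct softc_model *sc, enum model_dump_event ev)
{
	switch (ev) {
	case MODEL_DUMP_START:
		sc->lro_saved = model_software_lro(sc);
		sc->flags &= ~MODEL_FLAG_SW_LRO;
		break;
	case MODEL_DUMP_END:
		if (sc->lro_saved)
			sc->flags |= MODEL_FLAG_SW_LRO;
		sc->lro_saved = false;
		break;
	}
}
```

Flipping a single flag keeps the amount of code executed after a panic small, which fits the preference for simple code in the dumper.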

tuexen retitled this revision from Enter epoch when dumping kernel via vtnet to Always allow dumping a kernel using the vtnet interface. Apr 17 2022, 8:51 PM
tuexen edited the summary of this revision.
tuexen edited the test plan for this revision.
tuexen retitled this revision from Always allow dumping a kernel using the vtnet interface to Improve dumping a kernel using the vtnet interface. Apr 17 2022, 8:55 PM
markj added inline comments.
sys/dev/virtio/network/if_vtnet.c
4414

I think this comment is too narrow: the real reason to disable LRO is that we simply don't want to use features not strictly required for netdump's functionality.

This revision is now accepted and ready to land. Apr 18 2022, 1:10 PM
sys/dev/virtio/network/if_vtnet.c
4414

But if we write that we want to disable all features not strictly required for dumping, shouldn't the code then also disable TSO, checksum offloading, and possibly more?
All I want to do is get the dumper running, and I'm disabling a feature that must not be enabled for that to work.

sys/dev/virtio/network/if_vtnet.c
4414

Yes, in general features that we don't strictly need should be off when possible. At this point the system has panicked, so we also want to avoid reconfiguration operations which involve executing lots of driver code, so there's a tradeoff.

To be clear, I'm ok with the change.

This revision was automatically updated to reflect the committed changes.