Disambiguate KASSERTs in the m_apply() mbuf checking code and provide more context
Closed, Public

Authored by sobomax on Feb 7 2024, 1:40 AM.
Tags
None

Details

Summary

While debugging some rare crashes in the ng_nat(4) code, in which an mbuf appears to be getting corrupted in libalias, it was noticed that two distinct KASSERTs in the m_apply() function have exactly the same panic message, making them impossible to distinguish after the fact.
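
For context, the structure of m_apply() in sys/kern/uipc_mbuf.c looks roughly like the sketch below (simplified for illustration; this is neither the upstream source nor the diff under review verbatim). Both loops guard against walking off the end of the chain with the same KASSERT message, so a panic alone cannot tell which one fired:

static int
m_apply_sketch(struct mbuf *m, int off, int len,
    int (*f)(void *arg, void *data, u_int len), void *arg)
{
	u_int count;
	int rval;

	/* Skip over the first "off" bytes of the chain. */
	while (off > 0) {
		KASSERT(m != NULL, ("m_apply, offset > size of mbuf chain"));
		if (off < m->m_len)
			break;
		off -= m->m_len;
		m = m->m_next;
	}
	/* Apply f() to each mbuf covering the next "len" bytes. */
	while (len > 0) {
		/* Identical message to the assertion above. */
		KASSERT(m != NULL, ("m_apply, offset > size of mbuf chain"));
		count = min(m->m_len - off, len);
		rval = (*f)(arg, mtod(m, caddr_t) + off, count);
		if (rval != 0)
			return (rval);
		len -= count;
		off = 0;
		m = m->m_next;
	}
	return (0);
}

The change gives each assertion its own message and additional context, so a post-mortem can tell which loop panicked.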

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Skipped
Unit
Tests Skipped

Event Timeline

glebius added a reviewer: glebius.

Max, you've got a commit bit; please go forward with pushing unless some other reviewer objects in the next two days.

This revision is now accepted and ready to land. Feb 7 2024, 4:42 AM

This shouldn't be necessary since commit 0d9c3423f59bb. Most mb_unmapped_to_ext() calls have been removed from the tree. Which version are you running?

Withdraw as I was wrong with my quick conclusions.

This revision now requires review to proceed. Feb 8 2024, 3:54 AM

Withdraw as I was wrong with my quick conclusions.

This shouldn't be necessary since commit 0d9c3423f59bb. Most mb_unmapped_to_ext() calls have been removed from the tree. Which version are you running?

@markj we've seen those crashes in releng/13.1, which has 0d9c3423f59bb merged.

Nov 8 13:49:23 ip-10-11-227-254 syslogd: last message repeated 1 times
Nov 8 13:49:23 ip-10-11-227-254 kernel: Fatal trap 12: page fault while in kernel mode
Nov 8 13:49:23 ip-10-11-227-254 kernel: cpuid = 2; apic id = 02
Nov 8 13:49:23 ip-10-11-227-254 kernel: fault virtual address = 0x18
Nov 8 13:49:23 ip-10-11-227-254 kernel: fault code = supervisor read data, page not present
Nov 8 13:49:23 ip-10-11-227-254 kernel: instruction pointer = 0x20:0xffffffff80664190
Nov 8 13:49:23 ip-10-11-227-254 kernel: stack pointer = 0x28:0xfffffe00ba96ca60
Nov 8 13:49:23 ip-10-11-227-254 kernel: frame pointer = 0x28:0xfffffe00ba96caa0
Nov 8 13:49:23 ip-10-11-227-254 kernel: code segment = base 0x0, limit 0xfffff, type 0x1b
Nov 8 13:49:23 ip-10-11-227-254 kernel: = DPL 0, pres 1, long 1, def32 0, gran 1
Nov 8 13:49:23 ip-10-11-227-254 kernel: processor eflags = interrupt enabled, resume, IOPL = 0
Nov 8 13:49:23 ip-10-11-227-254 kernel: current process = 0 (ena0 queue 1 cleanu)
Nov 8 13:49:23 ip-10-11-227-254 kernel: trap number = 12
Nov 8 13:49:23 ip-10-11-227-254 kernel: panic: page fault
Nov 8 13:49:23 ip-10-11-227-254 kernel: cpuid = 2
Nov 8 13:49:23 ip-10-11-227-254 kernel: time = 1699480104
Nov 8 13:49:23 ip-10-11-227-254 kernel: KDB: stack backtrace:
Nov 8 13:49:23 ip-10-11-227-254 kernel: #0 0xffffffff8061f6b5 at kdb_backtrace+0x65
Nov 8 13:49:23 ip-10-11-227-254 kernel: #1 0xffffffff805d5f9f at vpanic+0x17f
Nov 8 13:49:23 ip-10-11-227-254 kernel: #2 0xffffffff805d5e13 at panic+0x43
Nov 8 13:49:23 ip-10-11-227-254 kernel: #3 0xffffffff808fbc35 at trap_fatal+0x385
Nov 8 13:49:23 ip-10-11-227-254 kernel: #4 0xffffffff808fbc8f at trap_pfault+0x4f
Nov 8 13:49:23 ip-10-11-227-254 kernel: #5 0xffffffff808d2f78 at calltrap+0x8
Nov 8 13:49:23 ip-10-11-227-254 kernel: #6 0xffffffff8071c60f at in_cksum_skip+0x2f
Nov 8 13:49:23 ip-10-11-227-254 kernel: #7 0xffffffff8072ea8d at in_delayed_cksum+0x3d
Nov 8 13:49:23 ip-10-11-227-254 kernel: #8 0xffffffff8070eb2f at ng_nat_rcvdata+0x32f
Nov 8 13:49:23 ip-10-11-227-254 kernel: #9 0xffffffff80709f4f at ng_apply_item+0x2bf
Nov 8 13:49:23 ip-10-11-227-254 kernel: #10 0xffffffff80709a0e at ng_snd_item+0x28e
Nov 8 13:49:23 ip-10-11-227-254 kernel: #11 0xffffffff8070d6bc at ng_ether_input+0x4c
Nov 8 13:49:23 ip-10-11-227-254 kernel: #12 0xffffffff806e9319 at ether_nh_input+0x1c9
Nov 8 13:49:23 ip-10-11-227-254 kernel: #13 0xffffffff806f5788 at netisr_dispatch_src+0x88
Nov 8 13:49:23 ip-10-11-227-254 kernel: #14 0xffffffff806e85c9 at ether_input+0x39
Nov 8 13:49:23 ip-10-11-227-254 kernel: #15 0xffffffff80420f1d at ena_cleanup+0x59d
Nov 8 13:49:23 ip-10-11-227-254 kernel: #16 0xffffffff80633c91 at taskqueue_run_locked+0x181
Nov 8 13:49:23 ip-10-11-227-254 kernel: #17 0xffffffff80634fa2 at taskqueue_thread_loop+0xc2
Nov 8 13:49:23 ip-10-11-227-254 kernel: Uptime: 19d9h27m23s
Nov 8 13:49:23 ip-10-11-227-254 kernel: Rebooting...

Based on the fault address (0x18), I suspect that the crash happens in m_apply() when loading m->m_len, i.e., the mbuf chain is shorter than expected. So I suspect the bug is elsewhere. It's hard to say without INVARIANTS or a kernel dump.

Looks like some other issue is at play; I am seeing those crashes even with my patch applied.

Jan 20 23:06:03 ip-10-11-227-254 syslogd: last message repeated 1 times
Jan 20 23:06:03 ip-10-11-227-254 kernel: Fatal trap 12: page fault while in kernel mode
Jan 20 23:06:03 ip-10-11-227-254 kernel: cpuid = 2; apic id = 02
Jan 20 23:06:03 ip-10-11-227-254 kernel: fault virtual address  = 0x18
Jan 20 23:06:03 ip-10-11-227-254 kernel: fault code             = supervisor read data, page not present
Jan 20 23:06:03 ip-10-11-227-254 kernel: instruction pointer    = 0x20:0xffffffff80664190
Jan 20 23:06:03 ip-10-11-227-254 kernel: stack pointer          = 0x28:0xfffffe00bba099f0
Jan 20 23:06:03 ip-10-11-227-254 kernel: frame pointer          = 0x28:0xfffffe00bba09a30
Jan 20 23:06:03 ip-10-11-227-254 kernel: code segment           = base 0x0, limit 0xfffff, type 0x1b
Jan 20 23:06:03 ip-10-11-227-254 kernel:                        = DPL 0, pres 1, long 1, def32 0, gran 1
Jan 20 23:06:03 ip-10-11-227-254 kernel: processor eflags       = interrupt enabled, resume, IOPL = 0
Jan 20 23:06:03 ip-10-11-227-254 kernel: current process                = 0 (ena0 queue 3 cleanu)
Jan 20 23:06:03 ip-10-11-227-254 kernel: trap number            = 12
Jan 20 23:06:03 ip-10-11-227-254 kernel: panic: page fault
Jan 20 23:06:03 ip-10-11-227-254 kernel: cpuid = 2
Jan 20 23:06:03 ip-10-11-227-254 kernel: time = 1705820706
Jan 20 23:06:03 ip-10-11-227-254 kernel: KDB: stack backtrace:
Jan 20 23:06:03 ip-10-11-227-254 kernel: #0 0xffffffff8061f6b5 at kdb_backtrace+0x65
Jan 20 23:06:03 ip-10-11-227-254 kernel: #1 0xffffffff805d5f9f at vpanic+0x17f
Jan 20 23:06:03 ip-10-11-227-254 kernel: #2 0xffffffff805d5e13 at panic+0x43
Jan 20 23:06:03 ip-10-11-227-254 kernel: #3 0xffffffff808fbc35 at trap_fatal+0x385
Jan 20 23:06:03 ip-10-11-227-254 kernel: #4 0xffffffff808fbc8f at trap_pfault+0x4f
Jan 20 23:06:03 ip-10-11-227-254 kernel: #5 0xffffffff808d2fc8 at calltrap+0x8
Jan 20 23:06:03 ip-10-11-227-254 kernel: #6 0xffffffff8071c61f at in_cksum_skip+0x2f
Jan 20 23:06:03 ip-10-11-227-254 kernel: #7 0xffffffff8072ea9d at in_delayed_cksum+0x3d
Jan 20 23:06:03 ip-10-11-227-254 kernel: #8 0xffffffff8070eb3b at ng_nat_rcvdata+0x33b
Jan 20 23:06:03 ip-10-11-227-254 kernel: #9 0xffffffff80709f4f at ng_apply_item+0x2bf
Jan 20 23:06:03 ip-10-11-227-254 kernel: #10 0xffffffff80709a0e at ng_snd_item+0x28e
Jan 20 23:06:03 ip-10-11-227-254 kernel: #11 0xffffffff8070d6bc at ng_ether_input+0x4c
Jan 20 23:06:03 ip-10-11-227-254 kernel: #12 0xffffffff806e9319 at ether_nh_input+0x1c9
Jan 20 23:06:03 ip-10-11-227-254 kernel: #13 0xffffffff806f5788 at netisr_dispatch_src+0x88
Jan 20 23:06:03 ip-10-11-227-254 kernel: #14 0xffffffff806e85c9 at ether_input+0x39
Jan 20 23:06:03 ip-10-11-227-254 kernel: #15 0xffffffff80740354 at tcp_lro_flush+0x2f4
Jan 20 23:06:03 ip-10-11-227-254 kernel: #16 0xffffffff8074065b at tcp_lro_flush_all+0x17b
Jan 20 23:06:03 ip-10-11-227-254 kernel: #17 0xffffffff80421078 at ena_cleanup+0x6f8
Jan 20 23:06:03 ip-10-11-227-254 kernel: Uptime: 5d5h55m55s
Jan 20 23:06:03 ip-10-11-227-254 kernel: Rebooting...
Jan 20 23:06:03 ip-10-11-227-254 kernel: cpu_reset: Restarting BSP
Jan 20 23:06:03 ip-10-11-227-254 kernel: cpu_reset_proxy: Stopped CPU 2

Different call chain though, @markj, any ideas? ๐Ÿ‘€

Not offhand. You'd need to get a kernel dump and look at the mbuf chain, and/or try an INVARIANTS kernel.

The crash trace seems to be misleading; based on the crash IP (0xffffffff80664190), the crash is inside m_apply():

Dump of assembler code for function m_apply:
   0xffffffff80664100 <+0>:     push   %rbp
   0xffffffff80664101 <+1>:     mov    %rsp,%rbp
   0xffffffff80664104 <+4>:     push   %r15
   0xffffffff80664106 <+6>:     push   %r14
   0xffffffff80664108 <+8>:     push   %r13
   0xffffffff8066410a <+10>:    push   %r12
   0xffffffff8066410c <+12>:    push   %rbx
   0xffffffff8066410d <+13>:    sub    $0x18,%rsp
   0xffffffff80664111 <+17>:    mov    %r8,-0x38(%rbp)
   0xffffffff80664115 <+21>:    mov    %rcx,-0x30(%rbp)
   0xffffffff80664119 <+25>:    mov    %edx,%r12d
   0xffffffff8066411c <+28>:    mov    %rdi,%rbx
   0xffffffff8066411f <+31>:    test   %esi,%esi
   0xffffffff80664121 <+33>:    jle    0xffffffff80664140 <m_apply+64>
   0xffffffff80664123 <+35>:    mov    %esi,%eax
   0xffffffff80664125 <+37>:    cs nopw 0x0(%rax,%rax,1)
   0xffffffff8066412f <+47>:    nop
   0xffffffff80664130 <+48>:    sub    0x18(%rbx),%eax
   0xffffffff80664133 <+51>:    jl     0xffffffff80664149 <m_apply+73>
   0xffffffff80664135 <+53>:    mov    (%rbx),%rbx
   0xffffffff80664138 <+56>:    mov    %eax,%esi
   0xffffffff8066413a <+58>:    test   %eax,%eax
   0xffffffff8066413c <+60>:    jg     0xffffffff80664130 <m_apply+48>
   0xffffffff8066413e <+62>:    jmp    0xffffffff80664142 <m_apply+66>
   0xffffffff80664140 <+64>:    mov    %esi,%eax
   0xffffffff80664142 <+66>:    test   %r12d,%r12d
   0xffffffff80664145 <+69>:    jg     0xffffffff80664190 <m_apply+144>
   0xffffffff80664147 <+71>:    jmp    0xffffffff80664150 <m_apply+80>
   0xffffffff80664149 <+73>:    mov    %esi,%eax
   0xffffffff8066414b <+75>:    test   %r12d,%r12d
   0xffffffff8066414e <+78>:    jg     0xffffffff80664190 <m_apply+144>
   0xffffffff80664150 <+80>:    xor    %eax,%eax
   0xffffffff80664152 <+82>:    add    $0x18,%rsp
   0xffffffff80664156 <+86>:    pop    %rbx
   0xffffffff80664157 <+87>:    pop    %r12
   0xffffffff80664159 <+89>:    pop    %r13
   0xffffffff8066415b <+91>:    pop    %r14
   0xffffffff8066415d <+93>:    pop    %r15
   0xffffffff8066415f <+95>:    pop    %rbp
   0xffffffff80664160 <+96>:    ret
   0xffffffff80664161 <+97>:    cs nopw 0x0(%rax,%rax,1)
   0xffffffff8066416b <+107>:   nopl   0x0(%rax,%rax,1)
   0xffffffff80664170 <+112>:   cltq
   0xffffffff80664172 <+114>:   add    %rax,%rsi
   0xffffffff80664175 <+117>:   mov    -0x38(%rbp),%rdi
   0xffffffff80664179 <+121>:   mov    %r14d,%edx
   0xffffffff8066417c <+124>:   call   *-0x30(%rbp)
   0xffffffff8066417f <+127>:   test   %eax,%eax
   0xffffffff80664181 <+129>:   jne    0xffffffff80664152 <m_apply+82>
   0xffffffff80664183 <+131>:   sub    %r14d,%r12d
   0xffffffff80664186 <+134>:   mov    (%rbx),%rbx
   0xffffffff80664189 <+137>:   xor    %eax,%eax
   0xffffffff8066418b <+139>:   test   %r12d,%r12d
   0xffffffff8066418e <+142>:   jle    0xffffffff80664152 <m_apply+82>
   0xffffffff80664190 <+144>:   mov    0x18(%rbx),%r14d
                                ^^^^^^^^^^^^^^ <- faulting instruction (crash IP): loads m->m_len, which is at offset 0x18
End of assembler dump.

Right, as I said earlier, what's probably happening is that the load of m->m_len (m_len has offset 0x18) is triggering a NULL pointer dereference. So most likely the mbuf chain length (m->m_pkthdr.len) is longer than the sum of the lengths of the mbufs in the chain. That is, the chain is corrupted somehow.

To debug, you might try adding a macro which verifies the chain length and placing it at various strategic points, e.g., at entry into netgraph.
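
For illustration, a minimal sketch of what such a check could look like under INVARIANTS; the macro name and its message are made up for this example and are not an existing kernel API:

#ifdef INVARIANTS
/*
 * Hypothetical debugging aid: walk the chain and verify that the sum of
 * m_len over all mbufs matches the packet header length.  Must be used
 * on the first mbuf of a packet (M_PKTHDR set).
 */
#define	M_CHAIN_LEN_CHECK(m0) do {					\
	const struct mbuf *_m;						\
	int _len = 0;							\
									\
	M_ASSERTPKTHDR(m0);						\
	for (_m = (m0); _m != NULL; _m = _m->m_next)			\
		_len += _m->m_len;					\
	KASSERT(_len == (m0)->m_pkthdr.len,				\
	    ("%s: mbuf chain length %d != m_pkthdr.len %d",		\
	    __func__, _len, (m0)->m_pkthdr.len));			\
} while (0)
#else
#define	M_CHAIN_LEN_CHECK(m0) do { } while (0)
#endif

Sprinkled at the suspected spots (e.g. ng_ether_input(), ng_nat_rcvdata(), or after LRO flushing), the first place where it fires would help bracket where the chain gets corrupted.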

sys/netgraph/ng_nat.c:773 (On Diff #133955)

@markj, it looks like the mbuf is contiguous, not chained.

But we still need a printout of the mbuf, and preferably the eth + ip + tcp header data.

Looks like I'm able to turn on INVARIANTS just in that specific TU; let's deploy it into production and see if we can narrow it down:

diff --git a/sys/kern/uipc_mbuf.c b/sys/kern/uipc_mbuf.c
index a26cf036b012..6de762fbd07a 100644
--- a/sys/kern/uipc_mbuf.c
+++ b/sys/kern/uipc_mbuf.c
@@ -31,6 +31,9 @@
  *     @(#)uipc_mbuf.c 8.2 (Berkeley) 1/4/94
  */

+#define INVARIANTS
+#define INVARIANT_SUPPORT
+
 #include <sys/cdefs.h>
 __FBSDID("$FreeBSD: 13c98d0895faaec787787f241d5a79bfdccce098 $");

You are able of course, but that isn't going to work.

You need to enable INVARIANTS in the kernel config. If you anticipate that the performance impact would be too big to run production normally, try setting vm.debug.divisor=10 in your loader.conf. That will reduce the impact of memory trashing/checking in UMA by a factor of 10; this trashing is the biggest CPU hog enabled by INVARIANTS. With this tunable we are running production at up to 100 Gbit/s.
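
For reference, roughly what those two knobs look like (MYKERNEL is a placeholder kernel config name):

# custom kernel configuration, e.g. sys/amd64/conf/MYKERNEL:
# enable invariant checking kernel-wide
options 	INVARIANTS
options 	INVARIANT_SUPPORT

# /boot/loader.conf: reduce the cost of UMA trashing/checking by a factor of 10
vm.debug.divisor="10"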

sobomax retitled this revision from "Fix potential crash handling unmapped mbufs in ng_nat(4)." to "Fix crash handling updating checksums on mbufs in ng_nat(4)". Feb 9 2024, 5:18 PM

You are able of course, but that isn't going to work.

Why? I checked the generated & linked kernel code; it now properly has a bunch of panic() calls attached to the end of m_apply().

You need to enable INVARIANTS in the kernel config. If you anticipate that the performance impact would be too big to run production normally, try setting vm.debug.divisor=10 in your loader.conf. That will reduce the impact of memory trashing/checking in UMA by a factor of 10; this trashing is the biggest CPU hog enabled by INVARIANTS. With this tunable we are running production at up to 100 Gbit/s.

Possibly, but those instances are sized for the task, and some of them are busier than others, shuffling 80-90 kPPS between kernel and userland and back.

@markj @glebius Looking at the crash pattern, I suppose disabling LRO would probably work around it, correct? We have other issues with LRO in this setup, so maybe this is the easiest way out for now.

Maybe? I don't know anything about your network configuration. I don't see anything in the backtraces which specifically implicates LRO.

Why? I checked the generated & linked kernel code; it now properly has a bunch of panic() calls attached to the end of m_apply().

It may work if you are lucky, that is, if this compilation unit checks only INVARIANTS that are imposed by the compilation unit itself and nothing else. And by "work" I mean not producing false-positive panics. If we talk about helping to find the root cause of the problem, I doubt it will work; very likely the root cause is outside this compilation unit.

Maybe? I don't know anything about your network configuration. I don't see anything in the backtraces which specifically implicates LRO.

For what it's worth, in two of the recent crashes I see tcp_lro_flush() at the top of the call stack:

#14 0xffffffff806e85c9 at ether_input+0x39
#15 0xffffffff80740354 at tcp_lro_flush+0x2f4
#16 0xffffffff8074065b at tcp_lro_flush_all+0x17b
#17 0xffffffff80421078 at ena_cleanup+0x6f8
Uptime: 41d15h50m4s

JFYI, we use ng_nat to give the EC2 instance a local "virtual" IP matching the IP assigned by EC2 at the DMZ router, so we don't have to expose that internal IP to the various world-facing subsystems. Like so:

[IF_ENA]->{actual: 10.0.0.x}->[NG_NAT]->{virtual: 1.2.3.4}->(APP)

setupshadow() {
  int="${1}"
  pubip="${2}"
  prvip="${3}"

  # ensure ng_ether(4) is loaded so the interface exposes netgraph hooks
  kldload -n ng_ether
  # tear down any NAT node left over from a previous run
  ngctl shutdown ${int}_NAT: || true
  # create an ng_nat node on the interface's "lower" hook and name it
  ngctl mkpeer ${int}: nat lower in
  ngctl name ${int}:lower ${int}_NAT
  # set the expected data link type
  ngctl msg ${int}_NAT: setdlt 1
  # configure the addresses used for the translation
  ngctl msg ${int}_NAT: setaliasaddr ${prvip}
  ngctl msg ${int}_NAT: settarget ${pubip}
  # set NAT mode flags
  ngctl msg ${int}_NAT: setmode {flags=0x04 mask=0x04}
  # connect the interface's "upper" hook to the NAT node's "out" hook
  ngctl connect ${int}: ${int}_NAT: upper out
  # (re)set the interface MTU
  ifconfig ${int} mtu 1500
}

@glebius duly noted. ๐Ÿ™‰๐Ÿง

I'd start with enabling crashdumps perhaps; that should give some food for thought, and we can then poke into bits and structs if we are lucky.
P.S. There are 37 KASSERTs in this TU alone.
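
(For reference, a minimal sketch of the standard knobs for that, using the default paths:)

# /etc/rc.conf: reserve dump space on the swap device and save cores at boot
dumpdev="AUTO"
dumpdir="/var/crash"

# after the next panic, inspect the saved core with, e.g.:
# kgdb /boot/kernel/kernel /var/crash/vmcore.<N>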

It may work if you are lucky, that is, if this compilation unit checks only INVARIANTS that are imposed by the compilation unit itself and nothing else. And by "work" I mean not producing false-positive panics. If we talk about helping to find the root cause of the problem, I doubt it will work; very likely the root cause is outside this compilation unit.

@markj @glebius Yes, this patch does not help. :( We are still seeing occasional panics, interestingly enough, most of the time when the configuration is being established. The problem also sometimes persists across multiple reboots:

lo1: changing name to 'ena0_shadow'
ena0_shadow: link state changed to UP
panic: m_apply, offset > size of mbuf chain
cpuid = 6
time = 1725536036
KDB: stack backtrace:
#0 0xffffffff806226d5 at kdb_backtrace+0x65
#1 0xffffffff805d8fbf at vpanic+0x17f
#2 0xffffffff805d8e33 at panic+0x43
#3 0xffffffff8066745e at m_apply+0x24e
#4 0xffffffff8071f99f at in_cksum_skip+0x2f
#5 0xffffffff80731e1d at in_delayed_cksum+0x3d
#6 0xffffffff80711ebf at ng_nat_rcvdata+0x32f
#7 0xffffffff8070d2df at ng_apply_item+0x2bf
#8 0xffffffff8070cd9e at ng_snd_item+0x28e
#9 0xffffffff80710a4c at ng_ether_input+0x4c
#10 0xffffffff806ec6a9 at ether_nh_input+0x1c9
#11 0xffffffff806f8b18 at netisr_dispatch_src+0x88
#12 0xffffffff806eb959 at ether_input+0x39
#13 0xffffffff80423f1d at ena_cleanup+0x59d
#14 0xffffffff80636cb1 at taskqueue_run_locked+0x181
#15 0xffffffff80637fc2 at taskqueue_thread_loop+0xc2
#16 0xffffffff805981de at fork_exit+0x7e
#17 0xffffffff808fb28e at fork_trampoline+0xe
Uptime: 48s
lo1: changing name to 'ena0_shadow'
ena0_shadow: link state changed to UP
panic: m_apply, offset > size of mbuf chain
cpuid = 0
time = 1725536131
KDB: stack backtrace:
#0 0xffffffff806226d5 at kdb_backtrace+0x65
#1 0xffffffff805d8fbf at vpanic+0x17f
#2 0xffffffff805d8e33 at panic+0x43
#3 0xffffffff8066745e at m_apply+0x24e
#4 0xffffffff8071f99f at in_cksum_skip+0x2f
#5 0xffffffff80731e1d at in_delayed_cksum+0x3d
#6 0xffffffff80711ebf at ng_nat_rcvdata+0x32f
#7 0xffffffff8070d2df at ng_apply_item+0x2bf
#8 0xffffffff8070cd9e at ng_snd_item+0x28e
#9 0xffffffff80710a4c at ng_ether_input+0x4c
#10 0xffffffff806ec6a9 at ether_nh_input+0x1c9
#11 0xffffffff806f8b18 at netisr_dispatch_src+0x88
#12 0xffffffff806eb959 at ether_input+0x39
#13 0xffffffff80423f1d at ena_cleanup+0x59d
#14 0xffffffff80636cb1 at taskqueue_run_locked+0x181
#15 0xffffffff80637fc2 at taskqueue_thread_loop+0xc2
#16 0xffffffff805981de at fork_exit+0x7e
#17 0xffffffff808fb28e at fork_trampoline+0xe
Uptime: 50s
lo1: changing name to 'ena0_shadow'
ena0_shadow: link state changed to UP
panic: m_apply, offset > size of mbuf chain
cpuid = 2
time = 1725536196
KDB: stack backtrace:
[...]
Uptime: 21s
ena0_shadow: link state changed to UP
panic: m_apply, offset > size of mbuf chain
cpuid = 0
time = 1725536338
KDB: stack backtrace:
[...]
Uptime: 1m38s
Rebooting...
ena0_shadow: link state changed to UP
panic: m_apply, offset > size of mbuf chain
cpuid = 6
time = 1725536448
KDB: stack backtrace:
[...]
Uptime: 1m7s
Rebooting...

This, however, is not 100% guaranteed, as it also happens sporadically.

I've added some debug around those lines and it looks like this is a VERY rare event; I just cannot hit that condition intentionally. After doing some digging, it looks like th_x2 is only set in a few places in libalias, mostly in handlers for some rare/obsolete protocols (and the proxy feature).

[~/projects/freebsd13]$ find sys/ -name \*.c | xargs grep 'th_x2 = [^0]'
sys/netinet/libalias/alias_ftp.c:               tc->th_x2 = 1;
sys/netinet/libalias/alias_irc.c:               tc->th_x2 = 1;
sys/netinet/libalias/alias_proxy.c:     tc->th_x2 = 1;
sys/netinet/libalias/alias_skinny.c:    tc->th_x2 = 1;
sys/netinet/libalias/alias_skinny.c:    tc->th_x2 = 1;
sys/netinet/libalias/alias_skinny.c:    tc->th_x2 = 1;
sys/netinet/libalias/alias_smedia.c:    tc->th_x2 = 1;
sys/netinet/libalias/alias_smedia.c:                            tc->th_x2 = 1;

I have a strong suspicion that we might be hitting some bug in the handlers for those obsolete protocols, corrupting something and triggering the panic. The fact that we run it in a reverse mode (i.e. with the "LAN" facing the open internet), plus the observation that it happens in "batches" right after the interface is enabled and persists across several reboots, might indicate some form of junk traffic triggering an otherwise dormant bug. For the time being I am disabling all those handlers in our build, and we will see if it helps.

sobomax retitled this revision from "Fix crash handling updating checksums on mbufs in ng_nat(4)." to "Disambiguate KASSERTs in the m_apply() mbuf checking code and provide more context".
sobomax edited the summary of this revision. (Show Details)
This revision is now accepted and ready to land. Sep 9 2024, 8:36 PM