While debugging some rare crashes in the ng_nat(4) code, which look like an mbuf being corrupted in libalias, it was noticed that two distinct KASSERTs in the m_apply() function have exactly the same panic message, making them impossible to tell apart after the fact.
Diff Detail
- Repository: rG FreeBSD src repository
- Lint: Not Applicable
- Unit Tests: Not Applicable

Event Timeline
Max, you have a commit bit, so please go ahead with pushing unless some other reviewer objects in the next two days.
This shouldn't be necessary since commit 0d9c3423f59bb. Most mb_unmapped_to_ext() calls have been removed from the tree. Which version are you running?
@markj we've seen those crashes on releng/13.1, which has 0d9c3423f59bb merged.
Nov 8 13:49:23 ip-10-11-227-254 syslogd: last message repeated 1 times
Nov 8 13:49:23 ip-10-11-227-254 kernel: Fatal trap 12: page fault while in kernel mode
Nov 8 13:49:23 ip-10-11-227-254 kernel: cpuid = 2; apic id = 02
Nov 8 13:49:23 ip-10-11-227-254 kernel: fault virtual address = 0x18
Nov 8 13:49:23 ip-10-11-227-254 kernel: fault code = supervisor read data, page not present
Nov 8 13:49:23 ip-10-11-227-254 kernel: instruction pointer = 0x20:0xffffffff80664190
Nov 8 13:49:23 ip-10-11-227-254 kernel: stack pointer = 0x28:0xfffffe00ba96ca60
Nov 8 13:49:23 ip-10-11-227-254 kernel: frame pointer = 0x28:0xfffffe00ba96caa0
Nov 8 13:49:23 ip-10-11-227-254 kernel: code segment = base 0x0, limit 0xfffff, type 0x1b
Nov 8 13:49:23 ip-10-11-227-254 kernel: = DPL 0, pres 1, long 1, def32 0, gran 1
Nov 8 13:49:23 ip-10-11-227-254 kernel: processor eflags = interrupt enabled, resume, IOPL = 0
Nov 8 13:49:23 ip-10-11-227-254 kernel: current process = 0 (ena0 queue 1 cleanu)
Nov 8 13:49:23 ip-10-11-227-254 kernel: trap number = 12
Nov 8 13:49:23 ip-10-11-227-254 kernel: panic: page fault
Nov 8 13:49:23 ip-10-11-227-254 kernel: cpuid = 2
Nov 8 13:49:23 ip-10-11-227-254 kernel: time = 1699480104
Nov 8 13:49:23 ip-10-11-227-254 kernel: KDB: stack backtrace:
Nov 8 13:49:23 ip-10-11-227-254 kernel: #0 0xffffffff8061f6b5 at kdb_backtrace+0x65
Nov 8 13:49:23 ip-10-11-227-254 kernel: #1 0xffffffff805d5f9f at vpanic+0x17f
Nov 8 13:49:23 ip-10-11-227-254 kernel: #2 0xffffffff805d5e13 at panic+0x43
Nov 8 13:49:23 ip-10-11-227-254 kernel: #3 0xffffffff808fbc35 at trap_fatal+0x385
Nov 8 13:49:23 ip-10-11-227-254 kernel: #4 0xffffffff808fbc8f at trap_pfault+0x4f
Nov 8 13:49:23 ip-10-11-227-254 kernel: #5 0xffffffff808d2f78 at calltrap+0x8
Nov 8 13:49:23 ip-10-11-227-254 kernel: #6 0xffffffff8071c60f at in_cksum_skip+0x2f
Nov 8 13:49:23 ip-10-11-227-254 kernel: #7 0xffffffff8072ea8d at in_delayed_cksum+0x3d
Nov 8 13:49:23 ip-10-11-227-254 kernel: #8 0xffffffff8070eb2f at ng_nat_rcvdata+0x32f
Nov 8 13:49:23 ip-10-11-227-254 kernel: #9 0xffffffff80709f4f at ng_apply_item+0x2bf
Nov 8 13:49:23 ip-10-11-227-254 kernel: #10 0xffffffff80709a0e at ng_snd_item+0x28e
Nov 8 13:49:23 ip-10-11-227-254 kernel: #11 0xffffffff8070d6bc at ng_ether_input+0x4c
Nov 8 13:49:23 ip-10-11-227-254 kernel: #12 0xffffffff806e9319 at ether_nh_input+0x1c9
Nov 8 13:49:23 ip-10-11-227-254 kernel: #13 0xffffffff806f5788 at netisr_dispatch_src+0x88
Nov 8 13:49:23 ip-10-11-227-254 kernel: #14 0xffffffff806e85c9 at ether_input+0x39
Nov 8 13:49:23 ip-10-11-227-254 kernel: #15 0xffffffff80420f1d at ena_cleanup+0x59d
Nov 8 13:49:23 ip-10-11-227-254 kernel: #16 0xffffffff80633c91 at taskqueue_run_locked+0x181
Nov 8 13:49:23 ip-10-11-227-254 kernel: #17 0xffffffff80634fa2 at taskqueue_thread_loop+0xc2
Nov 8 13:49:23 ip-10-11-227-254 kernel: Uptime: 19d9h27m23s
Nov 8 13:49:23 ip-10-11-227-254 kernel: Rebooting...
Based on the fault address (0x18), I suspect that the crash happens in m_apply() when loading m->m_len, i.e., the mbuf chain is shorter than expected. So I suspect the bug is elsewhere. It's hard to say without INVARIANTS or a kernel dump.
Looks like some other issue is at play; I am seeing these crashes even with my patch applied.
Jan 20 23:06:03 ip-10-11-227-254 syslogd: last message repeated 1 times
Jan 20 23:06:03 ip-10-11-227-254 kernel: Fatal trap 12: page fault while in kernel mode
Jan 20 23:06:03 ip-10-11-227-254 kernel: cpuid = 2; apic id = 02
Jan 20 23:06:03 ip-10-11-227-254 kernel: fault virtual address = 0x18
Jan 20 23:06:03 ip-10-11-227-254 kernel: fault code = supervisor read data, page not present
Jan 20 23:06:03 ip-10-11-227-254 kernel: instruction pointer = 0x20:0xffffffff80664190
Jan 20 23:06:03 ip-10-11-227-254 kernel: stack pointer = 0x28:0xfffffe00bba099f0
Jan 20 23:06:03 ip-10-11-227-254 kernel: frame pointer = 0x28:0xfffffe00bba09a30
Jan 20 23:06:03 ip-10-11-227-254 kernel: code segment = base 0x0, limit 0xfffff, type 0x1b
Jan 20 23:06:03 ip-10-11-227-254 kernel: = DPL 0, pres 1, long 1, def32 0, gran 1
Jan 20 23:06:03 ip-10-11-227-254 kernel: processor eflags = interrupt enabled, resume, IOPL = 0
Jan 20 23:06:03 ip-10-11-227-254 kernel: current process = 0 (ena0 queue 3 cleanu)
Jan 20 23:06:03 ip-10-11-227-254 kernel: trap number = 12
Jan 20 23:06:03 ip-10-11-227-254 kernel: panic: page fault
Jan 20 23:06:03 ip-10-11-227-254 kernel: cpuid = 2
Jan 20 23:06:03 ip-10-11-227-254 kernel: time = 1705820706
Jan 20 23:06:03 ip-10-11-227-254 kernel: KDB: stack backtrace:
Jan 20 23:06:03 ip-10-11-227-254 kernel: #0 0xffffffff8061f6b5 at kdb_backtrace+0x65
Jan 20 23:06:03 ip-10-11-227-254 kernel: #1 0xffffffff805d5f9f at vpanic+0x17f
Jan 20 23:06:03 ip-10-11-227-254 kernel: #2 0xffffffff805d5e13 at panic+0x43
Jan 20 23:06:03 ip-10-11-227-254 kernel: #3 0xffffffff808fbc35 at trap_fatal+0x385
Jan 20 23:06:03 ip-10-11-227-254 kernel: #4 0xffffffff808fbc8f at trap_pfault+0x4f
Jan 20 23:06:03 ip-10-11-227-254 kernel: #5 0xffffffff808d2fc8 at calltrap+0x8
Jan 20 23:06:03 ip-10-11-227-254 kernel: #6 0xffffffff8071c61f at in_cksum_skip+0x2f
Jan 20 23:06:03 ip-10-11-227-254 kernel: #7 0xffffffff8072ea9d at in_delayed_cksum+0x3d
Jan 20 23:06:03 ip-10-11-227-254 kernel: #8 0xffffffff8070eb3b at ng_nat_rcvdata+0x33b
Jan 20 23:06:03 ip-10-11-227-254 kernel: #9 0xffffffff80709f4f at ng_apply_item+0x2bf
Jan 20 23:06:03 ip-10-11-227-254 kernel: #10 0xffffffff80709a0e at ng_snd_item+0x28e
Jan 20 23:06:03 ip-10-11-227-254 kernel: #11 0xffffffff8070d6bc at ng_ether_input+0x4c
Jan 20 23:06:03 ip-10-11-227-254 kernel: #12 0xffffffff806e9319 at ether_nh_input+0x1c9
Jan 20 23:06:03 ip-10-11-227-254 kernel: #13 0xffffffff806f5788 at netisr_dispatch_src+0x88
Jan 20 23:06:03 ip-10-11-227-254 kernel: #14 0xffffffff806e85c9 at ether_input+0x39
Jan 20 23:06:03 ip-10-11-227-254 kernel: #15 0xffffffff80740354 at tcp_lro_flush+0x2f4
Jan 20 23:06:03 ip-10-11-227-254 kernel: #16 0xffffffff8074065b at tcp_lro_flush_all+0x17b
Jan 20 23:06:03 ip-10-11-227-254 kernel: #17 0xffffffff80421078 at ena_cleanup+0x6f8
Jan 20 23:06:03 ip-10-11-227-254 kernel: Uptime: 5d5h55m55s
Jan 20 23:06:03 ip-10-11-227-254 kernel: Rebooting...
Jan 20 23:06:03 ip-10-11-227-254 kernel: cpu_reset: Restarting BSP
Jan 20 23:06:03 ip-10-11-227-254 kernel: cpu_reset_proxy: Stopped CPU 2
Not offhand. You'd need to get a kernel dump and look at the mbuf chain, and/or try an INVARIANTS kernel.
The crash trace seems to be misleading: based on the crash IP (0xffffffff80664190), the crash is inside m_apply():
Dump of assembler code for function m_apply:
0xffffffff80664100 <+0>: push %rbp
0xffffffff80664101 <+1>: mov %rsp,%rbp
0xffffffff80664104 <+4>: push %r15
0xffffffff80664106 <+6>: push %r14
0xffffffff80664108 <+8>: push %r13
0xffffffff8066410a <+10>: push %r12
0xffffffff8066410c <+12>: push %rbx
0xffffffff8066410d <+13>: sub $0x18,%rsp
0xffffffff80664111 <+17>: mov %r8,-0x38(%rbp)
0xffffffff80664115 <+21>: mov %rcx,-0x30(%rbp)
0xffffffff80664119 <+25>: mov %edx,%r12d
0xffffffff8066411c <+28>: mov %rdi,%rbx
0xffffffff8066411f <+31>: test %esi,%esi
0xffffffff80664121 <+33>: jle 0xffffffff80664140 <m_apply+64>
0xffffffff80664123 <+35>: mov %esi,%eax
0xffffffff80664125 <+37>: cs nopw 0x0(%rax,%rax,1)
0xffffffff8066412f <+47>: nop
0xffffffff80664130 <+48>: sub 0x18(%rbx),%eax
0xffffffff80664133 <+51>: jl 0xffffffff80664149 <m_apply+73>
0xffffffff80664135 <+53>: mov (%rbx),%rbx
0xffffffff80664138 <+56>: mov %eax,%esi
0xffffffff8066413a <+58>: test %eax,%eax
0xffffffff8066413c <+60>: jg 0xffffffff80664130 <m_apply+48>
0xffffffff8066413e <+62>: jmp 0xffffffff80664142 <m_apply+66>
0xffffffff80664140 <+64>: mov %esi,%eax
0xffffffff80664142 <+66>: test %r12d,%r12d
0xffffffff80664145 <+69>: jg 0xffffffff80664190 <m_apply+144>
0xffffffff80664147 <+71>: jmp 0xffffffff80664150 <m_apply+80>
0xffffffff80664149 <+73>: mov %esi,%eax
0xffffffff8066414b <+75>: test %r12d,%r12d
0xffffffff8066414e <+78>: jg 0xffffffff80664190 <m_apply+144>
0xffffffff80664150 <+80>: xor %eax,%eax
0xffffffff80664152 <+82>: add $0x18,%rsp
0xffffffff80664156 <+86>: pop %rbx
0xffffffff80664157 <+87>: pop %r12
0xffffffff80664159 <+89>: pop %r13
0xffffffff8066415b <+91>: pop %r14
0xffffffff8066415d <+93>: pop %r15
0xffffffff8066415f <+95>: pop %rbp
0xffffffff80664160 <+96>: ret
0xffffffff80664161 <+97>: cs nopw 0x0(%rax,%rax,1)
0xffffffff8066416b <+107>: nopl 0x0(%rax,%rax,1)
0xffffffff80664170 <+112>: cltq
0xffffffff80664172 <+114>: add %rax,%rsi
0xffffffff80664175 <+117>: mov -0x38(%rbp),%rdi
0xffffffff80664179 <+121>: mov %r14d,%edx
0xffffffff8066417c <+124>: call *-0x30(%rbp)
0xffffffff8066417f <+127>: test %eax,%eax
0xffffffff80664181 <+129>: jne 0xffffffff80664152 <m_apply+82>
0xffffffff80664183 <+131>: sub %r14d,%r12d
0xffffffff80664186 <+134>: mov (%rbx),%rbx
0xffffffff80664189 <+137>: xor %eax,%eax
0xffffffff8066418b <+139>: test %r12d,%r12d
0xffffffff8066418e <+142>: jle 0xffffffff80664152 <m_apply+82>
0xffffffff80664190 <+144>: mov 0x18(%rbx),%r14d
                           ^^^^^^^^^^^^^^
End of assembler dump.
Right, as I said earlier, what's probably happening is that the load of m->m_len (m_len has offset 0x18) is triggering a NULL pointer dereference. So most likely the mbuf chain length (m->m_pkthdr.len) is longer than the sum of the lengths of the mbufs in the chain. That is, the chain is corrupted somehow.
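For reference, here is a paraphrase of the relevant part of m_apply() (trimmed and annotated; not a verbatim copy of the releng/13.1 source), showing the two KASSERTs that share one message and the m->m_len load that faults when m is NULL:

int
m_apply(struct mbuf *m, int off, int len,
    int (*f)(void *, void *, u_int), void *arg)
{
	u_int count;
	int rval;

	KASSERT(off >= 0, ("m_apply, negative off %d", off));
	KASSERT(len >= 0, ("m_apply, negative len %d", len));
	while (off > 0) {
		/*
		 * Without INVARIANTS this KASSERT compiles away; if the chain
		 * is shorter than 'off', m becomes NULL and the m->m_len load
		 * below faults at virtual address 0x18 (the offset of m_len).
		 */
		KASSERT(m != NULL, ("m_apply, offset > size of mbuf chain"));
		if (off < m->m_len)
			break;
		off -= m->m_len;
		m = m->m_next;
	}
	while (len > 0) {
		/* Second KASSERT with the identical message. */
		KASSERT(m != NULL, ("m_apply, offset > size of mbuf chain"));
		count = min(m->m_len - off, len);
		rval = (*f)(arg, mtod(m, caddr_t) + off, count);
		if (rval)
			return (rval);
		len -= count;
		off = 0;
		m = m->m_next;
	}
	return (0);
}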
To debug, you might try adding a macro which verifies the chain length and placing it at various strategic points, e.g., at entry into netgraph.
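A minimal sketch of such a macro, assuming an INVARIANTS build and access to sys/mbuf.h; the name MBUF_CHAIN_CHECK and the exact checks are illustrative only, nothing that exists in the tree:

#define	MBUF_CHAIN_CHECK(m) do {					\
	struct mbuf *_m = (m);						\
	struct mbuf *_n;						\
	int _total = 0;							\
									\
	/* Only packet-header mbufs carry m_pkthdr.len. */		\
	KASSERT((_m->m_flags & M_PKTHDR) != 0,				\
	    ("%s: mbuf %p has no pkthdr", __func__, _m));		\
	/* Walk the chain and sum up the data in each mbuf. */		\
	for (_n = _m; _n != NULL; _n = _n->m_next)			\
		_total += _n->m_len;					\
	/* Panic if the chain does not cover the claimed length. */	\
	KASSERT(_total >= _m->m_pkthdr.len,				\
	    ("%s: mbuf chain %p too short: %d < %d", __func__, _m,	\
	    _total, _m->m_pkthdr.len));					\
} while (0)

It could then be sprinkled, for example, at the top of ng_ether_input() and ng_nat_rcvdata() to catch the short chain before it reaches in_delayed_cksum().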
Looks like I'm able to turn INVARIANTS on just in that specific TU; let's deploy it into production and see if we can narrow it down:
diff --git a/sys/kern/uipc_mbuf.c b/sys/kern/uipc_mbuf.c
index a26cf036b012..6de762fbd07a 100644
--- a/sys/kern/uipc_mbuf.c
+++ b/sys/kern/uipc_mbuf.c
@@ -31,6 +31,9 @@
  * @(#)uipc_mbuf.c 8.2 (Berkeley) 1/4/94
  */
 
+#define INVARIANTS
+#define INVARIANT_SUPPORT
+
 #include <sys/cdefs.h>
 __FBSDID("$FreeBSD: 13c98d0895faaec787787f241d5a79bfdccce098 $");
You are able to, of course, but that isn't going to work.
You need to enable INVARIANTS in the kernel config. If you anticipate that the performance impact would be too big to run production normally, try setting vm.debug.divisor=10 in your loader.conf. That will reduce the impact of memory trashing/checking in UMA by a factor of 10; this trashing is the biggest CPU hog enabled by INVARIANTS. With this tunable we run production with up to 100 Gbit/s.
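For context, the tunable mentioned above would go into loader.conf roughly like this (the value 10 is just the example from the comment above):

# /boot/loader.conf
vm.debug.divisor="10"	# reduce UMA trash/check work by a factor of 10 (see comment above)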
Why? I checked the generated and linked kernel code; it now properly has a bunch of panic() calls attached to the end of m_apply().
Possibly, but those are instances sized up for the task, and some of them are busier than others, shuffling 80-90 kPPS between kernel and userland and back.
Maybe? I don't know anything about your network configuration. I don't see anything in the backtraces which specifically implicates LRO.
It may work if you are lucky, that is, if this compilation unit checks only invariants imposed by the compilation unit itself and nothing else. By "work" I mean not producing false-positive panics. As for helping to find the root cause of the problem, I doubt it will: very likely the root cause is outside this compilation unit.
For what it's worth, in two of the recent crashes I see tcp_lro_flush() near the top of the call stack:
#14 0xffffffff806e85c9 at ether_input+0x39
#15 0xffffffff80740354 at tcp_lro_flush+0x2f4
#16 0xffffffff8074065b at tcp_lro_flush_all+0x17b
#17 0xffffffff80421078 at ena_cleanup+0x6f8
Uptime: 41d15h50m4s
JFYI, we use ng_nat to give an EC2 instance a local "virtual" IP matching the IP assigned by EC2 at the DMZ router, so we don't have to expose that internal IP to the various world-facing subsystems. Like so:
[IF_ENA]->{actual: 10.0.0.x}->[NG_NAT]->{virtual: 1.2.3.4}->(APP)
setupshadow() {
	int="${1}"
	pubip="${2}"
	prvip="${3}"
	kldload -n ng_ether
	ngctl shutdown ${int}_NAT: || true
	ngctl mkpeer ${int}: nat lower in
	ngctl name ${int}:lower ${int}_NAT
	ngctl msg ${int}_NAT: setdlt 1
	ngctl msg ${int}_NAT: setaliasaddr ${prvip}
	ngctl msg ${int}_NAT: settarget ${pubip}
	ngctl msg ${int}_NAT: setmode {flags=0x04 mask=0x04}
	ngctl connect ${int}: ${int}_NAT: upper out
	ifconfig ${int} mtu 1500
}
@glebius duly noted.
I'd start with enabling crash dumps, perhaps; that should give some food for thought, and we can then poke into bits and structs if we are lucky.
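For reference, the standard FreeBSD knobs for this live in rc.conf; the values below are the usual defaults, shown only as an illustration:

# /etc/rc.conf
dumpdev="AUTO"		# dump to the configured swap device on panic
dumpdir="/var/crash"	# where savecore(8) stores the dump at boot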
P.S. There are 37 KASSERTs in this TU alone.
@markj @glebius Yes, this patch does not help. :( We are still seeing occasional panics, interestingly enough most of the time while the configuration is being established. The problem also sometimes persists across multiple reboots:
lo1: changing name to 'ena0_shadow'
ena0_shadow: link state changed to UP
panic: m_apply, offset > size of mbuf chain
cpuid = 6
time = 1725536036
KDB: stack backtrace:
#0 0xffffffff806226d5 at kdb_backtrace+0x65
#1 0xffffffff805d8fbf at vpanic+0x17f
#2 0xffffffff805d8e33 at panic+0x43
#3 0xffffffff8066745e at m_apply+0x24e
#4 0xffffffff8071f99f at in_cksum_skip+0x2f
#5 0xffffffff80731e1d at in_delayed_cksum+0x3d
#6 0xffffffff80711ebf at ng_nat_rcvdata+0x32f
#7 0xffffffff8070d2df at ng_apply_item+0x2bf
#8 0xffffffff8070cd9e at ng_snd_item+0x28e
#9 0xffffffff80710a4c at ng_ether_input+0x4c
#10 0xffffffff806ec6a9 at ether_nh_input+0x1c9
#11 0xffffffff806f8b18 at netisr_dispatch_src+0x88
#12 0xffffffff806eb959 at ether_input+0x39
#13 0xffffffff80423f1d at ena_cleanup+0x59d
#14 0xffffffff80636cb1 at taskqueue_run_locked+0x181
#15 0xffffffff80637fc2 at taskqueue_thread_loop+0xc2
#16 0xffffffff805981de at fork_exit+0x7e
#17 0xffffffff808fb28e at fork_trampoline+0xe
Uptime: 48s
lo1: changing name to 'ena0_shadow'
ena0_shadow: link state changed to UP
panic: m_apply, offset > size of mbuf chain
cpuid = 0
time = 1725536131
KDB: stack backtrace:
#0 0xffffffff806226d5 at kdb_backtrace+0x65
#1 0xffffffff805d8fbf at vpanic+0x17f
#2 0xffffffff805d8e33 at panic+0x43
#3 0xffffffff8066745e at m_apply+0x24e
#4 0xffffffff8071f99f at in_cksum_skip+0x2f
#5 0xffffffff80731e1d at in_delayed_cksum+0x3d
#6 0xffffffff80711ebf at ng_nat_rcvdata+0x32f
#7 0xffffffff8070d2df at ng_apply_item+0x2bf
#8 0xffffffff8070cd9e at ng_snd_item+0x28e
#9 0xffffffff80710a4c at ng_ether_input+0x4c
#10 0xffffffff806ec6a9 at ether_nh_input+0x1c9
#11 0xffffffff806f8b18 at netisr_dispatch_src+0x88
#12 0xffffffff806eb959 at ether_input+0x39
#13 0xffffffff80423f1d at ena_cleanup+0x59d
#14 0xffffffff80636cb1 at taskqueue_run_locked+0x181
#15 0xffffffff80637fc2 at taskqueue_thread_loop+0xc2
#16 0xffffffff805981de at fork_exit+0x7e
#17 0xffffffff808fb28e at fork_trampoline+0xe
Uptime: 50s
lo1: changing name to 'ena0_shadow'
ena0_shadow: link state changed to UP
panic: m_apply, offset > size of mbuf chain
cpuid = 2
time = 1725536196
KDB: stack backtrace:
[...]
Uptime: 21s

ena0_shadow: link state changed to UP
panic: m_apply, offset > size of mbuf chain
cpuid = 0
time = 1725536338
KDB: stack backtrace:
[...]
Uptime: 1m38s
Rebooting...

ena0_shadow: link state changed to UP
panic: m_apply, offset > size of mbuf chain
cpuid = 6
time = 1725536448
KDB: stack backtrace:
[...]
Uptime: 1m7s
Rebooting...
This, however, is not 100% guaranteed, as it also happens sporadically.
I've added some debugging around those lines and it looks like this is a VERY rare event; I just cannot hit that condition intentionally. After doing some digging, it looks like th_x2 is only set in a few places in libalias, mostly in handlers for some rare/obsolete protocols (and the proxy feature).
~/projects/freebsd13]$ find sys/ -name \*.c | xargs grep 'th_x2 = [^0]'
sys/netinet/libalias/alias_ftp.c: tc->th_x2 = 1;
sys/netinet/libalias/alias_irc.c: tc->th_x2 = 1;
sys/netinet/libalias/alias_proxy.c: tc->th_x2 = 1;
sys/netinet/libalias/alias_skinny.c: tc->th_x2 = 1;
sys/netinet/libalias/alias_skinny.c: tc->th_x2 = 1;
sys/netinet/libalias/alias_skinny.c: tc->th_x2 = 1;
sys/netinet/libalias/alias_smedia.c: tc->th_x2 = 1;
sys/netinet/libalias/alias_smedia.c: tc->th_x2 = 1;
I have a strong suspicion that we might be hitting some bug in the handlers for those obsolete protocols, corrupting something and triggering the panic. The fact that we run it in reverse mode (i.e., with the "LAN" side facing the open internet), together with the observation that the panics happen in "batches" right after the interface is enabled and persist across several reboots, might indicate some form of junk traffic triggering an otherwise dormant bug. For the time being I am disabling all those handlers in our build and we'll see if it helps.