Out of order updates to rxd's is fixed in r315217. However, it is not
completely fixed.
While refilling the buffers, iflib is not considering the out of order
descriptors. Hence, it is refilling sequentially.
"idx" variable in _iflib_fl_refill routine is incremented sequentially.
By doing refilling sequentially, it will override the SGEs that
are *IN USE* by other connections.
Fix is to maintain a bitmap of rx descriptors and differentiate the used
one with unused one and refill only at the unused indices.
This patch also fixes a few bugs in bnxt, related to the same feature.
Details
Diff Detail
- Repository
- rS FreeBSD src repository - subversion
- Lint
Lint Not Applicable - Unit
Tests Not Applicable
Event Timeline
I really hate the idea of spreading the linux kpi, but I totally understand why you used it. We should probably move a lot of that stuff to a native interface.
Yes, I agree with you. Even I didn't like it. I also agree of moving all such helper functions to a native file.
Having said that, IMO, moving the helper functions to the native file in a separate patch would be more cleaner.
I will submit a separate patch for that based on your comments.
Hi Sean,
Could you please commit this patch, if there are no further review comments.
I will submit a different patch for Andrew Gallatin's comments.
Thanks,
Venkat.
Possibly i should be renamed since it needs to exist beyond the loop. But either that line isn't needed, or it needs to be fixed in other places.
sys/dev/bnxt/if_bnxt.c | ||
---|---|---|
553 โ | (On Diff #28241) | Why isn't i already this value, and if it isn't, why isn't the same thing needed below for line 561? |
I'm confused by phabricator's behavior. I thought I inserted an inline comment about the bitops file.
Original comment that I added:
"Can we not use the linuxkpi version here and use sys/bitops.h instead?"
sys/dev/bnxt/if_bnxt.c | ||
---|---|---|
1011 โ | (On Diff #28241) | This patch seems to be enabling TPA (hardware LRO). That's not listed in the review commit message. Is that something we want to indicate in this commit when it hits head? |
So the reason to not use bitset(9) is because the maximum size of the set is prohibitively large? I'm pretty sure it's limited by type to 16-bits, so 8k per list which doesn't seem too crazy, or am I missing something else?
So the reason to not use bitset(9) is because the maximum size of the set is prohibitively large? I'm pretty sure it's limited by type to 16-bits, so 8k per list which doesn't seem too crazy, or am I missing something else?
It looks like bitstring(3) is available in the kernel as well. I no longer see any reason to use the Linux KPI here.
Chenna,
Please commandeer this revision and work on the review comments.
Thanks,
Venkat.
bnxt: Enable HW LRO and Fix out-of-order updates to rxd's completely.
Out of order updates to rxd's is fixed in r315217. However, it is not
completely fixed.
While refilling the buffers, iflib is not considering the out of order
descriptors. Hence, it is refilling sequentially.
"idx" variable in _iflib_fl_refill routine is incremented sequentially.
By doing refilling sequentially, it will override the SGEs that
are *IN USE* by other connections.
Fix is to maintain a bitmap of rx descriptors and differentiate the used
one with unused one and refill only at the unused indices.
This patch also fixes a few bugs in bnxt, related to the same feature.
Version 2 changes:-
- Based on Stephen Hurd's comment, used bitstring API instead of Linux's bitops.
- Addressed following Stephen Hurd's comment.
- sys/dev/bnxt/if_bnxt.c
- 553
- Why isn't i already this value, and if it isn't,
- why isn't the same thing needed below for line 561?
- Addressed following Sean Bruno's comment.
- sys/dev/bnxt/if_bnxt.c
- 1011
- This patch seems to be enabling TPA (hardware LRO). That's not listed
- in the review commit message. Is that something we want to indicate in
- this commit when it hits head?
bit_free API doesn't exist as of today, seems like.
So, either you can introduce bit_free which essentially would a wrapper for "free".
or you can use "free" directly to free the bitmap memory.
Yeah, no need for a new wrapper. The reason to have a separate alloc function is so you can specify the size in bits, there's nothing a free() wrapper would add.
Venkat / Stephen Hurd,
Thanks, Taken care of your review comment by doing free(fl->ifl_rx_bitmap, M_IFLIB); Can you please review and commit the patch..
Hi,
Can you please commit this patch if no further reviews?
Thanks,
Bhargava Chenna, Marreddy
I have reverted this at svn revision 320059.
This is generating startup Lock Order Reversal errors (that I should have noted) and a kernel panic when adding elements to a bridge device.
LOR with GENERIC:
uma_zalloc_arg: zone "128" with the following non-sleepable locks held: exclusive sleep mutex em0 (iflib ctx lock) r = 0 (0xfffff80003fc2558) locked @ /home/sbruno/bsd/fbsd_head/sys/net/iflib.c:3611 stack backtrace: #0 0xffffffff80ab9700 at witness_debugger+0x70 #1 0xffffffff80abab2e at witness_warn+0x45e #2 0xffffffff80d4b48b at uma_zalloc_arg+0x3b #3 0xffffffff80a331e0 at malloc+0x1e0 #4 0xffffffff80b6413a at iflib_init_locked+0x6ba #5 0xffffffff80b6a745 at iflib_if_ioctl+0x765 #6 0xffffffff80bc4c84 at in_control+0x864 #7 0xffffffff80b4eda3 at ifioctl+0xe43 #8 0xffffffff8096bf78 at nfs_mount+0xbb8 #9 0xffffffff80b170a9 at vfs_donmount+0xef9 #10 0xffffffff80b19ce2 at kernel_mount+0x62 #11 0xffffffff80b1c214 at parse_mount+0x414 #12 0xffffffff80b1a6da at vfs_mountroot+0x5da #13 0xffffffff809f2cae at start_init+0x5e #14 0xffffffff80a19e14 at fork_exit+0x84 #15 0xffffffff80ec10ae at fork_trampoline+0xe uma_zalloc_arg: zone "128" with the following non-sleepable locks held: exclusive sleep mutex em0 (iflib ctx lock) r = 0 (0xfffff80003fc2558) locked @ /home/sbruno/bsd/fbsd_head/sys/net/iflib.c:3611 stack backtrace: #0 0xffffffff80ab9700 at witness_debugger+0x70 #1 0xffffffff80abab2e at witness_warn+0x45e #2 0xffffffff80d4b48b at uma_zalloc_arg+0x3b #3 0xffffffff80a331e0 at malloc+0x1e0 #4 0xffffffff80b6413a at iflib_init_locked+0x6ba #5 0xffffffff80b6a745 at iflib_if_ioctl+0x765 #6 0xffffffff80bc4c84 at in_control+0x864 #7 0xffffffff80b4eda3 at ifioctl+0xe43 #8 0xffffffff8096bf78 at nfs_mount+0xbb8 #9 0xffffffff80b170a9 at vfs_donmount+0xef9 #10 0xffffffff80b19ce2 at kernel_mount+0x62 #11 0xffffffff80b1c214 at parse_mount+0x414 #12 0xffffffff80b1a6da at vfs_mountroot+0x5da #13 0xffffffff809f2cae at start_init+0x5e #14 0xffffffff80a19e14 at fork_exit+0x84 #15 0xffffffff80ec10ae at fork_trampoline+0xe
Panic when adding a device to an active bridge device:
root@nfstest:~ # ifconfig bridge0 create bridge0: Ethernet address: 02:10:c9:5c:ea:00 root@nfstest:~ # ifconfig bridge0 addm em0 link state changed to down uma_zalloc_arg: zone "128" with the following non-sleepable locks held: exclusive sleep mutex em0 (iflib ctx lock) r = 0 (0xfffff80003fc2558) locked @ /home/sbruno/bsd/fbsd_head/sys/net/iflib.c:3848 stack backtrace: #0 0xffffffff80ab9700 at witness_debugger+0x70 #1 0xffffffff80abab2e at witness_warn+0x45e #2 0xffffffff80d4b48b at uma_zalloc_arg+0x3b #3 0xffffffff80a331e0 at malloc+0x1e0 #4 0xffffffff80b6413a at iflib_init_locked+0x6ba #5 0xffffffff80b6a6c7 at iflib_if_ioctl+0x6e7 #6 0xffffffff82424f66 at bridge_set_ifcap+0x66 #7 0xffffffff82424ea8 at bridge_mutecaps+0xe8 #8 0xffffffff824214a0 at bridge_ioctl_add+0x4a0 #9 0xffffffff824230be at bridge_ioctl+0x28e #10 0xffffffff80b4eda3 at ifioctl+0xe43 #11 0xffffffff80abf2ed at kern_ioctl+0x2cd #12 0xffffffff80abefaf at sys_ioctl+0x16f #13 0xffffffff80edef69 at amd64_syscall+0x549 #14 0xffffffff80ec0e5b at Xfast_syscall+0xfb uma_zalloc_arg: zone "128" with the following non-sleepable locks held: exclusive sleep mutex em0 (iflib ctx lock) r = 0 (0xfffff80003fc2558) locked @ /home/sbruno/bsd/fbsd_head/sys/net/iflib.c:3848 stack backtrace: #0 0xffffffff80ab9700 at witness_debugger+0x70 #1 0xffffffff80abab2e at witness_warn+0x45e #2 0xffffffff80d4b48b at uma_zalloc_arg+0x3b #3 0xffffffff80a331e0 at malloc+0x1e0 #4 0xffffffff80b6413a at iflib_init_locked+0x6ba #5 0xffffffff80b6a6c7 at iflib_if_ioctl+0x6e7 #6 0xffffffff82424f66 at bridge_set_ifcap+0x66 #7 0xffffffff82424ea8 at bridge_mutecaps+0xe8 #8 0xffffffff824214a0 at bridge_ioctl_add+0x4a0 #9 0xffffffff824230be at bridge_ioctl+0x28e #10 0xffffffff80b4eda3 at ifioctl+0xe43 #11 0xffffffff80abf2ed at kern_ioctl+0x2cd #12 0xffffffff80abefaf at sys_ioctl+0x16f #13 0xffffffff80edef69 at amd64_syscall+0x549 #14 0xffffffff80ec0e5b at Xfast_syscall+0xfb root@nfstest:~ # Link state changed to up Fatal trap 12: page fault while in kernel mode cpuid = 0; apic id = 00 fault virtual address = 0x0 fault code = supervisor write data, page not present instruction pointer = 0x20:0xffffffff80b6507b stack pointer = 0x28:0xfffffe00f55a4a40 frame pointer = 0x28:0xfffffe00f55a4b20 code segment = base 0x0, limit 0xfffff, type 0x1b = DPL 0, pres 1, long 1, def32 0, gran 1 processor eflags = interrupt enabled, resume, IOPL = 0 current process = 0 (if_io_tqg_0) [ thread pid 0 tid 100018 ] Stopped at _task_fn_rx+0x1bb: movq $0,(%rbx) db> bt Tracing pid 0 tid 100018 td 0xfffff80003cf1560 _task_fn_rx() at _task_fn_rx+0x1bb/frame 0xfffffe00f55a4b20 gtaskqueue_run_locked() at gtaskqueue_run_locked+0x129/frame 0xfffffe00f55a4b80 gtaskqueue_thread_loop() at gtaskqueue_thread_loop+0x88/frame 0xfffffe00f55a4bb0 fork_exit() at fork_exit+0x84/frame 0xfffffe00f55a4bf0 fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe00f55a4bf0 --- trap 0, rip = 0, rsp = 0, rbp = 0 ---
Sean Bruno, Thank you!! I'm able to reproduce those warning messages when compile Kernel with DEBUG option. I'll be looking into the issue and get back.
bnxt: Enable HW LRO and Fix out-of-order updates to rxd's completely.
Correction to earlier patch:
Realized that we have been invoking bit_alloc() by holding CTX_LOCK , i.e. from iflib_init_locked().
Corrected alloc and free invocations in following way, i.e. one from _register() and other from _deregister().
iflib_device_register() => iflib_queues_alloc() ==> bit_alloc(ifl_rx_bitmap)
iflib_device_deregister() ==> free(ifl_rx_bitmap)
Thanks for resolving this LOR.
Have you looked at the kernel crash when adding a device to a ethernet bridge?
Have you looked at the kernel crash when adding a device to a ethernet bridge?
<Chenna>
Yes, I tried Ethernet bridge also, no issues seen.
#ifconfig bridge0 create
#ifconfig bridge0 addm bnxt0
No Stacktrace / Crash seen.
Note: I do not have em0 in my server so tried with bnxt0, bge0 and bge1, no Crash.
I'm still seeing a panic with my em0
I'll have to do a backtrace here for you. This is what I see from a textdump only
6>tap0: promiscuous mode enabled
<5>bridge0: link state changed to DOWN
Link state changed to up
<5>em0: link state changed to UP
<5>bridge0: link state changed to UP
Fatal trap 12: page fault while in kernel mode
cpuid = 7; apic id = 07
fault virtual address = 0x0
fault code = supervisor write data, page not present
instruction pointer = 0x20:0xffffffff80b6b10b
stack pointer = 0x28:0xfffffe07bb6de900
frame pointer = 0x28:0xfffffe07bb6de9e0
code segment = base 0x0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 0 (if_io_tqg_7)
From a kernel + disassembly I get:
static struct mbuf *
iflib_rxd_pkt_get(iflib_rxq_t rxq, if_rxd_info_t ri)
{
struct if_rxsd sd; struct mbuf *m; /* should I merge this back in now that the two paths are basically dupl
icated? */
if (ri->iri_nfrags == 1 && ri->iri_frags[0].irf_len <= IFLIB_RX_COPY_THRESH) { rxd_frag_to_sd(rxq, &ri->iri_frags[0], FALSE, &sd);
0xffffffff80b6b0f5 <_task_fn_rx+421>: callq 0xffffffff80b70ee0 <rxd_frag_to_s
d>
m = *sd.ifsd_m;
0xffffffff80b6b0fa <_task_fn_rx+426>: mov -0xc8(%rbp),%rax
0xffffffff80b6b101 <_task_fn_rx+433>: mov (%rax),%rbx
*sd.ifsd_m = NULL;
0xffffffff80b6b104 <_task_fn_rx+436>: movq $0x0,(%rax)
0xffffffff80b6b10b <_task_fn_rx+443>: movq $0x0,(%rbx)
And from kgdb
#10 <signal handler called>
#11 m_init (how=1, type=1, m=<optimized out>, flags=<optimized out>)
at /home/db/svn/system/head/sys/sys/mbuf.h:721
#12 iflib_rxd_pkt_get (rxq=<optimized out>, ri=<optimized out>)
at /home/db/svn/system/head/sys/net/iflib.c:2432
#13 iflib_rxeof (budget=16, rxq=<optimized out>)
at /home/db/svn/system/head/sys/net/iflib.c:2529
#14 _task_fn_rx (context=0xfffff80007a72c00)
at /home/db/svn/system/head/sys/net/iflib.c:3510
#15 0xffffffff80a9ebc9 in gtaskqueue_run_locked (queue=0xfffff800076bdd00)
at /home/db/svn/system/head/sys/kern/subr_gtaskqueue.c:329
#16 0xffffffff80a9e958 in gtaskqueue_thread_loop (arg=<optimized out>)
at /home/db/svn/system/head/sys/kern/subr_gtaskqueue.c:504
#17 0xffffffff80a1fcd4 in fork_exit (
callout=0xffffffff80a9e8d0 <gtaskqueue_thread_loop>, arg=0xfffffe00010290b0, frame=0xfffffe07bb6deac0) at /home/db/svn/system/head/sys/kern/kern_fork.c:1038
#18 <signal handler called>
(kgdb)
bnxt(4) and em(4) are the only drivers that have implemented the IFLIB api. bge(4) does not. I'm assuming that there is a data structure element being dereferenced here that we need to initialize in em(4).
sys/dev/bnxt/bnxt_txrx.c | ||
---|---|---|
276 โ | (On Diff #29856) | What's this new field for? Are other drivers supposed to use it for things? |
Sean,
- I'm able to recreate the Crash with igb0 on my other setup.
- Root caused as fl->ifl_fragidx wasn't resetting (to 0) on iflib_fl_bufs_free().
- I quickly made the patch with that change. It would be great if you can verify it on your setup
- I need a day or Two to take care of other review comments.
Thanks for quick response, Diane.
I'll start working to address other review comments and submit the final patch soon.
Hopefully this is final patch, I have addressed all review comments.
sys/dev/bnxt/bnxt_txrx.c | ||
---|---|---|
276 โ | (On Diff #29856) | Driver really need this only when it can receive out-of-order completions, otherwise not needed. |
sys/net/iflib.c | ||
1803 โ | (On Diff #29856) | Sure, I'll consider this. |
1803 โ | (On Diff #29856) | Sure, I'll consider this. |
Hi Sean,
Could you please commit this patch, if there are no further review comments.