Page MenuHomeFreeBSD

bnxt: Enable HW LRO and Fix out-of-order updates to rxd's completely.
ClosedPublic

Authored by bhargava.marreddy_broadcom.com on May 11 2017, 2:57 PM.

Details

Summary

Out of order updates to rxd's is fixed in r315217. However, it is not
completely fixed.
While refilling the buffers, iflib is not considering the out of order
descriptors. Hence, it is refilling sequentially.
"idx" variable in _iflib_fl_refill routine is incremented sequentially.
By doing refilling sequentially, it will override the SGEs that
are *IN USE* by other connections.
Fix is to maintain a bitmap of rx descriptors and differentiate the used
one with unused one and refill only at the unused indices.
This patch also fixes a few bugs in bnxt, related to the same feature.

Diff Detail

Repository
rS FreeBSD src repository
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

sbruno requested changes to this revision.May 18 2017, 6:01 PM
This revision now requires changes to proceed.May 18 2017, 6:01 PM
gallatin accepted this revision.May 18 2017, 6:06 PM

I really hate the idea of spreading the linux kpi, but I totally understand why you used it. We should probably move a lot of that stuff to a native interface.

I really hate the idea of spreading the linux kpi, but I totally understand why you used it. We should probably move a lot of that stuff to a native interface.

Yes, I agree with you. Even I didn't like it. I also agree of moving all such helper functions to a native file.
Having said that, IMO, moving the helper functions to the native file in a separate patch would be more cleaner.
I will submit a separate patch for that based on your comments.

Hi Sean,
Could you please commit this patch, if there are no further review comments.
I will submit a different patch for Andrew Gallatin's comments.

Thanks,
Venkat.

shurd requested changes to this revision.May 23 2017, 8:01 PM

Possibly i should be renamed since it needs to exist beyond the loop. But either that line isn't needed, or it needs to be fixed in other places.

sys/dev/bnxt/if_bnxt.c
553 ↗(On Diff #28241)

Why isn't i already this value, and if it isn't, why isn't the same thing needed below for line 561?

I'm confused by phabricator's behavior. I thought I inserted an inline comment about the bitops file.

Original comment that I added:

"Can we not use the linuxkpi version here and use sys/bitops.h instead?"

sbruno added inline comments.May 23 2017, 8:15 PM
sys/dev/bnxt/if_bnxt.c
1011 ↗(On Diff #28241)

This patch seems to be enabling TPA (hardware LRO). That's not listed in the review commit message. Is that something we want to indicate in this commit when it hits head?

shurd added a subscriber: gallatin.May 23 2017, 8:35 PM

I really hate the idea of spreading the linux kpi, but I totally understand why you used it. We should probably move a lot of that stuff to a native interface.

So the reason to not use bitset(9) is because the maximum size of the set is prohibitively large? I'm pretty sure it's limited by type to 16-bits, so 8k per list which doesn't seem too crazy, or am I missing something else?

shurd added a comment.May 23 2017, 8:59 PM

So the reason to not use bitset(9) is because the maximum size of the set is prohibitively large? I'm pretty sure it's limited by type to 16-bits, so 8k per list which doesn't seem too crazy, or am I missing something else?

It looks like bitstring(3) is available in the kernel as well. I no longer see any reason to use the Linux KPI here.

Chenna,
Please commandeer this revision and work on the review comments.

Thanks,
Venkat.

bhargava.marreddy_broadcom.com edited edge metadata.

bnxt: Enable HW LRO and Fix out-of-order updates to rxd's completely.

Out of order updates to rxd's is fixed in r315217. However, it is not
completely fixed.
While refilling the buffers, iflib is not considering the out of order
descriptors. Hence, it is refilling sequentially.
"idx" variable in _iflib_fl_refill routine is incremented sequentially.
By doing refilling sequentially, it will override the SGEs that
are *IN USE* by other connections.
Fix is to maintain a bitmap of rx descriptors and differentiate the used
one with unused one and refill only at the unused indices.
This patch also fixes a few bugs in bnxt, related to the same feature.

Version 2 changes:-

  1. Based on Stephen Hurd's comment, used bitstring API instead of Linux's bitops.
  1. Addressed following Stephen Hurd's comment.
    1. sys/dev/bnxt/if_bnxt.c
    2. 553
    3. Why isn't i already this value, and if it isn't,
    4. why isn't the same thing needed below for line 561?
  1. Addressed following Sean Bruno's comment.
    1. sys/dev/bnxt/if_bnxt.c
    2. 1011
    3. This patch seems to be enabling TPA (hardware LRO). That's not listed
    4. in the review commit message. Is that something we want to indicate in
    5. this commit when it hits head?
venkatkumar.duvvuru_broadcom.com requested changes to this revision.May 25 2017, 12:57 PM

bit_free API doesn't exist as of today, seems like.
So, either you can introduce bit_free which essentially would a wrapper for "free".
or you can use "free" directly to free the bitmap memory.

This revision now requires changes to proceed.May 25 2017, 12:57 PM
shurd added a comment.May 25 2017, 6:33 PM

bit_free API doesn't exist as of today, seems like.
So, either you can introduce bit_free which essentially would a wrapper for "free".
or you can use "free" directly to free the bitmap memory.

Yeah, no need for a new wrapper. The reason to have a separate alloc function is so you can specify the size in bits, there's nothing a free() wrapper would add.

bhargava.marreddy_broadcom.com retitled this revision from bnxt: Fix out-of-order updates to rxd's completely. to bnxt: Enable HW LRO and Fix out-of-order updates to rxd's completely..May 25 2017, 8:02 PM
bhargava.marreddy_broadcom.com edited edge metadata.

Venkat / Stephen Hurd,

Thanks, Taken care of your review comment by doing free(fl->ifl_rx_bitmap, M_IFLIB);
Can you please review and commit the patch..

Hi,

Can you please commit this patch if no further reviews?

Thanks,
Bhargava Chenna, Marreddy

shurd accepted this revision.Jun 7 2017, 8:32 PM
sbruno accepted this revision.Jun 15 2017, 7:16 PM
This revision is now accepted and ready to land.Jun 15 2017, 7:16 PM
This revision was automatically updated to reflect the committed changes.
sbruno reopened this revision.Jun 17 2017, 5:43 PM

I have reverted this at svn revision 320059.

This is generating startup Lock Order Reversal errors (that I should have noted) and a kernel panic when adding elements to a bridge device.

LOR with GENERIC:

uma_zalloc_arg: zone "128" with the following non-sleepable locks held:
exclusive sleep mutex em0 (iflib ctx lock) r = 0 (0xfffff80003fc2558) locked @ /home/sbruno/bsd/fbsd_head/sys/net/iflib.c:3611
stack backtrace:
#0 0xffffffff80ab9700 at witness_debugger+0x70
#1 0xffffffff80abab2e at witness_warn+0x45e
#2 0xffffffff80d4b48b at uma_zalloc_arg+0x3b
#3 0xffffffff80a331e0 at malloc+0x1e0
#4 0xffffffff80b6413a at iflib_init_locked+0x6ba
#5 0xffffffff80b6a745 at iflib_if_ioctl+0x765
#6 0xffffffff80bc4c84 at in_control+0x864
#7 0xffffffff80b4eda3 at ifioctl+0xe43
#8 0xffffffff8096bf78 at nfs_mount+0xbb8
#9 0xffffffff80b170a9 at vfs_donmount+0xef9
#10 0xffffffff80b19ce2 at kernel_mount+0x62
#11 0xffffffff80b1c214 at parse_mount+0x414
#12 0xffffffff80b1a6da at vfs_mountroot+0x5da
#13 0xffffffff809f2cae at start_init+0x5e
#14 0xffffffff80a19e14 at fork_exit+0x84
#15 0xffffffff80ec10ae at fork_trampoline+0xe
uma_zalloc_arg: zone "128" with the following non-sleepable locks held:
exclusive sleep mutex em0 (iflib ctx lock) r = 0 (0xfffff80003fc2558) locked @ /home/sbruno/bsd/fbsd_head/sys/net/iflib.c:3611
stack backtrace:
#0 0xffffffff80ab9700 at witness_debugger+0x70
#1 0xffffffff80abab2e at witness_warn+0x45e
#2 0xffffffff80d4b48b at uma_zalloc_arg+0x3b
#3 0xffffffff80a331e0 at malloc+0x1e0
#4 0xffffffff80b6413a at iflib_init_locked+0x6ba
#5 0xffffffff80b6a745 at iflib_if_ioctl+0x765
#6 0xffffffff80bc4c84 at in_control+0x864
#7 0xffffffff80b4eda3 at ifioctl+0xe43
#8 0xffffffff8096bf78 at nfs_mount+0xbb8
#9 0xffffffff80b170a9 at vfs_donmount+0xef9
#10 0xffffffff80b19ce2 at kernel_mount+0x62
#11 0xffffffff80b1c214 at parse_mount+0x414
#12 0xffffffff80b1a6da at vfs_mountroot+0x5da
#13 0xffffffff809f2cae at start_init+0x5e
#14 0xffffffff80a19e14 at fork_exit+0x84
#15 0xffffffff80ec10ae at fork_trampoline+0xe

Panic when adding a device to an active bridge device:

root@nfstest:~ # ifconfig bridge0 create
bridge0: Ethernet address: 02:10:c9:5c:ea:00
root@nfstest:~ # ifconfig bridge0 addm em0
link state changed to down
uma_zalloc_arg: zone "128" with the following non-sleepable locks held:
exclusive sleep mutex em0 (iflib ctx lock) r = 0 (0xfffff80003fc2558) locked @ /home/sbruno/bsd/fbsd_head/sys/net/iflib.c:3848
stack backtrace:
#0 0xffffffff80ab9700 at witness_debugger+0x70
#1 0xffffffff80abab2e at witness_warn+0x45e
#2 0xffffffff80d4b48b at uma_zalloc_arg+0x3b
#3 0xffffffff80a331e0 at malloc+0x1e0
#4 0xffffffff80b6413a at iflib_init_locked+0x6ba
#5 0xffffffff80b6a6c7 at iflib_if_ioctl+0x6e7
#6 0xffffffff82424f66 at bridge_set_ifcap+0x66
#7 0xffffffff82424ea8 at bridge_mutecaps+0xe8
#8 0xffffffff824214a0 at bridge_ioctl_add+0x4a0
#9 0xffffffff824230be at bridge_ioctl+0x28e
#10 0xffffffff80b4eda3 at ifioctl+0xe43
#11 0xffffffff80abf2ed at kern_ioctl+0x2cd
#12 0xffffffff80abefaf at sys_ioctl+0x16f
#13 0xffffffff80edef69 at amd64_syscall+0x549
#14 0xffffffff80ec0e5b at Xfast_syscall+0xfb
uma_zalloc_arg: zone "128" with the following non-sleepable locks held:
exclusive sleep mutex em0 (iflib ctx lock) r = 0 (0xfffff80003fc2558) locked @ /home/sbruno/bsd/fbsd_head/sys/net/iflib.c:3848
stack backtrace:
#0 0xffffffff80ab9700 at witness_debugger+0x70
#1 0xffffffff80abab2e at witness_warn+0x45e
#2 0xffffffff80d4b48b at uma_zalloc_arg+0x3b
#3 0xffffffff80a331e0 at malloc+0x1e0
#4 0xffffffff80b6413a at iflib_init_locked+0x6ba
#5 0xffffffff80b6a6c7 at iflib_if_ioctl+0x6e7
#6 0xffffffff82424f66 at bridge_set_ifcap+0x66
#7 0xffffffff82424ea8 at bridge_mutecaps+0xe8
#8 0xffffffff824214a0 at bridge_ioctl_add+0x4a0
#9 0xffffffff824230be at bridge_ioctl+0x28e
#10 0xffffffff80b4eda3 at ifioctl+0xe43
#11 0xffffffff80abf2ed at kern_ioctl+0x2cd
#12 0xffffffff80abefaf at sys_ioctl+0x16f
#13 0xffffffff80edef69 at amd64_syscall+0x549
#14 0xffffffff80ec0e5b at Xfast_syscall+0xfb
root@nfstest:~ # Link state changed to up


Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address   = 0x0
fault code              = supervisor write data, page not present
instruction pointer     = 0x20:0xffffffff80b6507b
stack pointer           = 0x28:0xfffffe00f55a4a40
frame pointer           = 0x28:0xfffffe00f55a4b20
code segment            = base 0x0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 0 (if_io_tqg_0)
[ thread pid 0 tid 100018 ]
Stopped at      _task_fn_rx+0x1bb:      movq    $0,(%rbx)
db> bt
Tracing pid 0 tid 100018 td 0xfffff80003cf1560
_task_fn_rx() at _task_fn_rx+0x1bb/frame 0xfffffe00f55a4b20
gtaskqueue_run_locked() at gtaskqueue_run_locked+0x129/frame 0xfffffe00f55a4b80
gtaskqueue_thread_loop() at gtaskqueue_thread_loop+0x88/frame 0xfffffe00f55a4bb0
fork_exit() at fork_exit+0x84/frame 0xfffffe00f55a4bf0
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe00f55a4bf0
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---
sbruno requested changes to this revision.Jun 17 2017, 6:35 PM
This revision now requires changes to proceed.Jun 17 2017, 6:35 PM
Sean Bruno,

Thank you!!
I'm able to reproduce those warning messages when compile Kernel with DEBUG option.
I'll be looking into the issue and get back.
db added a subscriber: db.Jun 19 2017, 6:13 PM
bhargava.marreddy_broadcom.com edited edge metadata.

bnxt: Enable HW LRO and Fix out-of-order updates to rxd's completely.

Correction to earlier patch:
Realized that we have been invoking bit_alloc() by holding CTX_LOCK , i.e. from iflib_init_locked().
Corrected alloc and free invocations in following way, i.e. one from _register() and other from _deregister().

iflib_device_register() => iflib_queues_alloc() ==> bit_alloc(ifl_rx_bitmap)
iflib_device_deregister() ==> free(ifl_rx_bitmap)

bnxt: Enable HW LRO and Fix out-of-order updates to rxd's completely.

Correction to earlier patch:
Realized that we have been invoking bit_alloc() by holding CTX_LOCK , i.e. from iflib_init_locked().
Corrected alloc and free invocations in following way, i.e. one from _register() and other from _deregister().

iflib_device_register() => iflib_queues_alloc() ==> bit_alloc(ifl_rx_bitmap)
iflib_device_deregister() ==> free(ifl_rx_bitmap)

Thanks for resolving this LOR.

Have you looked at the kernel crash when adding a device to a ethernet bridge?

Have you looked at the kernel crash when adding a device to a ethernet bridge?

<Chenna>
Yes, I tried Ethernet bridge also, no issues seen.
#ifconfig bridge0 create
#ifconfig bridge0 addm bnxt0
No Stacktrace / Crash seen.

Note: I do not have em0 in my server so tried with bnxt0, bge0 and bge1, no Crash.

db added a comment.EditedJun 20 2017, 7:51 PM

Have you looked at the kernel crash when adding a device to a ethernet bridge?

<Chenna>
Yes, I tried Ethernet bridge also, no issues seen.
#ifconfig bridge0 create
#ifconfig bridge0 addm bnxt0
No Stacktrace / Crash seen.

Note: I do not have em0 in my server so tried with bnxt0, bge0 and bge1, no Crash.

I'm still seeing a panic with my em0

I'll have to do a backtrace here for you. This is what I see from a textdump only

6>tap0: promiscuous mode enabled
<5>bridge0: link state changed to DOWN
Link state changed to up
<5>em0: link state changed to UP
<5>bridge0: link state changed to UP

Fatal trap 12: page fault while in kernel mode
cpuid = 7; apic id = 07
fault virtual address = 0x0
fault code = supervisor write data, page not present
instruction pointer = 0x20:0xffffffff80b6b10b
stack pointer = 0x28:0xfffffe07bb6de900
frame pointer = 0x28:0xfffffe07bb6de9e0
code segment = base 0x0, limit 0xfffff, type 0x1b

= DPL 0, pres 1, long 1, def32 0, gran 1

processor eflags = interrupt enabled, resume, IOPL = 0
current process = 0 (if_io_tqg_7)

From a kernel + disassembly I get:

static struct mbuf *
iflib_rxd_pkt_get(iflib_rxq_t rxq, if_rxd_info_t ri)
{

struct if_rxsd sd;
struct mbuf *m;

/* should I merge this back in now that the two paths are basically dupl

icated? */

if (ri->iri_nfrags == 1 &&
    ri->iri_frags[0].irf_len <= IFLIB_RX_COPY_THRESH) {
        rxd_frag_to_sd(rxq, &ri->iri_frags[0], FALSE, &sd);

0xffffffff80b6b0f5 <_task_fn_rx+421>: callq 0xffffffff80b70ee0 <rxd_frag_to_s
d>

m = *sd.ifsd_m;

0xffffffff80b6b0fa <_task_fn_rx+426>: mov -0xc8(%rbp),%rax
0xffffffff80b6b101 <_task_fn_rx+433>: mov (%rax),%rbx

*sd.ifsd_m = NULL;

0xffffffff80b6b104 <_task_fn_rx+436>: movq $0x0,(%rax)
0xffffffff80b6b10b <_task_fn_rx+443>: movq $0x0,(%rbx)

And from kgdb
#10 <signal handler called>
#11 m_init (how=1, type=1, m=<optimized out>, flags=<optimized out>)

at /home/db/svn/system/head/sys/sys/mbuf.h:721

#12 iflib_rxd_pkt_get (rxq=<optimized out>, ri=<optimized out>)

at /home/db/svn/system/head/sys/net/iflib.c:2432

#13 iflib_rxeof (budget=16, rxq=<optimized out>)

at /home/db/svn/system/head/sys/net/iflib.c:2529

#14 _task_fn_rx (context=0xfffff80007a72c00)

at /home/db/svn/system/head/sys/net/iflib.c:3510

#15 0xffffffff80a9ebc9 in gtaskqueue_run_locked (queue=0xfffff800076bdd00)

at /home/db/svn/system/head/sys/kern/subr_gtaskqueue.c:329

#16 0xffffffff80a9e958 in gtaskqueue_thread_loop (arg=<optimized out>)

at /home/db/svn/system/head/sys/kern/subr_gtaskqueue.c:504

#17 0xffffffff80a1fcd4 in fork_exit (

callout=0xffffffff80a9e8d0 <gtaskqueue_thread_loop>, 
arg=0xfffffe00010290b0, frame=0xfffffe07bb6deac0)
at /home/db/svn/system/head/sys/kern/kern_fork.c:1038

#18 <signal handler called>
(kgdb)

Have you looked at the kernel crash when adding a device to a ethernet bridge?

<Chenna>
Yes, I tried Ethernet bridge also, no issues seen.
#ifconfig bridge0 create
#ifconfig bridge0 addm bnxt0
No Stacktrace / Crash seen.

Note: I do not have em0 in my server so tried with bnxt0, bge0 and bge1, no Crash.

bnxt(4) and em(4) are the only drivers that have implemented the IFLIB api. bge(4) does not. I'm assuming that there is a data structure element being dereferenced here that we need to initialize in em(4).

sbruno added inline comments.Jun 20 2017, 9:02 PM
sys/net/iflib.c
385 ↗(On Diff #29856)

There appears to be an additional semi-colon here.

1803 ↗(On Diff #29856)

For readability, please make changes on a new line of declaration instead of inserting the new variable into the middle of the existing line.

erj added a subscriber: erj.Jun 21 2017, 12:28 AM
erj added inline comments.
sys/dev/bnxt/bnxt_txrx.c
276 ↗(On Diff #29856)

What's this new field for? Are other drivers supposed to use it for things?

Sean,

  1. I'm able to recreate the Crash with igb0 on my other setup.
  2. Root caused as fl->ifl_fragidx wasn't resetting (to 0) on iflib_fl_bufs_free().
  3. I quickly made the patch with that change. It would be great if you can verify it on your setup
  4. I need a day or Two to take care of other review comments.
db added a comment.Jun 21 2017, 5:18 PM

And that last diff works fine. No panic.

In D10681#233855, @db wrote:

And that last diff works fine. No panic.

Thanks for quick response, Diane.
I'll start working to address other review comments and submit the final patch soon.

Hopefully this is final patch, I have addressed all review comments.

sys/dev/bnxt/bnxt_txrx.c
276 ↗(On Diff #29856)

Driver really need this only when it can receive out-of-order completions, otherwise not needed.
frag_idxs will be used to construct 'opaque' value in rxbd, and later to retrieve in Receive path.

sys/net/iflib.c
1803 ↗(On Diff #29856)

Sure, I'll consider this.

1803 ↗(On Diff #29856)

Sure, I'll consider this.

Hi Sean,
Could you please commit this patch, if there are no further review comments.

sbruno accepted this revision.Jul 3 2017, 3:03 PM
This revision is now accepted and ready to land.Jul 3 2017, 3:03 PM
This revision was automatically updated to reflect the committed changes.