Bnxt: Add support for displaying Hardware port queue stats using sysctl.
ClosedPublic

Authored by bhargava.marreddy_broadcom.com on Aug 7 2017, 11:29 AM.

Details

Summary

Added support for displaying HW port queue stats using sysctl.

Diff Detail

Repository
rS FreeBSD src repository
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.
sbruno requested changes to this revision.Aug 7 2017, 4:05 PM

IFLIB provides an ifdi_update_admin_status() function that is called periodically so that you don't have to manage the callout yourself. sys/dev/e1000/if_em.c does this as an example. Can this be changed to match the IFLIB implementation?

This revision now requires changes to proceed.Aug 7 2017, 4:05 PM
IFLIB provides an ifdi_update_admin_status() function that is called periodically so that you don't have to manage the callout yourself. sys/dev/e1000/if_em.c does this as an example. Can this be changed to match the IFLIB implementation?

Sean Bruno,

Thanks for the review.

I played with ifdi_update_admin_status() and noticed that iflib invokes it only four times (my requirement is to invoke it once every second). Can you please help?

Thanks,
Chenna.

Sean Bruno,

  1. Even with the e1000 driver, ifdi_update_admin_status() is not invoked continuously at uniform intervals. I verified this by modifying the e1000 driver (on a somewhat older FreeBSD kernel).
  2. ifdi_timer() seems to serve that purpose. Can I use ifdi_timer()?
  3. If yes, ifdi_timer() with (qid == 0) is invoked twice every second. Since my requirement is only once per second, can I add checks using jiffies? For that I would have to include <linux/jiffies.h> and use the time_before() and msecs_to_jiffies() APIs.

Thanks

Latest e1000 driver seems to be following this ...

  1. em_if_timer() invokes iflib_admin_intr_deferred()
  2. iflib_admin_intr_deferred() triggers em_if_update_admin_status() to be invoked.

Can I follow the same approach in my driver?

sbruno added a comment.Aug 9 2017, 4:07 PM

Latest e1000 driver seems to be following this ...

  1. em_if_timer() invokes iflib_admin_intr_deferred()
  2. iflib_admin_intr_deferred() triggers em_if_update_admin_status() to be invoked.

    Can I follow the same approach in my driver?

Yes please. See if that does what you want it to do.

Yes please. See if that does what you want it to do.

Thanks for the confirmation.
Another question I have is -
ifdi_timer() with (qid == 0) is invoked twice every second. Since my requirement is only once per second, can I add checks using jiffies?
For that I would have to include <linux/jiffies.h> and also use the time_before() and msecs_to_jiffies() APIs.

bhargava.marreddy_broadcom.com edited edge metadata.
  1. ifdi_timer() is still invoked twice per second.
  2. I couldn't use jiffies in the bnxt driver (unable to include <linux/jiffies.h>).
  3. I am ignoring the second call (of ifdi_timer()), so HW stats are fetched once per second.

Hi Sean,

Could you please commit this patch if there are no further review comments?
shurd edited edge metadata.Aug 16 2017, 9:14 PM

I think it would be much better to do the stats query at most once per second rather than every second. If you only update when the first sysctl is hit, and use the cached values for the next second, there would be at most the same number of queries and if nobody is querying them, there would be no unnecessary HWRM calls and DMAs.

It would also avoid the case where the query occurs while a value is being read, so there wouldn't be a possibility of inconsistent/incorrect values when the DMA is in progress.
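The caching scheme described above can be sketched in a small userspace model (hypothetical names; in the driver the fetch would be bnxt_hwrm_port_qstats() and the window would be measured with ticks and hz):

```c
#include <assert.h>
#include <stdint.h>

/* Toy model of "query at most once per second": a read within the
 * one-second window uses the cached value; an older cache triggers a
 * (simulated) hardware query. */
struct stats_cache {
    uint64_t cached_val;
    int last_fetch;   /* tick of last hardware query, -1 = never */
    int hwrm_calls;   /* number of simulated HWRM queries issued */
};

#define TOY_HZ 100    /* ticks per second in this toy model */

static uint64_t
read_stat(struct stats_cache *c, int now)
{
    if (c->last_fetch < 0 || now - c->last_fetch >= TOY_HZ) {
        c->hwrm_calls++;      /* stands in for the HWRM call + DMA */
        c->cached_val = 42;   /* stands in for the DMA'd counter */
        c->last_fetch = now;
    }
    return c->cached_val;
}
```

With this shape, a burst of sysctl reads within one second costs a single HWRM call, and if nobody reads the stats, no calls are made at all.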

sys/dev/bnxt/bnxt_hwrm.c
808 ↗(On Diff #31867)

Why 512 here? Should this be a sizeof() or offsetof() or XXX - sizeof(struct rx_port_stats) instead?

sys/dev/bnxt/if_bnxt.c
487 ↗(On Diff #31867)

Why allocate an extra 1k here? Is there an expectation that these structures will grow? If so, why not round up to the nearest page?

Stephen,

Thank you for the review.

It would also avoid the case where the query occurs while a value is being read,
so there wouldn't be a possibility of inconsistent/incorrect values when the DMA is in progress.

<Chenna>
I agree with your point, can you please clarify these..

  1. Can I use jiffies / ticks to maintain a minimum 1-second delay between HWRM command invocations?
  2. I'm currently using the SYSCTL_ADD_QUAD() API for individual stat members, for which no *callback handler* is registered, so the driver will not know when the user invokes a sysctl command to read the stats. Thus, I end up with no context in the driver for the HWRM command invocation.

     Can you please share your thoughts on how to overcome this?

     How would it be if I used the SYSCTL_ADD_PROC() API (instead of SYSCTL_ADD_QUAD()) for all stat members (there are around 120 of them) with the *same handler*, and invoked HWRM (from that handler) based on a jiffies / ticks check for a minimum 1-second delay?
  3. Can you please point me to any references for this implementation?

Thanks,
Chenna.

shurd added a comment.Aug 17 2017, 4:10 PM

Stephen,

I agree with your point, can you please clarify these..

  1. Can I use jiffies / ticks to maintain a minimum 1-second delay between HWRM command invocations?

You can use ticks and hz to do so. You'll likely want to track the expiration yourself (expire = ticks + hz). Do not use jiffies.

  2. I'm currently using the SYSCTL_ADD_QUAD() API for individual stat members, for which no *callback handler* is registered, so the driver will not know when the user invokes a sysctl command to read the stats. Thus, I end up with no context in the driver for the HWRM command invocation.

     Can you please share your thoughts on how to overcome this?

     How would it be if I used the SYSCTL_ADD_PROC() API (instead of SYSCTL_ADD_QUAD()) for all stat members (there are around 120 of them) with the *same handler*, and invoked HWRM (from that handler) based on a jiffies / ticks check for a minimum 1-second delay?

Exactly, though it may be cleaner to have 120 small functions that all call the same function, something like this for each value:

static int
bnxt_get_rx_1519b_2047b_frames(SYSCTL_HANDLER_ARGS)
{
    struct bnxt_softc *softc = arg1;
    uint64_t val;
    int rc;

    rc = bnxt_sysctl_get_stat(softc, &val,
        offsetof(struct rx_port_stats, rx_1519b_2047b_frames), sizeof(val));
    if (rc == 0) {
        val = le64toh(val);
        rc = sysctl_handle_64(oidp, &val, 0, req);
    }
    return rc;
}

These can likely be done with a macro... then this:

static int
bnxt_sysctl_get_stat(struct bnxt_softc *softc, void *ptr, size_t offset, size_t sz)
{
    int rc = 0;

    mtx_lock(&softc->stats_mtx);
    if (ticks > softc->stats_expire) {
        rc = bnxt_hwrm_port_qstats(softc);
        softc->stats_expire = ticks + hz;
    }
    memcpy(ptr, (uint8_t *)softc->rx_port_stats + offset, sz);
    mtx_unlock(&softc->stats_mtx);

    return rc;
}

Initializing the mutex, stats_expire, etc. would need to be done when the softc is initialized. The offset calculation would need to be more complex because of the padding.
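The "120 small functions" approach above can indeed be macro-generated. A rough userspace illustration (stand-in struct and names; in the driver the generated body would call bnxt_sysctl_get_stat() and sysctl_handle_64() instead of copying from a buffer):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for struct rx_port_stats from bnxt_hsi.h. */
struct rx_port_stats_demo {
    uint64_t rx_64b_frames;
    uint64_t rx_1519b_2047b_frames;
};

/* One macro invocation per counter generates a small accessor that
 * locates the field by offsetof(), just as the sysctl handlers would. */
#define BNXT_STAT_GETTER(name)                                          \
static uint64_t                                                         \
get_##name(const void *stats)                                           \
{                                                                       \
    uint64_t val;                                                       \
    memcpy(&val, (const uint8_t *)stats +                               \
        offsetof(struct rx_port_stats_demo, name), sizeof(val));        \
    return val;                                                         \
}

BNXT_STAT_GETTER(rx_64b_frames)
BNXT_STAT_GETTER(rx_1519b_2047b_frames)
```

Each generated function carries the field name, so the per-counter boilerplate collapses to one macro line per statistic.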

  3. Can you please point me to any references for this implementation?

I can't think of any off-hand.

shurd added a comment.Aug 17 2017, 4:15 PM

Actually, you still want to use sysctl_handle_*:

static int
bnxt_get_rx_1519b_2047b_frames(SYSCTL_HANDLER_ARGS)
{
    struct bnxt_softc *softc = arg1;
    uint64_t val = UINT64_MAX;

    if (bnxt_sysctl_get_stat(softc, &val,
        offsetof(struct rx_port_stats, rx_1519b_2047b_frames), sizeof(val)) == 0)
        val = le64toh(val);
    return sysctl_handle_64(oidp, &val, 0, req);
}

Stephen,

Thank you for the detailed explanation.
Let me revise patch and get back.

Thanks,
Chenna.

Hi Stephen,

Thanks again for the review and details.

  1. Broadcom's other NIC/L2 drivers (e.g., Linux and ESXi) invoke the stats HWRM *every second*. The idea behind this design is that the driver learns within one second whenever the firmware goes unresponsive (HWRM timeouts), so the driver can take precautions to avoid crashes due to a firmware stall.

     Based on internal discussion, Broadcom prefers to have a common model across OSes.
  2. I quickly went through the implementations of other FreeBSD drivers. Even e1000 and oce do the stats copy every second (instead of on demand).
  3. If possible, we would like to go with the "invoke stats HWRM every second" model for now and revisit it in the next phase.

if nobody is querying them, there would be no unnecessary HWRM calls and DMAs.

<Chenna> In any case, we need some HWRM command for a health check, and that is a lightweight HWRM call.
There should be no performance impact because of it.

Why 512 here? Should this be a sizeof() or offsetof() or XXX - sizeof(struct rx_port_stats) instead?
Why allocate an extra 1k here? Is there an expectation that these structures will grow? If so, why not round up to the nearest page?

<Chenna> Thanks, I'll address these once design is finalized.

Can you kindly help us in this regard.

Thanks,
Chenna.

shurd added a comment.Aug 21 2017, 4:42 PM
  1. Broadcom's other NIC/L2 drivers (e.g., Linux and ESXi) invoke the stats HWRM *every second*. The idea behind this design is that the driver learns within one second whenever the firmware goes unresponsive (HWRM timeouts), so the driver can take precautions to avoid crashes due to a firmware stall.

What precautions are being taken by the driver when the firmware stalls? I didn't think there was anything the driver could do other than take down the interface, which most would prefer to delay until it affects traffic (such as not being able to apply a setting).

  2. I quickly went through the implementations of other FreeBSD drivers. Even e1000 and oce do the stats copy every second (instead of on demand).

I believe those are flying the hardware stats flag, so the stats are not only being consumed via the sysctl interface.

  3. If possible, we would like to go with the "invoke stats HWRM every second" model for now and revisit it in the next phase.

It's absolutely possible, but a technical reason holds more weight with the community than internal vendor preferences. If you actually do something about the firmware stalls for example, that would make the argument for a once-per-second poll stronger.

if nobody is querying them, there would be no unnecessary HWRM calls and DMAs.

<Chenna> In any case, we need some HWRM command for a health check, and that is a lightweight HWRM call.

Again, a health check would need to be more than a firmware error in the log every second, which is what it appears would happen if the firmware actually did stall with this change. Should we expect something like that soon?

There should be no performance impact because of it.

Possibly these stats and checks were intended to be limited to the PFs only then? In a worst-case scenario where max VFs are all configured to forward messages, that would generate over 512 HWRM requests per second. That alone would likely swamp the HWRM and cause it to appear to be stalled (ie: it will take longer than the timeout to service requests).

Maybe have PFs query once per second, and the VFs use the timeout mechanism? Or simply make the sysctls not available for VFs and have the VFs not query?

I didn't think there was anything the driver could do other than take down the interface,
which most would prefer to delay until it affects traffic (such as not being able to apply a setting).

<Chenna>

  1. Detecting the problem and taking down the interface ASAP can avoid disconnections, especially in teaming / bonding configurations. Am I correct?
  2. In the case of recoverable firmware errors, recovering ASAP is preferable, right?

I believe those are flying the hardware stats flag, so the stats are not only being consumed via the sysctl interface.

<Chenna> All I mean to say is that those drivers are not invoking HWRM / _READ_REG in the sysctl read context; they are doing it every second.

Possibly these stats and checks were intended to be limited to the PFs only then?

that would generate over 512 HWRM requests per second.

<Chenna>
I totally agree with your point; thanks for the heads-up.
But currently we are going to support only PFs; there are no plans for SR-IOV at present.
We will certainly consider this point in the future when we implement VF support.

If you actually do something about the firmware stalls for example, that would make the argument for a once-per-second poll stronger.

<Chenna> Teaming / bonding is one valid case (as explained above) for the once-per-second poll, and the other plan we have is recoverable FW errors; we are working with the firmware team on that.

shurd added a comment.Aug 22 2017, 4:59 PM

I didn't think there was anything the driver could do other than take down the interface,
which most would prefer to delay until it affects traffic (such as not being able to apply a setting).

  1. Detecting the problem and taking down the interface ASAP can avoid disconnections, especially in teaming / bonding configurations. Am I correct?

I don't really understand what you mean, what disconnections would be avoided by taking down the interface? Generally when the firmware stalls, the fastpath continues to run with the last configuration, and it's only HWRM requests that fail.

  2. In the case of recoverable firmware errors, recovering ASAP is preferable, right?

In general, yes. However, if recovery involves resetting the interface, it's possible that delaying recovery until an HWRM request is needed is preferable. It really depends on how disruptive the recovery mechanism is, and if the chip continues passing traffic correctly when the firmware stalls.

I believe those are flying the hardware stats flag, so the stats are not only being consumed via the sysctl interface.

<Chenna> All I mean to say is that those drivers are not invoking HWRM / _READ_REG in the sysctl read context; they are doing it every second.

Right, my point was that they may have to because they use hardware instead of software statistics, so sysctl may not be the only consumer of the information.

Possibly these stats and checks were intended to be limited to the PFs only then?

that would generate over 512 HWRM requests per second.

I totally agree with your point; thanks for the heads-up.
But currently we are going to support only PFs; there are no plans for SR-IOV at present.

The driver currently supports running on a VF (for instance, in a qemu VM with PCI pass-through on a Linux host) and has been tested in this configuration. While it doesn't support creating VFs when it is attached to the PF, VF behaviour still needs to be taken into account.

If you actually do something about the firmware stalls for example, that would make the argument for a once-per-second poll stronger.

Teaming / bonding is one valid case (as explained above) for the once-per-second poll, and the other plan we have is recoverable FW errors; we are working with the firmware team on that.

That's fine. Doing the once-per-second poll now and adding recovery later works for the time being. But you likely want to wrap the sysctl creation and polling in an

if (BNXT_PF(softc)) {
}

block.

Hi Stephen,

Generally when the firmware stalls, the fastpath continues to run

<Chenna>
Based on my experience with the oce NIC, the fastpath also gets impacted (stalls) as part of a firmware stall.
I have also seen the same behavior with the bnxt NIC; maybe it depends on what kind of firmware stall it is.

But you likely want to wrap the sysctl creation and polling in an

<Chenna>
Sure, I'll do the changes and update the patch.

Thanks,
Chenna.

sys/dev/bnxt/bnxt_hwrm.c
808 ↗(On Diff #31867)

Stephen,

Those (the additional 512 bytes on each struct) are intended for future expansion.

Let me explain: newer firmware may add additional counters to these structures, and an older driver may not have the new counters defined in bnxt_hsi.h. So having some extra room at the end of tx_port_stats and rx_port_stats will prevent any corruption.

Can I add a comment in code?

shurd added inline comments.Aug 31 2017, 6:35 PM
sys/dev/bnxt/bnxt_hwrm.c
808 ↗(On Diff #31867)

So why not have two separate allocations, each with a minimal padding amount then rounded up to the nearest page size? Combining them into one allocation seems like a needlessly awkward way to save 4k.

If you do want to use a single allocation, you can create a structure for it:

struct bnxt_port_stats {
        struct rx_port_stats rx;
        char rx_padding[512];
        struct tx_port_stats tx;
        char tx_padding[512];
};

And avoid most of the weird pointer math and casting, leaving just req.tx_stat_host_addr = req.rx_stat_host_addr + offsetof(struct bnxt_port_stats, tx);
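The key property of this layout can be checked with offsetof(). A userspace sketch with stand-in structs (the real rx_port_stats/tx_port_stats live in bnxt_hsi.h):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-ins for the real firmware stats structs. */
struct rx_port_stats_demo { uint64_t counters[8]; };
struct tx_port_stats_demo { uint64_t counters[8]; };

/* Combined allocation: each stats block is followed by 512 bytes of
 * padding so counters added by newer firmware cannot corrupt the
 * neighbouring block. */
struct bnxt_port_stats_demo {
    struct rx_port_stats_demo rx;
    char rx_padding[512];
    struct tx_port_stats_demo tx;
    char tx_padding[512];
};
```

The tx DMA address then falls out of the layout: tx_stat_host_addr = rx_stat_host_addr + offsetof(struct bnxt_port_stats_demo, tx), with no casts or manual byte arithmetic.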

shurd requested changes to this revision.Aug 31 2017, 6:42 PM
This revision now requires changes to proceed.Aug 31 2017, 6:42 PM

Combining them into one allocation seems like a needlessly awkward way to save 4k.

<Chenna> I agree with this, separated the allocations.

But you likely want to wrap the sysctl creation and polling in an

if (BNXT_PF(softc)) {
}
block.

<Chenna> Taken care!!

You can use ticks and hz to do so. You'll likely want to track the
expiration yourself (expire = ticks + hz). Do not use jiffies.

<Chenna> Taken care!!

Corrected patch by removing debug code.

shurd requested changes to this revision.Sep 6 2017, 5:46 PM
shurd added inline comments.
sys/dev/bnxt/if_bnxt.c
1523 ↗(On Diff #32723)

How about:

if (ticks_now - softc->admin_ticks >= hz) {

Get rid of the macro and simplify the test.
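One reason the subtraction form is preferable: it keeps working when the tick counter wraps around, which a direct "now >= expire" comparison does not. A userspace sketch (unsigned arithmetic is used here so the wraparound is well-defined; in the kernel, `ticks` is an int and the same idiom is conventional):

```c
#include <assert.h>
#include <limits.h>

#define DEMO_HZ 1000u   /* stand-in for the kernel's hz */

/* Returns nonzero if at least one DEMO_HZ interval has elapsed since
 * 'then', even if the counter has wrapped in between. */
static int
interval_elapsed(unsigned int now, unsigned int then)
{
    return now - then >= DEMO_HZ;   /* difference wraps modulo 2^32 */
}
```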

This revision now requires changes to proceed.Sep 6 2017, 5:46 PM

sys/dev/bnxt/if_bnxt.c
1523 ↗(On Diff #32723)

Thanks, I'll do that.

bhargava.marreddy_broadcom.com marked an inline comment as done.Sep 7 2017, 6:49 AM

Taken care of all review comments.

shurd accepted this revision.Sep 7 2017, 5:40 PM

Looks good to me

Thanks a lot, Stephen.

Sean Bruno,

If there are no further review comments, can you please commit this change?
This revision is now accepted and ready to land.Sep 8 2017, 4:41 PM
This revision was automatically updated to reflect the committed changes.
shurd added a comment.Sep 8 2017, 6:04 PM

You may want to consider adding a .reset sysctl as well to allow users to reset the port stats to zero.
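One way a .reset sysctl could work without any firmware support (a hedged sketch with hypothetical names, not the bnxt implementation): snapshot the raw hardware counter at reset time and report the difference afterwards, since the port counters themselves keep running.

```c
#include <assert.h>
#include <stdint.h>

/* Software baseline for one counter: reset captures the current raw
 * value; reads report raw minus baseline. */
struct stat_view {
    uint64_t baseline;   /* raw counter value captured at last reset */
};

static void
stat_reset(struct stat_view *v, uint64_t raw_now)
{
    v->baseline = raw_now;
}

static uint64_t
stat_report(const struct stat_view *v, uint64_t raw_now)
{
    return raw_now - v->baseline;
}
```

In the driver, the .reset handler would walk all cached counters and capture baselines under the same mutex that protects the stats buffer.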