- Choose correct Firmware options for HW LRO for best performance
- Delete TBD and other comments which are not required.
- Added sysctl interface to enable / disable / modify different factors of HW LRO.
- Disabled HW LRO by default to avoid issues with .packet forwarding'
Details
Ran several performance tests.
Diff Detail
- Lint
Lint Skipped - Unit
Tests Skipped
Event Timeline
if_bnxt.c | ||
---|---|---|
1022 ↗ | (On Diff #32637) | Since you're doing changes here, please add a sysctl to disable hardware LRO and enable software LRO (default to HW is fine). This will allow people to use software LRO if it fits their use cases better, as well as verify hardware LRO performance vs software in their own benchmarks. |
It would also be good if you make the various values tunable and just have these as defaults rather than hard-coded values. Network patterns vary wildly between sites, it's highly unlikely that a single set of hard-coded values provides "best performance" for all users.
Thank you, Stephen.
Please add a sysctl to disable hardware LRO and enable software LRO
<Chenna>
- In FreeBSD, Do we have separate IFCAP_xxx for HW LRO and SW LRO?
- “ifconfig bnxt0 –lro” is intended for SW LRO, I’m I correct?
- Capability IFCAP_LRO represents SW LRO, I’m I correct?
- In sysctl context, Is it suggestible to just invoke hwrm_vnic_tpa_cfg () (or) to do bnxt_stop() followed by bnxt_init() with updated flags?
It would also be good if you make the various values tunable
and just have these as defaults rather than hard-coded values.
<Chenna>
Sure, there are Two modes of TPA (Hw Lro)
A. TPA according to GRO rules.
B. TPA according to RSC rules
By default we always configure it in RSC mode.
I can make them as configurable on fly using sysctl.
No. AFAIK, nobody has really made a strong case backed up by a wide range of benchmarks for hardware LRO being as good as software LRO across a range of common use cases. Generally speaking, aside from some hardware's ACK aggregation, hardware LRO has always proven to be fairly bounded by artificial undocumented limits so software LRO is almost universally used by default.
- “ifconfig bnxt0 –lro” is intended for SW LRO, I’m I correct?
It's not really intended for either. It's mostly to disable/enable LRO in general. That said, most users understand it to control software LRO for the reasons mentioned in #1.
- Capability IFCAP_LRO represents SW LRO, I’m I correct?
As above, it's generally understood that way, but that's not a strict requirement of that capability. It's basically a driver capability, not a hardware one.
- In sysctl context, Is it suggestible to just invoke hwrm_vnic_tpa_cfg () (or) to do bnxt_stop() followed by bnxt_init() with updated flags?
You should be able to just invoke hwrm_vnic_tpa_cfg() unless there's some firmware requirement to do more. You would need to consult with the firmware team regarding that. From a driver perspective, if hwrm_vnic_tpa_cfg() works on an active queue, that would be preferable.
It would also be good if you make the various values tunable
and just have these as defaults rather than hard-coded values.<Chenna>
Sure, there are Two modes of TPA (Hw Lro)
A. TPA according to GRO rules.
B. TPA according to RSC rules
By default we always configure it in RSC mode.
I can make them as configurable on fly using sysctl.
I was mostly referring to the max_agg_segs, max_aggs, and min_agg_len values that are currently hardcoded for a specific network load profile. The flags look like something that a user may want to configure as well though. I don't think the FreeBSD stack can deal well with GRO aggregation at this time, so until someone wants to do that work, I don't think the rule set is worth making tunable.
I honestly think the best default would be to have lro/-lro toggle software LRO by default as it does with other drivers, and have an option to make it control hardware LRO (ie: make the default be software LRO). If you actually have reproducible numbers showing that hardware LRO is usually better across common load profiles, and are willing to help people benchmark/tune their systems, I don't have a strong objection to hardware LRO that works being the default.
If it's a matter of a required feature for $work, defaulting to software LRO with a sysctl to enable hardware LRO would be the easiest thing to do. If it's that the bnxt hardware LRO is actually better for a majority of use cases (including thousands of active flows), making a case for hardware LRO would be best. The latter would involve making the benchmarks public, and helping others reproduce the results, but it's the sort of feature that users want.
To boil it down further, making software LRO the default with a sysctl to enable hardware LRO would be accepted without much question. Keeping hardware LRO the default would require a fair amount of work to defend the decision. As it currently sits, software LRO is completely unavailable, and only hardware LRO can be used... which is not really acceptable long term.
Stephen,
Thanks a lot for such detailed explanation, It really helped me for better understanding.
- FreeBSD Stack does the SW LRO (but not bnxt driver), Right? Now, I'm trying to understand the switch which enables/disables SW LRO. Is it IFCAP_LRO alone or different?
- If I create new sysctl node called HW_LRO, should I add checks like HW LRO and SW LRO should not co-exist?
I was mostly referring to the max_agg_segs, max_aggs, and min_agg_len
<Chenna>
Since we have not tested all possible combinations of different options, we are not ready yet to give such provision. Can we defer it?
Thanks,
Chenna.
Right. Actually, when using iflib, software LRO is handled by iflib, not through the driver using tcp_lro_rx()/tcp_lro_flush_all(). I had forgotten about this. So right now, the driver will either use *both* software and hardware LRO together or no LRO at all. Having an independent sysctl to toggle hardware LRO would be great. Since iflib provides software LRO based on the IFCAP_LRO capability, making the HW LRO sysctl separate allows all combinations to be selected:
Capability Sysctl HW+SW On On HW Off On SW On Off None Off Off
If you make SW the default, you can postpone collecting/publishing benchmark data until you have time, and change the default once you have the data to share.
- If I create new sysctl node called HW_LRO, should I add checks like HW LRO and SW LRO should not co-exist?
No, let the user decide. They obviously work together or your tests wouldn't have worked. :-)
I was mostly referring to the max_agg_segs, max_aggs, and min_agg_len
<Chenna>
Since we have not tested all possible combinations of different options, we are not ready yet to give such provision. Can we defer it?
You only need to test the defaults, not all possible combinations. If a users changes from the defaults and things stop working, that's the users fault. It would be nice to clamp the values to ones the firmware allows, but even that is not required.
If the default is SW LRO with an option to enable HW LRO, you can defer adding tunables for HW LRO. If the default is HW+SW LRO as at present however, the hardware LRO settings should be tunables.
when using iflib, software LRO is handled by iflib, not through the driver using tcp_lro_rx()/tcp_lro_flush_all().
<Chenna> Thanks for the info.
the driver will either use *both* software and hardware LRO together or no LRO at all.
<Chenna> That is exactly what I was thinking. Thanks, We are on same page now.
Having an independent sysctl to toggle hardware LRO would be great. Since
<Chenna> Sure!!
No, let the user decide. They obviously work together or your tests wouldn't have worked. :-)
<Chenna> Sure !!
I was mostly referring to the max_agg_segs, max_aggs, and min_agg_len
You only need to test the defaults, not all possible combinations. …
<Chenna>
FYI – These are all the HW LRO properties and possible values for each, There are 5 inputs for that HWRM.
A. req.flags This is Bit vector of different flags like i. FLAGS_TPA ii. FLAGS_ENCAP_TPA iii. FLAGS_RSC_WND_UPDATE iv. FLAGS_GRO v. FLAGS_AGG_WITH_ECN vi. FLAGS_AGG_WITH_SAME_GRE_SEQ vii. FLAGS_GRO_IPID_CHECK viii. FLAGS_GRO_TTL_CHECK B. req.enables This is Bit vector of different options i. ENABLES_MAX_AGG_SEGS ii. ENABLES_MAX_AGGS iii. ENABLES_MAX_AGG_TIMER iv. ENABLES_MIN_AGG_LEN C. req.max_agg_segs i. MAX_AGG_SEGS_1 ii. MAX_AGG_SEGS_2 iii. MAX_AGG_SEGS_4 iv. MAX_AGG_SEGS_8 v. MAX_AGG_SEGS_MAX D. req.max_aggs i. MAX_AGGS_1 ii. MAX_AGGS_2 iii. MAX_AGGS_4 iv. MAX_AGGS_8 v. MAX_AGGS_16 vi. MAX_AGGS_MAX E. req.min_agg_len Currently driver setting it as 512
Question #1 : Looks like other OS'es also not supporting this kind of tuning for HW LRO. Is it suggestible to publish Vendor specific (HWRM requests) settings?
Question #2 : Are you suggesting only for max_agg_segs, max_aggs, and min_agg_len? Or even flags / enables also?
Question #3 : Is there a way In sysctl to print some help text like what all are the possible options for given property?
Question #4: Can you please suggest in implementing sysctl ..
Can I do like ..
dev.bnxt.0.hw_lro_enable dev.bnxt.0.hw_lro_max_agg_segs dev.bnxt.0.hw_lro_max_aggs dev.bnxt.0.hw_lro_min_agg_len
(or)
dev.bnxt.0.hw_lro.enable dev.bnxt.0.hw_lro.max_agg_segs dev.bnxt.0.hw_lro.max_aggs dev.bnxt.0.hw_lro.min_agg_len
???
This one isn't obviously useful without some additional knobs and work. It's very likely that a single value will be right for everyone. Only the encap and window update flags look like good targets for a tunable. I would say don't bother making this tunable.
B. req.enables This is Bit vector of different options i. ENABLES_MAX_AGG_SEGS ii. ENABLES_MAX_AGGS iii. ENABLES_MAX_AGG_TIMER iv. ENABLES_MIN_AGG_LEN
This shouldn't be a tunable, turning any of these off would just mean that we're using an unknown value.
C. req.max_agg_segs i. MAX_AGG_SEGS_1 ii. MAX_AGG_SEGS_2 iii. MAX_AGG_SEGS_4 iv. MAX_AGG_SEGS_8 v. MAX_AGG_SEGS_MAX
This looks like a great target for a tunable. Making this tunable would allow performance tweaking by the user for their load profile.
D. req.max_aggs i. MAX_AGGS_1 ii. MAX_AGGS_2 iii. MAX_AGGS_4 iv. MAX_AGGS_8 v. MAX_AGGS_16 vi. MAX_AGGS_MAX
This also looks like a great tunable.
E. req.min_agg_len Currently driver setting it as 512
This one looks like the best tunable to play with. If you want ACK aggregation for example (and a lot of people do), this is the first thing you would need to change. Making it so people can perform that experiment without needing to rebuild the kernel would be great.
Question #1 : Looks like other OS'es also not supporting this kind of tuning for HW LRO. Is it suggestible to publish Vendor specific (HWRM requests) settings?
Yes. That's the whole idea of sysctls. If your device does something other devices don't do, and you want it to be easily tunable, a sysctl is just the tool for the job. A rich set of well documented tunables can be a big competitive advantage. They also provide guidance to the iflib developers, if a large number of iflib drivers have a tunable that does a specific thing, that can be moved to a standard sysctl at the iflib layer.
Question #2 : Are you suggesting only for max_agg_segs, max_aggs, and min_agg_len? Or even flags / enables also?
It looks like the first option (only for max_agg_segs, max_aggs, and min_agg_len) is best.
Question #3 : Is there a way In sysctl to print some help text like what all are the possible options for given property?
Each sysctl has a description field. If the options are short enough, they can be listed in there (sysctl -d shows them). Here's an example:
> sysctl -d dev.uart.0.pps_mode dev.uart.0.pps_mode: pulse mode: 0/1/2=disabled/CTS/DCD; add 0x10 to invert, 0x20 for narrow pulse
sysctls should also be documented in the man page which is located in share/man/man4/bnxt.4 ("man bnxt" to read the current one).
Question #4: Can you please suggest in implementing sysctl ..
Can I do like ..dev.bnxt.0.hw_lro_enable dev.bnxt.0.hw_lro_max_agg_segs dev.bnxt.0.hw_lro_max_aggs dev.bnxt.0.hw_lro_min_agg_len(or)
dev.bnxt.0.hw_lro.enable dev.bnxt.0.hw_lro.max_agg_segs dev.bnxt.0.hw_lro.max_aggs dev.bnxt.0.hw_lro.min_agg_len
The second option looks a lot nicer and allows checking the entire config using "sysctl dev.bnxt.0.hw_lro" the same way that "sysctl dev.bnxt.0.nvram" works now.
Thank you, Stephen.
Can you please review the final patch,
Snip ==>
#####sysctl -d dev.bnxt.0.hw_lro dev.bnxt.0.hw_lro: hardware lro dev.bnxt.0.hw_lro.min_agg_len: Min Agg Len: 1 to 9000 dev.bnxt.0.hw_lro.max_aggs: Set Max Aggs Value: 0 / 1 / 2 / 3 / 4 / 7 dev.bnxt.0.hw_lro.max_agg_segs: Set Max Agg Seg Value: 0 / 1 / 2 / 3 / 31 dev.bnxt.0.hw_lro.hw_lro_mode_gro: Set mode: 1 = GRO mode, 0 = RSC mode dev.bnxt.0.hw_lro.hw_lro_enable: Enable or Disable HW LRO: 0 / 1
Stephen,
Please find my few inline explanations which may helps in the review.
Thanks,
Chenna.
sys/dev/bnxt/bnxt_hwrm.c | ||
---|---|---|
951 | Stephen, This is just to avoid junk inputs from user. In case of any non-zero user inputs, we take it as 1. | |
954 | Same as above, to avoid junk inputs from user. | |
982 | Stephen, I assumed that based on max MTU which is 9000. | |
sys/dev/bnxt/bnxt_sysctl.c | ||
802 | Stephen, This is just to avoid code duplication and simplify the patch and ease of review. | |
sys/dev/bnxt/if_bnxt.c | ||
793 | Stephen, |
Another problem with defaulting to HW LRO enabled is packet forwarding. LRO should generally not occur when forwarding is being used. You would need to dig around in the vnet code to figure out if it's even possible to get a callback when V_ipforwarding and V_ip6_forwarding changes and if so, update the settings based on that.
If you default to off however, you can simply mention in the man page that enabling hardware LRO could cause issues when forwarding, and leave it up to the user to Do The Right Thing.
Also, don't forget to update the man page (though it can/should be a separate commit since you'll get a docs review on it).
sys/dev/bnxt/bnxt_hwrm.c | ||
---|---|---|
982 | Should use BNXT_MAX_MTU then. | |
sys/dev/bnxt/if_bnxt.c | ||
788 | Since you're defaulting to enabled, can you share your benchmark configuration and numbers showing that the default is better? Alternatively, default to off and don't worry about collecting/sharing the information. |
Stephen,
I need some more time to get back on 'default hw_lro options' / ip_forwarding.
meanwhile, I'm seeing crash (in iflib code) when change HW_LRO settings on fly while I/O is in progress.
But things are fine when I stop I/O, change HW_LRO settings, then restart traffic.
"ifconfig bnxt0 lro / -lro" command invokes bnxt_stop() & bnxt_init(), is it acceptable to do same from sysctl context?
Can you please share your thoughts to fix this problem?
Thanks,
Chenna.
No rush from this side.
meanwhile, I'm seeing crash (in iflib code) when change HW_LRO settings on fly while I/O is in progress.
But things are fine when I stop I/O, change HW_LRO settings, then restart traffic."ifconfig bnxt0 lro / -lro" command invokes bnxt_stop() & bnxt_init(), is it acceptable to do same from sysctl context?
Can you please share your thoughts to fix this problem?
Not really. It's easiest/best to make it only modifiable when !(if_getdrvflags() & IFF_DRV_RUNNING)
Not really. It's easiest/best to make it only modifiable when !(if_getdrvflags() & IFF_DRV_RUNNING)
<Chenna> Taken care!!
Should use BNXT_MAX_MTU then.
<Chenna> Taken care!!
To keep align with our other OS's drivers, till IP_Forward case being handled, we would like to document it and enable HW LRO by default. Can you please share your thought on this?
sys/dev/bnxt/bnxt_hwrm.c | ||
---|---|---|
982 | Taken care!! |
This is not really possible. Incorrect networking behaviour by default is not acceptable.
Thanks Stephen,
Can you please review latest patch with following changes.
- Disabled HW LRO by default to avoid issues with 'packet forwarding'
- Renamed one of sysctl node
Also, can you please help me in updating the document section? Can I post one more in https://reviews.freebsd.org for doc update?
Looks good, though the other one should be renamed as well.
For a doc change, just submit a review as normal for the changes to share/man/man4/bnxt.4, and Herald should add the docs team automatically. Add sbruno and myself so we can approve the technical side.
sys/dev/bnxt/bnxt_sysctl.c | ||
---|---|---|
78 | Should this one be changed as well? Maybe just "gro_mode"? |
Looks good, though the other one should be renamed as well.
<Chenna> Taken care!!
For a doc change, just submit a review as normal for the changes to
share/man/man4/bnxt.4, and Herald should add the docs team automatically.
Add sbruno and myself so we can approve the technical side.
<Chenna> Thanks!! For the doc change, should I wait till this patch goes in?
<Chenna> Thanks!! For the doc change, should I wait till this patch goes in?
You can start the review before the patch is committed. Just make sure you don't include this patch in the review.
Hi Stephen,
Can you please help me in understanding the exact issue when IP_Forwarding and LRO coexist.
LRO should generally not occur when forwarding is being used.
<Chenna>
- I was under impression that, in case of forwarding, NIC should not modify pkts. Is that correct statement?
- Different documents (in web) mentioned different issues like it (forwarding along with LRO) can cause ‘Kernel Crash’ / ‘less performance’ / ‘Checksum errors’ / etc.. Which one is correct?
Correct. Forwarded packets should not be modified.
- Different documents (in web) mentioned different issues like it (forwarding along with LRO) can cause ‘Kernel Crash’ / ‘less performance’ / ‘Checksum errors’ / etc.. Which one is correct?
The only technical reason to disable LRO when forwarding is because we must not modify the packet stream unless we are the ultimate consumer. Crashes, performance regressions, and checksum errors are all presumably fixable, so are not by themselves a blanket reason to disable LRO when forwarding, they are simply arguments that it may not be worth doing.
Hi Stephen,
Thanks a lot for the clarification. I'll discuss with our firmware team and get back if required further information.
Thanks,
Chenna.