I've been pondering this, and it seems pretty broken in the tree, and what I have suggested doesn't make sense.
Sun, Nov 24
Add clarification suggested by @markj
Thu, Nov 21
@tuexen just a note that the X552 also uses the same middle-segment mask value as the 82599
Sun, Nov 10
In D47336#1079637, @fbsd_opal.com wrote:
> Glad to see this is back under consideration. It will be useful for cases where it is needed.
> When D34449 was reverted, we pulled a new CAT6 cable where we had had this problem, and therefore eliminated our need for this patch.
> I am actually at the location in question at the moment, for a few days. I will try to set up a test of this revised patch this weekend.
Oct 30 2024
@franco_opnsense.org can you please see if you can find some users who had the colo setups to test this?
Oct 27 2024
Yes, drm-kmod builds fine now; it is just the nvidia-drm kmod that is still failing, probably because it uses some more features, as you can see.
In D47290#1078609, @junchoon_dec.sakura.ne.jp wrote:
> In D47290#1078596, @kbowling wrote:
>> @ashafer_badland.io @manu it seems like there might still be some remaining in drm-kmod; this is still failing for me and @junchoon_dec.sakura.ne.jp
> @kbowling, does it still fail with rev5 of the patch at Bug 282312?
> It includes a workaround in graphics/nvidia-drm-kmod/Makefile.common (with one variable changed to fit), obtained from the freebsd-current ML, from Benjamin Jacobs. And it doesn't fail on stable/14 (still clang18), so something that changed in clang19 on main (a new default option, maybe) is causing the issue.
> https://lists.freebsd.org/archives/freebsd-current/2024-October/006558.html
Committed in the above PR
Oct 26 2024
In D32531#1078497, @concussious.bugzilla_runbox.com wrote:
> This page is already mlinked to ix, which I think is correct; however, the sysctls are a great addition if they're real (I don't have the HW). Can we get this section rebased onto the current manpage? Kevin, does this look okay to you?
This change looks logically correct to me and matches e1000.
It looks like @manu got this in 2716dbb157611eaae6e578d86202d86910026562
Oct 22 2024
In D4295#1077477, @gallatin wrote:
> Why is this re-surfacing?
I think the ratelimit code has a better solution, since TCP can query how full the nic queue is and avoid ENOBUFs entirely by avoiding sending on a full queue. The problem with doing this in general is that, outside of ratelimit, the TCP stack has no way to determine what NIC queue a packet will be sent on.
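A minimal sketch of the queue-occupancy idea described here, with hypothetical names (nic_txq, txq_has_room) rather than any real FreeBSD ratelimit API — the point is only that a stack that knows its NIC queue can defer instead of taking ENOBUFS:

```c
#include <stdbool.h>

struct nic_txq {
	int size;	/* total descriptors in the ring */
	int in_use;	/* descriptors currently outstanding */
};

static bool
txq_has_room(const struct nic_txq *q, int ndesc)
{
	return (q->size - q->in_use >= ndesc);
}

static int
pacer_try_send(struct nic_txq *q, int ndesc)
{
	if (!txq_has_room(q, ndesc))
		return (-1);	/* back off and retry later, no ENOBUFS */
	q->in_use += ndesc;
	/* ... fill descriptors and ring the doorbell ... */
	return (0);
}
```

Outside the ratelimit path this falls apart exactly as described above: without a fixed flow-to-queue mapping, the stack cannot know which ring to interrogate.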
Oct 21 2024
Seems reasonable to me. If there is any work that depends on or interacts with these changes, it would be nice to see it linked/referenced so we can look at that too, but I can't foresee any problems.
Oct 20 2024
In D30155#1075233, @cc wrote:
> From my test results for D30155, I didn't find any significant improvement:
> - no significant difference in ping latency
> - no significant iperf3 performance improvement, due to poor performance (3.x Gbps) on FreeBSD 15-CURRENT vs. 9.x Gbps on a stock Linux kernel 5.15.
Oct 16 2024
In D30155#1075043, @gallatin wrote:
> The A/B results were not surprising (boring, as David likes to say). Just slightly higher CPU on the canary (due to the increased irq rate), but no clear streaming-quality changes.
> All in all, it seems to work and does no real harm, but we'll not use it due to the increased CPU.
In D30155#1074682, @gallatin wrote:
> In D30155#1074005, @kbowling wrote:
>> In D30155#1073987, @gallatin wrote:
>>> In D30155#1073639, @kbowling wrote:
>>>> @imp @gallatin if you are able to test your workload, setting this to 1 and 2 would be new behavior versus where you are currently:
>>> I can pull this into our tree and make an image for @dhw to run on the A/B cluster. However, we're not using this hardware very much anymore, and there is only 1 pair of machines using it in the A/B cluster. Lmk if you're still interested, and I'll try to build the image tomorrow so that David can test it at his leisure.
>> Sure, it sounds like that is only enough for one experiment, so I would focus on the default algorithm the patch will boot with: sysctl dev.ix.<N>.enable_aim=1
> It's running now. Eyeballing command-line utilities, the CPU is about 5% higher (27% -> 32%) and we have 2x the irq rate (110k vs 55k irq/sec).
> When applying this, I wanted to give it a fair shake, and disabled this tunable: hw.ix.max_interrupt_rate=4000. Perhaps that was a mistake? Is there a runtime way to tweak the algorithm so it doesn't interrupt so fast under this level of load?
Oct 15 2024
In D30155#1074152, @cc wrote:
> In D30155#1073639, @kbowling wrote:
>> Ok, this is a bit messy code- and comment-wise, but I have the new algorithm working in what I believe to be the correct way, with some bug fixes versus the original, and I would like some data to see how to proceed before tidying everything up.
>> @cc it looks like emulab has ix(4) on d820s nodes; would you be willing to take a look at these 3 options, similar to the e1000 test?
>> - Default in HEAD/STABLE: sysctl dev.ix.<N>.enable_aim=0
>> - New algorithm (on by default with this patch): sysctl dev.ix.<N>.enable_aim=1
>> - Old algorithm (FreeBSD <10): sysctl dev.ix.<N>.enable_aim=2
> OK. Thanks for letting me know about this patch. I will test it on d820s nodes in Emulab.
> One question: why do you want to test it for FreeBSD releases < 10? Can I test it only on FreeBSD 15 (CURRENT)?
Oct 14 2024
Ok, this is a bit messy code- and comment-wise, but I have the new algorithm working in what I believe to be the correct way, with some bug fixes versus the original, and I would like some data to see how to proceed before tidying everything up.
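For context, a rough illustration of what an AIM-style algorithm does; the 4k/8k/20k rates mirror the figures discussed in this thread, but the thresholds and the register handling in the actual patch differ:

```c
#include <stdint.h>

/*
 * Illustrative AIM-style rate selection: sample per-queue byte and
 * packet counts over the last interrupt window, then pick the next
 * interrupt rate from the average packet size.  Thresholds are made
 * up for the example.
 */
static unsigned int
aim_pick_rate(uint64_t bytes, uint64_t packets)
{
	uint64_t avg;

	if (packets == 0)
		return (20000);	/* idle: stay ready for low latency */
	avg = bytes / packets;
	if (avg >= 1200)
		return (4000);	/* bulk transfers: batch aggressively */
	if (avg >= 300)
		return (8000);	/* mixed traffic */
	return (20000);		/* small packets: favor latency */
}
```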
Oct 12 2024
@stallamr_netapp.com are you still able to work on this? Netgate has graciously sponsored this work to help me get it over the finish line; I just landed default-on and much-improved AIM for e1000 and igc.
Oct 10 2024
Remove unnecessary ixl(4) algorithm. Add Intel copyright.
Oct 2 2024
In D46768#1069038, @cc wrote:
> In D46768#1069027, @kbowling wrote:
>> In D46768#1069015, @cc wrote:
>>> In D46768#1067607, @kbowling wrote:
>>>> @cc this code works well in my testing. There are now some quality-of-life improvements: at runtime you can now switch in the middle of a test. I run a tmux session with three splits, one with systat -vmstat, one with the benchmark (iperf3 or whatever), and one to toggle sysctl dev.{em,igb}.<interface number>.enable_aim=<N>, where <N> is described below. You can also do something like sysctl dev.igb.0 | grep _rate to see the current queue values.
>>>> Existing static 8000 int/s behavior (how the driver is in main):
>>>> sysctl dev.igb.0.enable_aim=0
>>>> Suggested new default; you will boot in this mode with this patch:
>>>> sysctl dev.igb.0.enable_aim=1
>>>> Low-latency option of the above algorithm (up to 70k ints/s):
>>>> sysctl dev.igb.0.enable_aim=2
>>>> ixl(4) algorithm bodged in, which would need to be cleaned up:
>>>> sysctl dev.igb.0.enable_aim=3
>>>> I would be curious to know what you find with these different options across an array of tests, and I will use the results to ready this for actual use.
>>> I didn't find any rate change from the sysctl. Please let me know if the hardware does not support this new change.
>>> root@s1:~ # sysctl dev.em.2.enable_aim=0
>>> dev.em.2.enable_aim: 0 -> 0
>>> root@s1:~ # sysctl dev.em.2 | grep _rate
>>> dev.em.2.queue_rx_0.interrupt_rate: 20032
>>> dev.em.2.queue_tx_0.interrupt_rate: 20032
>>> root@s1:~ # sysctl dev.em.2.enable_aim=1
>>> dev.em.2.enable_aim: 0 -> 1
>>> root@s1:~ # sysctl dev.em.2 | grep _rate
>>> dev.em.2.queue_rx_0.interrupt_rate: 20032
>>> dev.em.2.queue_tx_0.interrupt_rate: 20032
>>> root@s1:~ # sysctl dev.em.2.enable_aim=2
>>> dev.em.2.enable_aim: 1 -> 2
>>> root@s1:~ # sysctl dev.em.2 | grep _rate
>>> dev.em.2.queue_rx_0.interrupt_rate: 20032
>>> dev.em.2.queue_tx_0.interrupt_rate: 20032
>>> root@s1:~ #
>> This looks to me like it is working: the algorithm is dynamic, and 20k would be the latency-reducing idle value. At enable_aim=0 you would see 8000. 20k looks right for an idle queue; what happens if you place a bulk load through it, like iperf3? It should drop down to 4k.
> I see. During the iperf traffic, I see an interrupt rate of 8k at enable_aim=0 and 4k at enable_aim=1 or 2, dynamically. I see idle rates of 20k at enable_aim=1 and 71k at enable_aim=2. However, none of enable_aim=0, 1, or 2 helps improve the iperf performance (570 Mbits/sec out of the 1 Gbps line rate) with the siftr module loaded.
Oct 1 2024
In D12142#1068530, @krzysztof.galazka_intel.com wrote:
> In D12142#1066726, @kbowling wrote:
>> @krzysztof.galazka_intel.com can you rebase this on main?
> Did autocompletion trick you, or am I really the target of your message? I'm not sure @kmacy will be happy with me messing with his patch.
Sep 28 2024
@cc this code works well in my testing. There are now some quality-of-life improvements: at runtime you can now switch in the middle of a test. I run a tmux session with three splits, one with systat -vmstat, one with the benchmark (iperf3 or whatever), and one to toggle sysctl dev.{em,igb}.<interface number>.enable_aim=<N>, with the meaning of <N> described in what follows. You can also do something like sysctl dev.igb.0 | grep _rate to see the current queue values.
Sep 27 2024
Rebase on main, with some small improvements and bug fixes. Upon more testing, the reimported algorithm is tuned for igb and is less governed than intended on lem/em, due to a different unit of measure in the ITR register. I need to think a little about how I would like to handle that.
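The unit problem is easy to picture: converting a target interrupt rate into a register value depends on the tick size of the ITR field, so a constant tuned for one family over- or under-moderates another. A generic sketch, with placeholder units rather than datasheet values:

```c
#include <stdint.h>

/*
 * Convert a target interrupt rate into an ITR register value for a
 * given tick size.  unit_ns is per-family; the values used in the
 * example comment below are placeholders, not datasheet figures.
 */
static uint32_t
itr_from_rate(uint32_t ints_per_sec, uint32_t unit_ns)
{
	/* interval (ns) = 1e9 / rate; the register counts unit_ns ticks */
	return (1000000000U / ints_per_sec / unit_ns);
}

/*
 * e.g. itr_from_rate(8000, 256) vs. itr_from_rate(8000, 1024) differ
 * by 4x: programming one family's constant on the other misses the
 * intended moderation by the same factor.
 */
```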
Sep 26 2024
In D44258#1067031, @tuexen wrote:
> In D44258#1066453, @kbowling wrote:
>> In D44258#1066450, @tuexen wrote:
>>> In D44258#1066441, @kbowling wrote:
>>>> Are you sure this can't be dealt with dynamically in ix_txrx? Admittedly I have no reason to spend a lot of time digging into this, but my intuition is you can stuff the offsets into if_pkt_info_t in iflib.c and make the right decisions when constructing the TSO packet descriptor.
>>> Are you saying you would write the register with every TSO packet? I have no experience, but I thought this might be too expensive. Right now, the NIC is using one behavior for all TSO packets for all connections...
>> No, I was speculating maybe you can accomplish whatever header change is desired in the TSO descriptor rather than globally altering those registers. But I do not know.
>> I believe by default whatever is in the pseudo-header is copied to all segments.
>> What are we trying to accomplish.. is a default mask ruining the ECN flag, or are you trying to drop/alter the ECN flag in the middle or last segments?
> Here is what I'm trying to accomplish:
> The TCP header contains 12 flags (see IANA). The Intel NIC uses three 12-bit masks to compute which flags are copied to the first, the middle, and the last segment when performing TSO. When doing ECN as specified in RFC 3168, the masks should be:
> mask.first  0xFF6
> mask.middle 0xF76
> mask.last   0xF7F
> This means that the FIN and PSH flags appear only in the last segment and the CWR flag appears only in the first segment. Several Intel NICs behave like this.
> However,
> ix0: <Intel(R) X520 82599ES (SFI/SFP+)> mem 0x6004000000000-0x600400007ffff,0x6004000100000-0x6004000103fff irq 1040376 at device 0.0 numa-domain 0 on pci3
> ix0: Using 2048 TX descriptors and 2048 RX descriptors
> ix0: Using 16 RX queues 16 TX queues
> ix0: Using MSI-X interrupts with 17 vectors
> ix0: allocated for 16 queues
> ix0: allocated for 16 rx queues
> ix0: Ethernet address: 90:e2:ba:f7:48:74
> ix0: PCI Express Bus: Speed 5.0GT/s Width x8
> ix0: eTrack 0x000161c1
> reports
> dev.ix.0.tso_tcp_flags_mask_first_segment: 0x00000ff6
> dev.ix.0.tso_tcp_flags_mask_middle_segment: 0x00000ff6
> dev.ix.0.tso_tcp_flags_mask_last_segment: 0x00000f7f
> This means that the PSH and FIN flags correctly appear only in the last segment, but the CWR flag appears not only in the first segment but also in all middle segments. Using this patch, this could be fixed.
> When doing Accurate ECN, as currently being specified in the Accurate ECN draft, the masks should be:
> mask.first  0xFF6
> mask.middle 0xFF6
> mask.last   0xFFF
> The proposed patch would allow a NIC to be configured to do TSO for either classical ECN or Accurate ECN.
> Does this description make the motivation clear?
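To make the mask arithmetic concrete, here is a small worked example using the standard TCP flag bit values (FIN=0x001, PSH=0x008, ACK=0x010, CWR=0x080) and the RFC 3168 masks quoted above:

```c
#include <stdint.h>
#include <stdio.h>

/* RFC 3168 masks from the description above. */
#define	TSO_MASK_FIRST	0xFF6	/* drop FIN|PSH, keep CWR */
#define	TSO_MASK_MIDDLE	0xF76	/* drop FIN|PSH|CWR */
#define	TSO_MASK_LAST	0xF7F	/* keep FIN|PSH, drop CWR */

int
main(void)
{
	uint16_t flags = 0x098;	/* template header: CWR|PSH|ACK */

	/* first  -> 0x090 (CWR|ACK): CWR survives only here */
	printf("first:  0x%03x\n", flags & TSO_MASK_FIRST);
	/* middle -> 0x010 (ACK): FIN, PSH, and CWR all stripped */
	printf("middle: 0x%03x\n", flags & TSO_MASK_MIDDLE);
	/* last   -> 0x018 (PSH|ACK): PSH/FIN survive only here */
	printf("last:   0x%03x\n", flags & TSO_MASK_LAST);
	return (0);
}
```

The broken X520 behavior described above corresponds to programming 0xFF6 (CWR kept) into the middle mask as well, which is why CWR leaks into every middle segment.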
Sep 25 2024
For the implementation of manipulating the registers, I think this is fine. I would still like to understand the use case before it is committed.
@krzysztof.galazka_intel.com can you rebase this on main?
For what it's worth, I think the TSO reset is due to ifnet, iflib, and individual drivers making various decisions only during attach that would better be done in helper functions called during init and SIOCSIFCAP, not some real limitation of the network stack or hardware. It would be cool to not need that reset. I left a more relevant comment in https://reviews.freebsd.org/D46186#inline-280367
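The shape of that refactor might look like the following sketch: one helper that recomputes capability-dependent state, called from both attach and the SIOCSIFCAP path. All names here are hypothetical, not existing iflib interfaces:

```c
#include <stdint.h>

/* Hypothetical per-device state; not an existing iflib structure. */
struct dev_softc {
	uint32_t capenable;	/* IFCAP_*-style bits */
	uint32_t tso_max_size;
};
#define	CAP_TSO4	0x0001	/* stand-in for IFCAP_TSO4 */

/*
 * Recompute capability-dependent limits.  Calling this from both
 * attach and the SIOCSIFCAP handler keeps the two paths in sync and
 * avoids a full interface reset when capabilities are toggled.
 */
static void
dev_sync_caps(struct dev_softc *sc)
{
	sc->tso_max_size = (sc->capenable & CAP_TSO4) ? 65535 : 0;
	/* ... push updated limits to the stack ... */
}
```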
Seems good.
In D44258#1066450, @tuexen wrote:
> In D44258#1066441, @kbowling wrote:
>> Are you sure this can't be dealt with dynamically in ix_txrx? Admittedly I have no reason to spend a lot of time digging into this, but my intuition is you can stuff the offsets into if_pkt_info_t in iflib.c and make the right decisions when constructing the TSO packet descriptor.
> Are you saying you would write the register with every TSO packet? I have no experience, but I thought this might be too expensive. Right now, the NIC is using one behavior for all TSO packets for all connections...
Are you sure this can't be dealt with dynamically in em_txrx and igb_txrx? Admittedly I have no reason to spend a lot of time digging into this but my intuition is you can stuff the offsets into if_pkt_info_t in iflib.c and make the right decisions when constructing the TSO packet descriptor.
Sep 20 2024
Thank you, this is the right fix. I will commit it now with your authorship, as I would otherwise commit the same thing now to fix the tree.
Aug 24 2023
In D41558#947392, @markj wrote:
>> setting the ifdi_needs_restart default to false will alleviate the need to churn every driver if an odd event is added in the future for specific hardware.
> Returning true by default seems like the safe default to me. Yes, unneeded reinits are annoying, but if we don't reinit when an event actually requires it, the result could be harder to debug.
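A sketch of the conservative default being argued for, assuming an iflib-style method table; the names are illustrative of the pattern, not the exact interface from D41558:

```c
#include <stdbool.h>

/* Hypothetical event type standing in for iflib's restart events. */
enum restart_event { RESTART_EVENT_VLAN_CONFIG, RESTART_EVENT_OTHER };

/*
 * Default method: claim a restart is needed for every event.  A
 * driver that knows a given event is harmless overrides this and
 * returns false; an unaware driver gets the safe (if noisy) reinit
 * rather than a silently stale configuration.
 */
static bool
default_needs_restart(void *softc, enum restart_event ev)
{
	(void)softc;
	(void)ev;
	return (true);
}
```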
Don't toggle vmxnet3
Sweep the drivers
Aug 18 2023
Makes sense to me to get these in and continue the audit/additions.
@bofh do you know if this was something you had to mark BROKEN or work around?
Aug 16 2023
What is triggering this for you?
I see the optional assignment, so I guess that part is fine. Ultimately it's not my port, but if danfe is not going to comment, I have no objections.
It is not necessary to do it like this. There is an implied version contract with this port and x11/linux-nvidia-libs as well. If it is centralized, it needs to accommodate the entire nvidia-driver situation, because otherwise this will cause build failures.
I think you should remove the distversion thing; it will break the legacy ports and seems unnecessary.
Aug 13 2023
@erj can you post the draft driver update? I'd like to see how it intends to use this patch stack before commenting.
Aug 9 2023
Aside from some research notes, there is nothing to do here.