Several users reported kernel stack exhaustion and panics on the i386 architecture when IPsec is used with complex network configurations, e.g. when netgraph nodes, pf/ipfw, and gif/gre are in use together with IPsec. The panics show very long call chains.
To reduce such chains, I added a new netisr(9) protocol, NETISR_IPSEC, and modified the IPsec code to be able to use a netisr queue for deferred IPsec processing of outbound packets.
The feature is disabled by default on arches where kstack_pages is greater than or equal to 4; that seems to be enough to work without netisr and without stack exhaustion.
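For readers who have not used netisr(9) before, here is a minimal sketch of the pattern the patch follows. The handler name and the exact field values below are illustrative (the authoritative ones are in the diff); the KPI calls themselves, netisr_register() and netisr_queue(), are the standard netisr(9) interface.

```
#include <sys/param.h>
#include <sys/mbuf.h>
#include <net/netisr.h>

static void ipsec_netisr_handler(struct mbuf *m);	/* deferred processing */

static struct netisr_handler ipsec_nh = {
	.nh_name = "ipsec",
	.nh_handler = ipsec_netisr_handler,
	.nh_proto = NETISR_IPSEC,		/* new protocol added by the patch */
	.nh_qlimit = 256,			/* default queue limit, see netstat -Q below */
	.nh_policy = NETISR_POLICY_SOURCE,	/* policy shown by netstat -Q below */
};

/* Called from IPsec initialization. */
static void
ipsec_netisr_init(void)
{
	netisr_register(&ipsec_nh);
}

/*
 * Outbound path: instead of running crypto and re-entering ip_output() on
 * the current (possibly already deep) kernel stack, hand the packet to the
 * netisr queue.  netisr_queue() frees the mbuf itself when the queue is full.
 */
static int
ipsec_defer_output(struct mbuf *m)
{
	return (netisr_queue(NETISR_IPSEC, m));
}
```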
sys/netipsec/ipsec.c:1421
> !! ?

sys/netipsec/ipsec.c:1421
> Yes, in this case it is useless. I'll remove it.
If you can wait a day or so I'd love to have Olivier, now added to the review, run one of his perf tests on this.
This seems like a sensible general change. I'm quite surprised it wasn't this way already (.. and sort of misremembered that it was -- IPSEC should always have been using the netisr...).
Here is the benchmark result on a standard server:
Impact of D9721 on IPSec performance (null encryption)
- IBM System x3550 M3 with quad cores (Intel Xeon L5630 2.13GHz, hyper-threading disabled)
- Quad port Intel 82580
- FreeBSD 12-head r314642
- harvest.mask=351
- 2000 flows of cleartext minimum-size UDP packets, multiplexed into one IPsec flow
x r314642 packet-per-seconds
+ r314642 with D9721 packet-per-seconds
    N           Min           Max        Median           Avg        Stddev
x   5        546152        550381        549167      548753.4     1564.3483
+   5      550273.5      552629.5      552169.5      551659.7     1129.4238
Difference at 95.0% confidence
        2906.3 +/- 1989.79
        0.529619% +/- 0.363866%
        (Student's t, pooled s = 1364.33)
Do you want flamegraph too?
Have you enabled net.inet.ipsec.use_netisr=1?
Since the main goal of this patch is reducing the kernel stack requirements, it is enabled by default only when KSTACK_PAGES is less than 4.
On amd64 it is not enabled by default.
You may also need to increase the netisr queue length: look at the netstat -Q output, and bump the net.inet.ipsec.intr_queue_maxlen variable if the number of drops grows.
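For illustration, a sketch of how such a tunable can be wired to the netisr queue limit. The handler name and sysctl placement here are assumptions (reusing the ipsec_nh handler from the earlier sketch), while netisr_getqlimit()/netisr_setqlimit() are the documented netisr(9) calls for reading and changing a protocol's queue limit.

```
#include <sys/param.h>
#include <sys/errno.h>
#include <sys/kernel.h>
#include <sys/sysctl.h>
#include <net/netisr.h>

/* ipsec_nh is the struct netisr_handler registered in the earlier sketch. */
extern struct netisr_handler ipsec_nh;

static int
sysctl_ipsec_intr_queue_maxlen(SYSCTL_HANDLER_ARGS)
{
	u_int qlimit;
	int error;

	netisr_getqlimit(&ipsec_nh, &qlimit);
	error = sysctl_handle_int(oidp, &qlimit, 0, req);
	if (error != 0 || req->newptr == NULL)
		return (error);
	if (qlimit < 1)
		return (EINVAL);
	/* netisr_setqlimit() rejects values above net.isr.maxqlimit. */
	return (netisr_setqlimit(&ipsec_nh, qlimit));
}
```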
With netisr enabled on this 64-bit platform, performance decreases by about 22%:
x r314642 packet-per-seconds
+ r314642 with D9721 packet-per-seconds
* r314642 with D9721 and net.inet.ipsec.use_netisr=1 packet-per-seconds
    N           Min           Max        Median           Avg        Stddev
x   5        546152        550381        549167      548753.4     1564.3483
+   5      550273.5      552629.5      552169.5      551659.7     1129.4238
Difference at 95.0% confidence
        2906.3 +/- 1989.79
        0.529619% +/- 0.363866%
        (Student's t, pooled s = 1364.33)
*   5        416795      440933.5        423271      426451.8     9207.3367
Difference at 95.0% confidence
        -122302 +/- 9631.37
        -22.2872% +/- 1.74536%
        (Student's t, pooled s = 6603.87)
Have you observed packet loss in the netstat -Q statistics? I guess it is possible at such a packet rate with the default queue length.
Yes, a lot:
[root@R2]~# netstat -Q
Configuration:
Setting                        Current        Limit
Thread count                         1            1
Default queue limit                256        10240
Dispatch policy                 direct          n/a
Threads bound to CPUs         disabled          n/a

Protocols:
Name   Proto QLimit Policy Dispatch Flags
ip         1    256   flow  default   ---
igmp       2    256 source  default   ---
rtsock     3   2048 source  default   ---
arp        4    256 source  default   ---
ether      5    256 source   direct   ---
ip6        6    256   flow  default   ---
ipsec     11    256 source  default   ---

Workstreams:
WSID CPU   Name     Len WMark   Disp'd  HDisp'd    QDrops    Queued   Handled
   0   0   ip         0     0 89370650        0         0         0  89370650
   0   0   igmp       0     0        0        0         0         0         0
   0   0   rtsock     0     2        0        0         0       114       114
   0   0   arp        0     0       24        0         0         0        24
   0   0   ether      0     0 89370674        0         0         0  89370674
   0   0   ip6        0     0        0        0         0         0         0
   0   0   ipsec      0   256        0        0 616976559 221804972 221804972
Then, following your advice, I increased net.inet.ipsec.intr_queue_maxlen to 8192, but performance is a lot worse:
x r314642 with D9721 and net.inet.ipsec.use_netisr=1 (packet-per-seconds)
+ r314642 with D9721 and net.inet.ipsec.use_netisr=1 and net.inet.ipsec.intr_queue_maxlen=8192 (packet-per-seconds)
    N           Min           Max        Median           Avg        Stddev
x   5        416795      440933.5        423271      426451.8     9207.3367
+   5        132107        432864        163939      213309.4     123845.92
Difference at 95.0% confidence
        -213142 +/- 128072
        -49.9804% +/- 29.97%
        (Student's t, pooled s = 87814)
Statistics with this 8K queue:
[root@R2]~# netstat -Q
Configuration:
Setting                        Current        Limit
Thread count                         1            1
Default queue limit                256        10240
Dispatch policy                 direct          n/a
Threads bound to CPUs         disabled          n/a

Protocols:
Name   Proto QLimit Policy Dispatch Flags
ip         1    256   flow  default   ---
igmp       2    256 source  default   ---
rtsock     3   2048 source  default   ---
arp        4    256 source  default   ---
ether      5    256 source   direct   ---
ip6        6    256   flow  default   ---
ipsec     11   8192 source  default   ---

Workstreams:
WSID CPU   Name     Len WMark   Disp'd  HDisp'd   QDrops   Queued  Handled
   0   0   ip         0     0 24726004        0        0        0 24726004
   0   0   igmp       0     0        0        0        0        0        0
   0   0   rtsock     0     2        0        0        0      114      114
   0   0   arp        0     0        5        0        0        0        5
   0   0   ether      0     0 24726009        0        0        0 24726009
   0   0   ip6        0     0        0        0        0        0        0
   0   0   ipsec   8192  8192        0        0 87097837 11805690 11789306
I have never played with netisr before, so I'm not used to tuning it.
Hiya,
Those drops are the biggest problem, and (without reading the patch too carefully) it may be that it's running only one netisr thread right now, rather than net.isr.maxthreads=<some number greater than 1>.
When you run "top -HS -s 1" during the test, is a whole CPU core being chewed up doing netisr work?
If so, try putting say, "net.isr.maxthreads=8" and reboot and re-test.
oh hm, one ipsec flow? so ok. So, can you also dump out the NIC statistics so we can see if the NIC is seeing all the traffic on one RX ring, or whether it's being load balanced between multiple RX rings.
I.e., I would like to know whether the non-netisr-ipsec method is simply getting parallelism by having the RX path run on multiple NIC RX threads (and then serialising for some kind of data access, but parallelising for, say, crypto handling), and whether the simple netisr conversion is squeezing it all through one netisr CPU thread.
Adrian, please note that this netisr queue is for outbound traffic. It queues packets that are going to be encrypted, and these packets are ordered by SPI value, so for one flow there is no parallelism. There is also no parallelism in the crypto(9) subsystem, so you won't get any benefit from many isr threads; they will all be blocked on the crypto thread.
Actually that is not true; it will run in parallel in the netisr threads.
Though two things that are problematic:
- Mbuf tag
- bouncing between CPUs by keying the flow on the SPI; most probably you should tell netisr to dispatch on the same CPU, there is some way to do that IIRC
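If that direction were pursued, the netisr(9) way to do it would be the flow policy. Here is a sketch (not part of this patch) of stamping the flow ID from the SPI so that NETISR_POLICY_FLOW keeps each SA on one netisr CPU while different SAs spread across CPUs; ipsec_set_flowid() is a made-up helper name.

```
#include <sys/param.h>
#include <sys/mbuf.h>
#include <net/netisr.h>

/*
 * With .nh_policy = NETISR_POLICY_FLOW in the netisr_handler, netisr picks
 * the destination CPU from m->m_pkthdr.flowid, so stamping the flowid from
 * the SPI before netisr_queue() keeps one SA on one CPU.
 */
static void
ipsec_set_flowid(struct mbuf *m, uint32_t spi)
{
	m->m_pkthdr.flowid = spi;
	M_HASHTYPE_SET(m, M_HASHTYPE_OPAQUE);
}
```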
Interesting. Ok. Well, the existing ipsec code for transmit just runs to completion down to ip_output, right?
So there's some natural transmit flow control that doesn't exist here. The sending side now just keeps submitting quickly to the transmit queue and quickly fails rather than taking time to process into the nic queue.
I'll have another look in a couple days. Those drops mean a lot of mbuf allocator churn, right? We are allocating tags and an mbuf and then freeing them upon netisr queue failure. Can you verify that theory via vmstat?
I was sending 2000 UDP flows at gigabit line rate (1.48Mpps):
[root@R2]~# vmstat -i
interrupt                          total       rate
irq4: uart0                          525          3
irq16: atapci0                       140          1
irq17: uhci0 uhci2+                   33          0
cpu0:timer                         31124        169
cpu1:timer                         28883        157
cpu2:timer                         29053        158
cpu3:timer                         32384        176
irq256: bce0                         118          1
irq263: igb1:rxq0                      1          0
irq268: igb2:rxq0                 134280        729
irq269: igb2:rxq1                 136486        741
irq270: igb2:rxq2                 136184        739
irq271: igb2:rxq3                 140669        764
irq272: igb2:aq                        2          0
irq273: igb3:rxq0                      3          0
irq277: igb3:aq                        2          0
irq278: mpt0                        3645         20
Total                             673532       3656

[root@R2]~# top -HSb
last pid:  1162;  load averages:  1.84,  0.55,  0.20  up 0+00:03:29  07:33:41
109 processes: 8 running, 83 sleeping, 18 waiting
Mem: 10M Active, 688K Inact, 216M Wired, 15M Buf, 15G Free
Swap:

  PID USERNAME  PRI NICE  SIZE   RES STATE   C   TIME    WCPU COMMAND
    0 root      -76    -    0K  240K CPU1    1   0:36  76.17% kernel{softirq_1}
    0 root      -76    -    0K  240K -       3   0:36  76.07% kernel{softirq_3}
    0 root      -76    -    0K  240K CPU2    2   0:36  75.98% kernel{softirq_2}
    0 root      -76    -    0K  240K -       0   0:36  75.20% kernel{softirq_0}
   12 root      -72    -    0K  304K CPU3    3   0:26  55.37% intr{swi1: netisr 0}
   11 root      155 ki31    0K   64K RUN     1   3:09  14.79% idle{idle: cpu1}
   11 root      155 ki31    0K   64K RUN     2   2:44  14.79% idle{idle: cpu2}
   11 root      155 ki31    0K   64K RUN     3   3:05  14.60% idle{idle: cpu3}
   11 root      155 ki31    0K   64K RUN     0   3:08  13.77% idle{idle: cpu0}
    0 root      -16    -    0K  240K swapin  1   3:38   0.00% kernel{swapper}
    4 root      -16    -    0K   32K -       3   0:04   0.00% cam{scanner}
   12 root      -60    -    0K  304K WAIT    2   0:01   0.00% intr{swi4: clock (0)}
    0 root        8    -    0K  240K -       3   0:01   0.00% kernel{thread taskq}
   12 root      -88    -    0K  304K WAIT    2   0:01   0.00% intr{irq16: atapci0}
    4 root      -16    -    0K   32K -       0   0:00   0.00% cam{doneq0}
   14 root      -68    -    0K  576K -       1   0:00   0.00% usb{usbus3}
   14 root      -68    -    0K  576K -       1   0:00   0.00% usb{usbus0}
   14 root      -68    -    0K  576K -       0   0:00   0.00% usb{usbus2}

[root@R2]~# netstat -iw 1
            input        (Total)           output
   packets  errs idrops      bytes    packets  errs      bytes colls
   1489807     0     0  178775646     202564     0   18230912     0
   1487955     0     0  178554606     202680     0   18241284     0
   1488185     0     0  178582026     203322     0   18299154     0
   1488197     0     0  178583586     201830     0   18164694     0
   1488577     0     0  178629966     203828     0   18344604     0
   1487709     0     0  178524966     202929     0   18263694     0
   1488501     0     0  178619166     204997     0   18449814     0
   1487926     0     0  178551726     203724     0   18335244     0
All 4 igb2 input queues are consuming all my 4 cores, preventing netisr from getting enough CPU time: it can output only about 200Kpps.
These 2000 unencrypted flows are multiplexed into one encrypted IPsec flow, so only one output queue of igb3 is used.
Notice the netstat input statistics: they correctly display 1.48Mpps, but "errs" and "idrops" are empty.
Then I lowered the packet generator to only 600Kpps, which should allow netisr to get more CPU time:
[root@R2]~# top -HSb
last pid:  1171;  load averages:  2.91,  2.12,  1.04  up 0+00:08:17  07:38:29
109 processes: 6 running, 85 sleeping, 18 waiting
Mem: 10M Active, 704K Inact, 216M Wired, 15M Buf, 15G Free
Swap:

  PID USERNAME  PRI NICE  SIZE   RES STATE   C   TIME    WCPU COMMAND
   12 root      -72    -    0K  304K CPU2    2   3:00  76.37% intr{swi1: netisr 0}
   11 root      155 ki31    0K   64K CPU0    0   4:45  57.18% idle{idle: cpu0}
   11 root      155 ki31    0K   64K RUN     2   4:23  56.40% idle{idle: cpu2}
   11 root      155 ki31    0K   64K RUN     3   4:40  55.27% idle{idle: cpu3}
   11 root      155 ki31    0K   64K CPU1    1   4:43  51.17% idle{idle: cpu1}
    0 root      -76    -    0K  240K -       1   3:10  30.66% kernel{softirq_1}
    0 root      -76    -    0K  240K -       3   3:09  30.57% kernel{softirq_3}
    0 root      -76    -    0K  240K -       2   3:10  30.47% kernel{softirq_2}
    0 root      -76    -    0K  240K -       0   3:08  30.27% kernel{softirq_0}
    0 root      -16    -    0K  240K swapin  0   3:38   0.00% kernel{swapper}
    4 root      -16    -    0K   32K -       3   0:04   0.00% cam{scanner}
   12 root      -60    -    0K  304K WAIT    0   0:01   0.00% intr{swi4: clock (0)}
    0 root        8    -    0K  240K -       2   0:01   0.00% kernel{thread taskq}
   12 root      -88    -    0K  304K WAIT    2   0:01   0.00% intr{irq16: atapci0}
    4 root      -16    -    0K   32K -       3   0:00   0.00% cam{doneq0}
   14 root      -68    -    0K  576K -       0   0:00   0.00% usb{usbus3}
   14 root      -68    -    0K  576K -       0   0:00   0.00% usb{usbus0}
   14 root      -68    -    0K  576K -       0   0:00   0.00% usb{usbus2}

[root@R2]~# netstat -iw 1
            input        (Total)           output
   packets  errs idrops      bytes    packets  errs      bytes colls
    616416     0     0   73969866     316102     0   28449312     0
    600070     0     0   72008346     308140     0   27732774     0
    599587     0     0   71950386     308187     0   27736824     0
    584671     0     0   70160466     299747     0   26977314     0
    600065     0     0   72007746     308091     0   27728274     0
    615425     0     0   73850946     316062     0   28445664     0
    600065     0     0   72007746     308358     0   27752304     0
    600065     0     0   72007746     308221     0   27739974     0
Now that netisr has more CPU time, output packet performance increases from 200Kpps to 300Kpps (FYI, it was about 550Kpps without netisr involved).
Then, following your advice, I bumped net.isr.maxthreads and bound the threads:
- net.isr.bindthreads=1
- net.isr.maxthreads=4 (because I've got only 4 cores here)
[root@R2]~# top -HSb
last pid:  1188;  load averages:  1.84,  0.68,  0.28  up 0+00:24:49  08:08:13
112 processes: 6 running, 85 sleeping, 21 waiting
Mem: 10M Active, 688K Inact, 216M Wired, 15M Buf, 15G Free
Swap:

  PID USERNAME  PRI NICE  SIZE   RES STATE   C   TIME    WCPU COMMAND
   12 root      -72    -    0K  352K CPU1    1   1:14  99.76% intr{swi1: netisr 1}
   11 root      155 ki31    0K   64K CPU0    0  24:47  77.10% idle{idle: cpu0}
   11 root      155 ki31    0K   64K CPU2    2  24:21  76.86% idle{idle: cpu2}
   11 root      155 ki31    0K   64K RUN     3  24:48  76.27% idle{idle: cpu3}
    0 root      -76    -    0K  240K -       2   0:22  27.98% kernel{softirq_2}
    0 root      -76    -    0K  240K -       0   0:22  27.88% kernel{softirq_0}
    0 root      -76    -    0K  240K -       3   0:22  27.59% kernel{softirq_3}
   11 root      155 ki31    0K   64K RUN     1  23:53   0.00% idle{idle: cpu1}
    0 root      -16    -    0K  240K swapin  0   3:39   0.00% kernel{swapper}
    0 root      -76    -    0K  240K -       1   0:04   0.00% kernel{softirq_1}
    4 root      -16    -    0K   32K -       2   0:04   0.00% cam{scanner}
    0 root        8    -    0K  240K -       3   0:01   0.00% kernel{thread taskq}
   14 root      -68    -    0K  576K -       0   0:01   0.00% usb{usbus3}
   12 root      -60    -    0K  352K WAIT    2   0:01   0.00% intr{swi4: clock (0)}
    4 root      -16    -    0K   32K -       0   0:00   0.00% cam{doneq0}
    6 root      -16    -    0K   16K idle    3   0:00   0.00% mpt_raid0
   14 root      -68    -    0K  576K -       3   0:00   0.00% usb{usbus1}
   14 root      -68    -    0K  576K -       0   0:00   0.00% usb{usbus4}

[root@R2]~# netstat -iw 1
            input        (Total)           output
   packets  errs idrops      bytes    packets  errs      bytes colls
    448772     0     0   53852586     444061     0   39965706     0
    449147     0     0   53897586     444188     0   39977004     0
    449146     0     0   53897466     444030     0   39962784     0
    449146     0     0   53897466     444192     0   39977364     0
    449148     0     0   53897706     444040     0   39963684     0
    449150     0     0   53897946     444049     0   39964584     0
    449140     0     0   53896746     444160     0   39974394     0
    449159     0     0   53899026     444154     0   39973944     0
    448772     0     0   53852586     444016     0   39961524     0
    449147     0     0   53897586     443997     0   39959814     0
    449146     0     0   53897466     444210     0   39978984     0
    449146     0     0   53897466     444093     0   39968454     0
    449148     0     0   53897706     444085     0   39967824     0
    449147     0     0   53897586     444145     0   39973044     0
    449112     0     0   53893506     444050     0   39964584     0
    449190     0     0   53902626     443928     0   39953604     0
    448772     0     0   53852586     444041     0   39963774     0
and performance increases from 290Kpps to about 440Kpps (still not the 550Kpps obtained without netisr, but a lot better).
Notice the "netstat -iw 1" behavior change here since net.isr.maxthreads=4: the pkt generator still generate 600Kpps, but netstat show only 449Kpps as input (!?)
But do we agree that on an amd64 (KSTACK_PAGES=4) arch I don't need to enable netisr (net.inet.ipsec.use_netisr=1)?
Perhaps it would be better to find a 32-bit platform for benchmarking this patch?
About vmstat output:
[root@R2]~# vmstat -z
ITEM                   SIZE    LIMIT     USED     FREE        REQ FAIL SLEEP
UMA Kegs:               384,       0,     206,       4,       206,   0,   0
UMA Zones:             1152,       0,     207,       0,       207,   0,   0
UMA Slabs:               80,       0,    6529,      21,      7099,   0,   0
UMA Hash:               256,       0,      10,       5,        15,   0,   0
4 Bucket:                32,       0,      89,    1036,      1169,   0,   0
6 Bucket:                48,       0,       9,    1070,       157,   0,   0
8 Bucket:                64,       0,      97,    1205,      8061,  11,   0
12 Bucket:               96,       0,      21,     512,     51790,   0,   0
16 Bucket:              128,       0,      83,     444,     32953,   0,   0
32 Bucket:              256,       0,      72,     303,      8760, 137,   0
64 Bucket:              512,       0,     111,     121,       440, 137,   0
128 Bucket:            1024,       0,      93,     119,       897,   0,   0
256 Bucket:            2048,       0,     214,     118,   1140785,  56,   0
vmem btag:               56,       0,    6887,     639,      6887,  54,   0
(...)
mbuf_packet:            256, 6505365,    2040,    1502,      2921,   0,   0
mbuf:                   256, 6505365,   13507,    5721, 468037402,   0,   0
mbuf_cluster:          2048, 1016464,    8150,      98,      9175,   0,   0
mbuf_jumbo_page:       4096,  508231,       0,       3,         7,   0,   0
mbuf_jumbo_9k:         9216,  150587,       0,       0,         0,   0,   0
mbuf_jumbo_16k:       16384,   84705,       0,       0,         0,   0,   0
DMAR_MAP_ENTRY:         120,       0,       0,       0,         0,   0,   0
g_bio:                  376,       0,       0,     290,     20672,   0,   0
ttyinq:                 160,       0,     195,     130,       555,   0,   0
ttyoutq:                256,       0,     101,      94,       287,   0,   0
nvme_request:           128,       0,       0,       0,         0,   0,   0
cryptop:                 88,       0,       0,     135, 147266110,   0,   0
cryptodesc:              72,       0,       1,     164, 147266110,   0,   0
vtnet_tx_hdr:            24,       0,       0,       0,         0,   0,   0
Yes, I was not fully correct. The crypto(9) worker thread does its work only when direct crypto dispatch fails.
Though two things that are problematic:
- Mbuf tag
I don't see another way to pass the several needed arguments to netisr_queue().
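For illustration, a hedged sketch of the mbuf_tags(9) pattern being discussed. PACKET_TAG_IPSEC_NETISR, struct ipsec_netisr_args and the two functions are made-up names rather than the identifiers used by the patch, but m_tag_get()/m_tag_prepend()/m_tag_find() are the standard KPI for riding extra state on the mbuf through the queue.

```
#include <sys/param.h>
#include <sys/systm.h>
#include <sys/errno.h>
#include <sys/malloc.h>
#include <sys/mbuf.h>
#include <net/netisr.h>

#define	PACKET_TAG_IPSEC_NETISR	99		/* hypothetical tag type */

struct ipsec_netisr_args {
	void	*sp;			/* security policy reference */
	int	 af;			/* address family of the packet */
	int	 flags;
};

static int
ipsec_enqueue(struct mbuf *m, void *sp, int af, int flags)
{
	struct ipsec_netisr_args *args;
	struct m_tag *mtag;

	mtag = m_tag_get(PACKET_TAG_IPSEC_NETISR, sizeof(*args), M_NOWAIT);
	if (mtag == NULL) {
		m_freem(m);
		return (ENOBUFS);
	}
	args = (struct ipsec_netisr_args *)(mtag + 1);
	args->sp = sp;
	args->af = af;
	args->flags = flags;
	m_tag_prepend(m, mtag);

	/* On queue overflow netisr_queue() frees the mbuf and its tags. */
	return (netisr_queue(NETISR_IPSEC, m));
}

static void
ipsec_netisr_handler(struct mbuf *m)
{
	struct ipsec_netisr_args *args;
	struct m_tag *mtag;

	mtag = m_tag_find(m, PACKET_TAG_IPSEC_NETISR, NULL);
	if (mtag == NULL) {
		m_freem(m);
		return;
	}
	args = (struct ipsec_netisr_args *)(mtag + 1);
	/* ... run the transforms for args->sp and re-enter ip_output() ... */
	m_tag_delete(m, mtag);
}
```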
- bouncing between CPUs by keying the flow on the SPI; most probably you should tell netisr to dispatch on the same CPU, there is some way to do that IIRC
As I said, the main goal of the patch is reducing kernel stack usage. This is solved by queuing and deferred processing. With direct processing we would still use the same thread context, and thus the stack overflow would not be solved.
Yes.
So there's some natural transmit flow control that doesn't exist here. The sending side now just keeps submitting quickly to the transmit queue and quickly fails rather than taking time to process into the nic queue.
When ip_output() decides that we need IPsec processing, it calls the IPSEC_OUTPUT method. When use_netisr is disabled, IPsec does the crypto processing and invokes ip_output() again. In case you have a bundled security policy, this is done several times, all in the same thread context. There is a comment that describes this:
https://svnweb.freebsd.org/base/head/sys/netipsec/ipsec_output.c?view=annotate#l756
Also note that the IPsec code can invoke pfil processing, which also has its own stack usage.
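To make the stack-depth argument concrete, here is a deliberately simplified, hypothetical sketch of the direct (use_netisr disabled) path; all names except ip_output() are stand-ins for the real ipsec_output.c machinery rather than actual functions.

```
#include <sys/errno.h>

struct mbuf;
struct secpolicy;

/* Stand-ins for the real transform, bundle iteration and final transmit. */
struct mbuf	*do_crypto_transform(struct mbuf *, struct secpolicy *);
struct secpolicy *next_bundled_policy(struct secpolicy *);
int		 ip_output_final(struct mbuf *);

/*
 * Each bundled transform (and any pfil pass it triggers) runs on the same
 * kernel stack as the original ip_output() call, so the depth grows with
 * the configuration.  The netisr path breaks the chain by queueing instead.
 */
static int
ipsec_output_direct(struct mbuf *m, struct secpolicy *sp)
{
	if (sp == NULL)
		return (ip_output_final(m));	/* hand the packet to the NIC */

	m = do_crypto_transform(m, sp);		/* ESP/AH for this SA */
	if (m == NULL)
		return (ENOBUFS);

	/* Re-enter IP output for the next layer of the bundle -- same stack. */
	return (ipsec_output_direct(m, next_bundled_policy(sp)));
}
```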
I'll have another look in a couple days. Those drops mean a lot of mbuf allocator churn, right? We are allocating tags and an mbuf and then freeing them upon netisr queue failure. Can you verify that theory via vmstat?
I don't think so; the drops occur when the netisr queue overflows, just after ip_output() decides that it needs IPsec processing.
The only allocation there is the mbuf tag allocation used to pass additional arguments to the deferred processing handler.
I think we have three options:
- do not change anything and require kern.kstack_pages=4 for IPsec to work without panics;
- apply this patch so IPsec works (more slowly) without panics when kern.kstack_pages is below 4;
- rework the patch to use a different solution.
Right.
So the latest experiments seem to confirm the theory - the softirq threads are getting more CPU time than the netisr threads. There's no multiplexing into multiple destination netisr threads either.
I'll have another look in a couple days. Those drops mean a lot of mbuf allocator churn, right? We are allocating tags and an mbuf and then freeing them upon netisr queue failure. Can you verify that theory via vmstat?
I don't think so; the drops occur when the netisr queue overflows, just after ip_output() decides that it needs IPsec processing.
The only allocation there is the mbuf tag allocation used to pass additional arguments to the deferred processing handler.
Well, it's allocating an mbuf, adding mtags, then freeing it if it fails to queue. A lot of our network stack does this because of, well, historical reasons.
Ok, so on the output path once the netisr path is called, how much of that work is done without a global lock (eg crypto?) I wonder how much of this change in throughput is because the netisr thread != the NIC input CPU path, and RX is saturating (more) cores. Versus, say, some parallel processing.
Because my guess is that the RX -> IP stuff -> IP TX stuff -> ipsec encrypt -> NIC if_transmit() path is happening in one complete thread, to completion each time, so there's backpressure (the NIC for example will drop frames early if the CPU is saturated for example, and/or send pause frames, etc.)
Can we say, run the netisr thread at a higher priority than the NIC softirq, instead of the other way around? That way netisr isn't preempted by the NIC RX softirq?
After ipsec encrypt there is yet another "IP TX stuff" and then if_transmit :)
You can easily see the same behavior with dropped inbound frames with pfil.
When a firewall has too many unoptimized rules and takes too much CPU time, such direct run-to-completion processing will produce drops on RX. And Olivier has graphs that show this.
Can we say, run the netisr thread at a higher priority than the NIC softirq, instead of the other way around? That way netisr isn't preempted by the NIC RX softirq?
I'm not sure; we usually use direct dispatching, and it seems this can affect only hybrid mode.
Yeah, but here's the rub:
- if you do 100% direct dispatch, the NIC RX is (somewhat) paced by the drain rate of the rest of the stack handling. Now, some NIC drivers will do batched RX, most of them will just call if_input() on each packet without any refill or queuing
- then there's no netisr requeuing going on; 99% of the time it'll just be direct dispatched up
- so if the rest of the stack handling is taking up a lot of CPU time or latency (eg lock contention, context switching, etc), it'll end up causing the NIC to run out of RX slots, and that'll cause backpressure.
If that's going on then hopefully (!) we can see it with, say, an increase in TX'ed pause frames on the RX side, and/or, if it counts "packets received with no descriptors", that'll help. Olivier, can you do a sysctl dump before/after these tests so we can see?
But in the netisr side, you're immediately finishing the enqueue and letting the NIC driver continue to run RX, so it has more opportunity to just suck in more packets that'll get discarded by the netisr side of things. There's no backpressure telling the RX side "hey, stop trying to RX stuff into this queue, I'm just having to toss it in later processing." The TX side somewhat has the same problem too.
Olivier - interestingly, in your test netisr chews a whole core and it doesn't allow any other NIC processing to go on on that CPU. Hm! That may be why the input pps is lower; we're dropping 1/4 of the flows really early. Can you post the igb sysctl output? Those counters come from the hardware.
The problem is that the igb sysctl counters seem useless (I don't see any RX "drop" counters). Here is the output during a bench (with 4 netisr threads, pkt-gen generating 600Kpps):
[root@R2]~# netstat -iw 1
            input        (Total)           output
   packets  errs idrops      bytes    packets  errs      bytes colls
    450735     0     0   54088146     442452     0   39820812     0
    449970     0     0   53996346     441985     0   39778734     0
    450342     0     0   54040986     442182     0   39796464     0
    450351     0     0   54042066     442190     0   39797184     0
    450347     0     0   54041586     442021     0   39781974     0
    450350     0     0   54041946     442082     0   39787464     0
    450349     0     0   54041826     442024     0   39782334     0
^C
[root@R2]~# echo "Only 450Kpps received, where are the missing 150Kpps (pkt-gen generate 600Kpps) ?"
Only 450Kpps, where are the missing 150Kpps (pkt-gen generate 600Kpps) ?
[root@R2]~# sysctl dev.igb.2.
dev.igb.2.interrupts.rx_overrun: 0
dev.igb.2.interrupts.rx_desc_min_thresh: 0
dev.igb.2.interrupts.tx_queue_min_thresh: 0
dev.igb.2.interrupts.tx_queue_empty: 0
dev.igb.2.interrupts.tx_abs_timer: 0
dev.igb.2.interrupts.tx_pkt_timer: 0
dev.igb.2.interrupts.rx_abs_timer: 0
dev.igb.2.interrupts.rx_pkt_timer: 53504000
dev.igb.2.interrupts.asserts: 1225972
dev.igb.2.mac_stats.tso_ctx_fail: 0
dev.igb.2.mac_stats.tso_txd: 0
dev.igb.2.mac_stats.tx_frames_1024_1522: 0
dev.igb.2.mac_stats.tx_frames_512_1023: 0
dev.igb.2.mac_stats.tx_frames_256_511: 0
dev.igb.2.mac_stats.tx_frames_128_255: 0
dev.igb.2.mac_stats.tx_frames_65_127: 0
dev.igb.2.mac_stats.tx_frames_64: 0
dev.igb.2.mac_stats.mcast_pkts_txd: 0
dev.igb.2.mac_stats.bcast_pkts_txd: 0
dev.igb.2.mac_stats.good_pkts_txd: 0
dev.igb.2.mac_stats.total_pkts_txd: 0
dev.igb.2.mac_stats.good_octets_txd: 0
dev.igb.2.mac_stats.good_octets_recvd: 3424256000
dev.igb.2.mac_stats.rx_frames_1024_1522: 0
dev.igb.2.mac_stats.rx_frames_512_1023: 0
dev.igb.2.mac_stats.rx_frames_256_511: 0
dev.igb.2.mac_stats.rx_frames_128_255: 0
dev.igb.2.mac_stats.rx_frames_65_127: 0
dev.igb.2.mac_stats.rx_frames_64: 53504000
dev.igb.2.mac_stats.mcast_pkts_recvd: 0
dev.igb.2.mac_stats.bcast_pkts_recvd: 0
dev.igb.2.mac_stats.good_pkts_recvd: 53504000
dev.igb.2.mac_stats.total_pkts_recvd: 53504000
dev.igb.2.mac_stats.xoff_txd: 0
dev.igb.2.mac_stats.xoff_recvd: 0
dev.igb.2.mac_stats.xon_txd: 0
dev.igb.2.mac_stats.xon_recvd: 0
dev.igb.2.mac_stats.coll_ext_errs: 0
dev.igb.2.mac_stats.alignment_errs: 0
dev.igb.2.mac_stats.crc_errs: 0
dev.igb.2.mac_stats.recv_errs: 0
dev.igb.2.mac_stats.recv_jabber: 0
dev.igb.2.mac_stats.recv_oversize: 0
dev.igb.2.mac_stats.recv_fragmented: 0
dev.igb.2.mac_stats.recv_undersize: 0
dev.igb.2.mac_stats.recv_no_buff: 0
dev.igb.2.mac_stats.missed_packets: 0
dev.igb.2.mac_stats.defer_count: 0
dev.igb.2.mac_stats.sequence_errors: 0
dev.igb.2.mac_stats.symbol_errors: 0
dev.igb.2.mac_stats.collision_count: 0
dev.igb.2.mac_stats.late_coll: 0
dev.igb.2.mac_stats.multiple_coll: 0
dev.igb.2.mac_stats.single_coll: 0
dev.igb.2.mac_stats.excess_coll: 0
dev.igb.2.queue_rx_3.rx_irq: 0
dev.igb.2.queue_rx_3.rxd_tail: 951
dev.igb.2.queue_rx_3.rxd_head: 953
dev.igb.2.queue_rx_2.rx_irq: 0
dev.igb.2.queue_rx_2.rxd_tail: 325
dev.igb.2.queue_rx_2.rxd_head: 327
dev.igb.2.queue_rx_1.rx_irq: 0
dev.igb.2.queue_rx_0.rx_irq: 0
dev.igb.2.queue_rx_0.rxd_tail: 324
dev.igb.2.queue_rx_0.rxd_head: 326
dev.igb.2.queue_tx_3.no_desc_avail: 0
dev.igb.2.queue_tx_3.tx_irq: 0
dev.igb.2.queue_tx_3.txd_tail: 0
dev.igb.2.queue_tx_3.txd_head: 0
dev.igb.2.queue_tx_2.no_desc_avail: 0
dev.igb.2.queue_tx_2.tx_irq: 0
dev.igb.2.queue_tx_2.txd_tail: 0
dev.igb.2.queue_tx_2.txd_head: 0
dev.igb.2.queue_tx_1.no_desc_avail: 0
dev.igb.2.queue_tx_1.tx_irq: 0
dev.igb.2.queue_tx_1.txd_tail: 0
dev.igb.2.queue_tx_1.txd_head: 0
dev.igb.2.queue_tx_0.no_desc_avail: 0
dev.igb.2.queue_tx_0.tx_irq: 0
dev.igb.2.queue_tx_0.txd_tail: 0
dev.igb.2.queue_tx_0.txd_head: 0
dev.igb.2.fc_low_water: 34800
dev.igb.2.fc_high_water: 34816
dev.igb.2.rx_control: 67141662
dev.igb.2.device_control: 13632065
dev.igb.2.watchdog_timeouts: 0
dev.igb.2.rx_overruns: 0
dev.igb.2.tx_dma_fail: 0
dev.igb.2.mbuf_defrag_fail: 0
dev.igb.2.link_irq: 3
dev.igb.2.dropped: 0
dev.igb.2.eee_control: 1
dev.igb.2.rx_processing_limit: -1
dev.igb.2.itr: 488
dev.igb.2.tx_abs_int_delay: 66
dev.igb.2.rx_abs_int_delay: 66
dev.igb.2.tx_int_delay: 66
dev.igb.2.rx_int_delay: 0
dev.igb.2.reg_dump:
General Registers
	CTRL		00d00241
	STATUS		0008078b
	CTRL_EXIT	00100c80
Interrupt Registers
	ICR		00000000
RX Registers
	RCTL		0400801e
	RDLEN		00004000
	RDH		000001c7
	RDT		000001c5
	RXDCTL		02040808
	RDBAL		7d3b4000
	RDBAH		00000000
TX Registers
	TCTL		b50400fa
	TDBAL		7d3a4000
	TDBAH		00000000
	TDLEN		00004000
	TDH		00000000
	TDT		00000000
	TXDCTL		0341011f
	TDFH		00000001
	TDFT		00000000
	TDFHS		00000000
	TDFPC		00000000
dev.igb.2.fc: 0
dev.igb.2.debug: -1
dev.igb.2.nvm: -1
dev.igb.2.iflib.rxq3.rxq_fl0.credits: 1023
dev.igb.2.iflib.rxq3.rxq_fl0.cidx: 56
dev.igb.2.iflib.rxq3.rxq_fl0.pidx: 55
dev.igb.2.iflib.rxq2.rxq_fl0.credits: 1023
dev.igb.2.iflib.rxq2.rxq_fl0.cidx: 456
dev.igb.2.iflib.rxq2.rxq_fl0.pidx: 455
dev.igb.2.iflib.rxq1.rxq_fl0.credits: 1023
dev.igb.2.iflib.rxq1.rxq_fl0.cidx: 100
dev.igb.2.iflib.rxq1.rxq_fl0.pidx: 99
dev.igb.2.iflib.rxq0.rxq_fl0.credits: 1023
dev.igb.2.iflib.rxq0.rxq_fl0.cidx: 455
dev.igb.2.iflib.rxq0.rxq_fl0.pidx: 454
(...txq stats skipped...)
dev.igb.2.iflib.override_nrxds: 0
dev.igb.2.iflib.override_ntxds: 0
dev.igb.2.iflib.override_qs_enable: 0
dev.igb.2.iflib.override_nrxqs: 0
dev.igb.2.iflib.override_ntxqs: 0
dev.igb.2.iflib.driver_version: 7.6.1-k
(...)
Can you compare the sender counters to the receiver counters? ie, is the sender actually queuing and TX'ing 600kpps, or is it something stupid like pause frames causing it to be dropped on output?
Flow control (pause frame) is totally disabled (rx and tx):
dev.igb.2.fc: 0
And the switch port statistics confirm that:
- Ethernet flow control correctly disabled
- it sends about 600Kpps to the server
BSDRP-bench#sh int Gi1/0/15
GigabitEthernet1/0/15 is up, line protocol is up (connected)
  Hardware is Gigabit Ethernet, address is b0fa.eb14.d20f (bia b0fa.eb14.d20f)
  Description: BSDRP2-igb2
  MTU 1500 bytes, BW 1000000 Kbit/sec, DLY 10 usec,
     reliability 255/255, txload 78/255, rxload 1/255
  Encapsulation ARPA, loopback not set
  Keepalive not set
  Full-duplex, 1000Mb/s, media type is 10/100/1000BaseTX
  input flow-control is off, output flow-control is unsupported
  ARP type: ARPA, ARP Timeout 04:00:00
  Last input never, output 18w5d, output hang never
  Last clearing of "show interface" counters 00:00:25
  Input queue: 0/75/0/0 (size/max/drops/flushes); Total output drops: 0
  Queueing strategy: fifo
  Output queue: 0/40 (size/max)
  30 second input rate 0 bits/sec, 0 packets/sec
  30 second output rate 307182000 bits/sec, 599967 packets/sec
     0 packets input, 0 bytes, 0 no buffer
     Received 0 broadcasts (0 multicasts)
     0 runts, 0 giants, 0 throttles
     0 input errors, 0 CRC, 0 frame, 0 overrun, 0 ignored
     0 watchdog, 0 multicast, 0 pause input
     0 input packets with dribble condition detected
     14993035 packets output, 959552000 bytes, 0 underruns
     0 output errors, 0 collisions, 0 interface resets
     0 unknown protocol drops
     0 babbles, 0 late collision, 0 deferred
     0 lost carrier, 0 no carrier, 0 pause output
     0 output buffer failures, 0 output buffers swapped out
But I will create a different PR for this problem: I don't want to pollute this review ;-)
ok. Yes, create a different PR to figure out whether there are any missing counters in igb / iflib+igb. That'll make diagnosing/fixing issues with the netisr work easier.
As for the netisr work and performance delta - I think it's okay to commit it as-is for now as this is a more generic netisr problem, not specifically for how ipsec is doing it. The fact we don't have a nice way to signal back pressure up the stack (because as I said, lots of things expect to just see packets drop at the queue point, rather than signal up to the original producer there's an issue) is a general problem, and we're not going to solve it here.
btw - if it doesn't introduce a perf regression when disabled on 32 bit platforms, I'm ok with this landing if it hasn't yet.