Add netisr queue for deferred IPsec processing to reduce kernel stack requirements
Accepted · Public

Authored by ae on Feb 22 2017, 10:23 AM.

Details

Reviewers
adrian
olivier
Group Reviewers
network
Summary

Several users reported kernel stack exhaustion, and thus panics, on the i386 architecture when IPsec is used in complex network configurations, e.g. when netgraph nodes, pf/ipfw, or gif/gre are in use together with IPsec. The panics contain very long call chains.
To shorten such chains, I added a new netisr(9) protocol, NETISR_IPSEC, and modified the IPsec code so it can use a netisr queue for deferred IPsec processing of outbound packets.
The feature is disabled by default on architectures where kstack_pages is greater than or equal to 4; that seems to be enough to work without netisr and without stack exhaustion.
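As a hedged sketch (the sysctl names are the ones discussed in this review; the default depends on the architecture's KSTACK_PAGES), the knob could be inspected and toggled at runtime roughly like this:

```shell
# Sketch only: sysctl names as discussed in this review, on a FreeBSD box
# with the patch applied.
sysctl net.inet.ipsec.use_netisr       # 1 = deferred (netisr) processing
sysctl net.inet.ipsec.use_netisr=1     # force deferred processing, e.g. on amd64
```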

Diff Detail

Lint
Lint OK
Unit
No Unit Test Coverage
Build Status
Buildable 7741
Build 7888: arc lint + arc unit
ae retitled this revision from to Add netisr queue for deferred IPsec processing to reduce kernel stack requirements.Feb 22 2017, 10:23 AM
ae updated this object.
ae edited the test plan for this revision. (Show Details)
ae added a reviewer: network.
ae set the repository for this revision to rS FreeBSD src repository.
gnn added a subscriber: gnn.Feb 23 2017, 12:45 PM
gnn added inline comments.
sys/netipsec/ipsec.c
1421

!! ?

ae marked an inline comment as done.Feb 24 2017, 7:08 AM
ae added inline comments.
sys/netipsec/ipsec.c
1421

Yes, in this case it is useless. I'll remove it.

ae updated this revision to Diff 25704.Feb 26 2017, 11:57 AM
ae marked an inline comment as done.
  • Remove useless !!.
ae added a comment.Mar 3 2017, 3:42 PM

So, if there are no objections, I'll commit this patch.

gnn added a reviewer: olivier.Mar 4 2017, 12:22 AM

If you can wait a day or so I'd love to have Olivier, now added to the review, run one of his perf tests on this.

rwatson added a subscriber: rwatson.Mar 4 2017, 8:56 AM

This seems like a sensible general change. I'm quite surprised it wasn't this way already (.. and sort of misremembered that it was -- IPSEC should always have been using the netisr...).

Here is the benefit on a standard server:

Impact of D9721 on IPSec performance (null encryption)

  • IBM System x3550 M3 with quad cores (Intel Xeon L5630 2.13GHz, hyper-threading disabled)
  • Quad port Intel 82580
  • FreeBSD 12-head r314642
  • harvest.mask=351
  • 2000 flows of cleartext minimum-size UDP packets, multiplexed into one IPsec flow
x r314642 packet-per-seconds
+ r314642 with D9721 packet-per-seconds
+--------------------------------------------------------------------------+
|                                 x                                     +  |
|x                             x  x           + x +                +    +  |
|           |_________________A___M____________|                           |
|                                                |____________A____M______||
+--------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x   5        546152        550381        549167      548753.4     1564.3483
+   5      550273.5      552629.5      552169.5      551659.7     1129.4238
Difference at 95.0% confidence
        2906.3 +/- 1989.79
        0.529619% +/- 0.363866%
        (Student's t, pooled s = 1364.33)

Do you want flamegraph too?

ae added a comment.Mar 4 2017, 5:17 PM

Difference at 95.0% confidence

2906.3 +/- 1989.79
0.529619% +/- 0.363866%
(Student's t, pooled s = 1364.33)
Do you want flamegraph too?

Have you enabled net.inet.ipsec.use_netisr=1?
Since the main goal of this patch is reducing the kernel stack requirements, it is enabled by default only when KSTACK_PAGES is less than 4.
On amd64 it is not enabled by default.

ae added a comment.Mar 4 2017, 5:21 PM

Difference at 95.0% confidence

2906.3 +/- 1989.79
0.529619% +/- 0.363866%
(Student's t, pooled s = 1364.33)
Do you want flamegraph too?

You may also need to increase the netisr queue length if the number of drops grows.
Look at the netstat -Q output, and use the net.inet.ipsec.intr_queue_maxlen variable if the drop counter keeps growing.

With netisr enabled on this 64-bit platform, performance decreases by about 22%:

x r314642 packet-per-seconds
+ r314642 with D9721 packet-per-seconds
* r314642 with D9721 and net.inet.ipsec.use_netisr=1 packet-per-seconds
+--------------------------------------------------------------------------+
|   *                                                                   x++|
|*  *   *     *                                                       x x*+|
|                                                                      |A| |
|                                                                        AM|
||__M_A____|                                                               |
+--------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x   5        546152        550381        549167      548753.4     1564.3483
+   5      550273.5      552629.5      552169.5      551659.7     1129.4238
Difference at 95.0% confidence
        2906.3 +/- 1989.79
        0.529619% +/- 0.363866%
        (Student's t, pooled s = 1364.33)
*   5        416795      440933.5        423271      426451.8     9207.3367
Difference at 95.0% confidence
        -122302 +/- 9631.37
        -22.2872% +/- 1.74536%
        (Student's t, pooled s = 6603.87)
ae added a comment.Mar 4 2017, 8:02 PM
  • 5 416795 440933.5 423271 426451.8 9207.3367 Difference at 95.0% confidence -122302 +/- 9631.37 -22.2872% +/- 1.74536% (Student's t, pooled s = 6603.87)

Have you observed packet loss in the netstat -Q statistics? I guess it is possible at such a packet rate with the default queue length.

In D9721#204270, @ae wrote:
  • 5 416795 440933.5 423271 426451.8 9207.3367 Difference at 95.0% confidence -122302 +/- 9631.37 -22.2872% +/- 1.74536% (Student's t, pooled s = 6603.87)

Have you observed the packet loss in the netstat -Q statistics? I guess it is possible on such packet rate with default queue length.

Yes, lots:

[root@R2]~# netstat -Q
Configuration:
Setting                        Current        Limit
Thread count                         1            1
Default queue limit                256        10240
Dispatch policy                 direct          n/a
Threads bound to CPUs         disabled          n/a
 
Protocols:
Name   Proto QLimit Policy Dispatch Flags
ip         1    256   flow  default   ---
igmp       2    256 source  default   ---
rtsock     3   2048 source  default   ---
arp        4    256 source  default   ---
ether      5    256 source   direct   ---
ip6        6    256   flow  default   ---
ipsec     11    256 source  default   ---
 
Workstreams:
WSID CPU   Name     Len WMark   Disp'd  HDisp'd   QDrops   Queued  Handled
   0   0   ip         0     0 89370650        0        0        0 89370650
   0   0   igmp       0     0        0        0        0        0        0
   0   0   rtsock     0     2        0        0        0      114      114
   0   0   arp        0     0       24        0        0        0       24
   0   0   ether      0     0 89370674        0        0        0 89370674
   0   0   ip6        0     0        0        0        0        0        0
   0   0   ipsec      0   256        0        0 616976559 221804972 221804972
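A quick way to watch just the ipsec workstream's drop counter is to filter the netstat -Q output. This one-liner is a sketch whose field index ($8 = QDrops) assumes the Workstreams column layout shown above; adjust it if your netstat output differs:

```shell
# Print the QDrops column (8th field) of the ipsec workstream row.
# Field positions assume the `netstat -Q` Workstreams layout in this review.
netstat -Q | awk '$3 == "ipsec" { print "ipsec QDrops:", $8 }'
```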

Then, following your advice, I increased net.inet.ipsec.intr_queue_maxlen to 8192, but performance is a lot worse:

x r314642 with D9721 and net.inet.ipsec.use_netisr=1 (packet-per-seconds)
+ r314642 with D9721 and net.inet.ipsec.use_netisr=1 and net.inet.ipsec.intr_queue_maxlen=8192 (packet-per-seconds)
+--------------------------------------------------------------------------+
|               +                                                     x    |
|         +     +  +                                                 xx * x|
|                                                                    |MA_| |
||______________M__________A________________________|                      |
+--------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x   5        416795      440933.5        423271      426451.8     9207.3367
+   5        132107        432864        163939      213309.4     123845.92
Difference at 95.0% confidence
        -213142 +/- 128072
        -49.9804% +/- 29.97%
        (Student's t, pooled s = 87814)

Statistics with this 8K queue:

[root@R2]~# netstat -Q
Configuration:
Setting                        Current        Limit
Thread count                         1            1
Default queue limit                256        10240
Dispatch policy                 direct          n/a
Threads bound to CPUs         disabled          n/a
 
Protocols:
Name   Proto QLimit Policy Dispatch Flags
ip         1    256   flow  default   ---
igmp       2    256 source  default   ---
rtsock     3   2048 source  default   ---
arp        4    256 source  default   ---
ether      5    256 source   direct   ---
ip6        6    256   flow  default   ---
ipsec     11   8192 source  default   ---
 
Workstreams:
WSID CPU   Name     Len WMark   Disp'd  HDisp'd   QDrops   Queued  Handled
   0   0   ip         0     0 24726004        0        0        0 24726004
   0   0   igmp       0     0        0        0        0        0        0
   0   0   rtsock     0     2        0        0        0      114      114
   0   0   arp        0     0        5        0        0        0        5
   0   0   ether      0     0 24726009        0        0        0 24726009
   0   0   ip6        0     0        0        0        0        0        0
   0   0   ipsec   8192  8192        0        0 87097837 11805690 11789306

I've never played with netisr before, so I'm not used to tuning it.

adrian added a subscriber: adrian.Mar 5 2017, 1:37 AM

Hiya,

Those drops are the biggest problem, and (without reading the patch too carefully) it may be that it's running only one netisr thread right now, rather than net.isr.maxthreads=<some number greater than 1>.

When you run "top -HS -s 1" during the test, is a whole CPU core being chewed up doing netisr work?

If so, try putting say, "net.isr.maxthreads=8" and reboot and re-test.
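For reference, a hedged sketch of the netisr tunables mentioned in this review (these are loader tunables, so they take effect at boot; the values are examples only, not recommendations):

```shell
# /boot/loader.conf -- example values only; re-run the benchmark after reboot.
net.isr.maxthreads=8     # run more than one netisr worker thread
net.isr.bindthreads=1    # bind each netisr thread to a CPU
```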

adrian added a comment.Mar 5 2017, 1:40 AM

oh hm, one ipsec flow? so ok. So, can you also dump out the NIC statistics so we can see if the NIC is seeing all the traffic on one RX ring, or whether it's being load balanced between multiple RX rings.

I.e., I would like to know whether the non-netisr IPsec method is simply getting parallelism from the RX path running on multiple NIC RX threads (and then serialising for some kind of data access, but parallelising for, say, crypto handling), while the simple netisr conversion is squeezing it all through one netisr thread on one CPU.

ae added a comment.Mar 5 2017, 4:22 AM

oh hm, one ipsec flow? so ok. So, can you also dump out the NIC statistics so we can see if the NIC is seeing all the traffic on one RX ring, or whether it's being load balanced between multiple RX rings.

Adrian, please note that this netisr queue is for outbound traffic. It queues packets that are going to be encrypted, and these packets are ordered by SPI value, so for one flow there is no parallelism. Also, there is no parallelism in the crypto(9) subsystem, so you won't get any benefit from many netisr threads; they will all be blocked on the crypto thread.

eri added a subscriber: eri.Mar 5 2017, 5:05 AM
In D9721#204349, @ae wrote:

oh hm, one ipsec flow? so ok. So, can you also dump out the NIC statistics so we can see if the NIC is seeing all the traffic on one RX ring, or whether it's being load balanced between multiple RX rings.

Adrian, please note that this netisr queue is for outbound traffic. It queues packets that are going to be encrypted, and these packets are ordered by SPI value, so for one flow there is no parallelism. Also, there is no parallelism in the crypto(9) subsystem, so you won't get any benefit from many netisr threads; they will all be blocked on the crypto thread.

Actually that is not true; it will run in parallel in the netisr threads.
Though two things are problematic:

  • the mbuf tag
  • bouncing between CPUs by keying the flow on the SPI; most probably you should tell netisr to dispatch on the same CPU, there is some way to do that IIRC
adrian added a comment.Mar 5 2017, 5:12 AM

Interesting. Ok. Well the existing ipsec code for transmit just runs to completion to ip output right?

So there's some natural transmit flow control that doesn't exist here. The sending side now just keeps submitting quickly to the transmit queue and quickly fails rather than taking time to process into the nic queue.

I'll have another look in a couple of days. Those drops mean a lot of mbuf allocator churn, right? We are allocating tags and an mbuf, then freeing them when the netisr queue fails. Can you verify that theory via vmstat?

I was sending 2000 UDP flows at gigabit line rate (1.48Mpps):

[root@R2]~# vmstat -i
interrupt                          total       rate
irq4: uart0                          525          3
irq16: atapci0                       140          1
irq17: uhci0 uhci2+                   33          0
cpu0:timer                         31124        169
cpu1:timer                         28883        157
cpu2:timer                         29053        158
cpu3:timer                         32384        176
irq256: bce0                         118          1
irq263: igb1:rxq0                      1          0
irq268: igb2:rxq0                 134280        729
irq269: igb2:rxq1                 136486        741
irq270: igb2:rxq2                 136184        739
irq271: igb2:rxq3                 140669        764
irq272: igb2:aq                        2          0
irq273: igb3:rxq0                      3          0
irq277: igb3:aq                        2          0
irq278: mpt0                        3645         20
Total                             673532       3656

[root@R2]~# top -HSb
last pid:  1162;  load averages:  1.84,  0.55,  0.20  up 0+00:03:29    07:33:41
109 processes: 8 running, 83 sleeping, 18 waiting

Mem: 10M Active, 688K Inact, 216M Wired, 15M Buf, 15G Free
Swap:


  PID USERNAME   PRI NICE   SIZE    RES STATE   C   TIME    WCPU COMMAND
    0 root       -76    -     0K   240K CPU1    1   0:36  76.17% kernel{softirq_1}
    0 root       -76    -     0K   240K -       3   0:36  76.07% kernel{softirq_3}
    0 root       -76    -     0K   240K CPU2    2   0:36  75.98% kernel{softirq_2}
    0 root       -76    -     0K   240K -       0   0:36  75.20% kernel{softirq_0}
   12 root       -72    -     0K   304K CPU3    3   0:26  55.37% intr{swi1: netisr 0}
   11 root       155 ki31     0K    64K RUN     1   3:09  14.79% idle{idle: cpu1}
   11 root       155 ki31     0K    64K RUN     2   2:44  14.79% idle{idle: cpu2}
   11 root       155 ki31     0K    64K RUN     3   3:05  14.60% idle{idle: cpu3}
   11 root       155 ki31     0K    64K RUN     0   3:08  13.77% idle{idle: cpu0}
    0 root       -16    -     0K   240K swapin  1   3:38   0.00% kernel{swapper}
    4 root       -16    -     0K    32K -       3   0:04   0.00% cam{scanner}
   12 root       -60    -     0K   304K WAIT    2   0:01   0.00% intr{swi4: clock (0)}
    0 root         8    -     0K   240K -       3   0:01   0.00% kernel{thread taskq}
   12 root       -88    -     0K   304K WAIT    2   0:01   0.00% intr{irq16: atapci0}
    4 root       -16    -     0K    32K -       0   0:00   0.00% cam{doneq0}
   14 root       -68    -     0K   576K -       1   0:00   0.00% usb{usbus3}
   14 root       -68    -     0K   576K -       1   0:00   0.00% usb{usbus0}
   14 root       -68    -     0K   576K -       0   0:00   0.00% usb{usbus2}

[root@R2]~# netstat -iw 1
            input        (Total)           output
   packets  errs idrops      bytes    packets  errs      bytes colls
   1489807     0     0  178775646     202564     0   18230912     0
   1487955     0     0  178554606     202680     0   18241284     0
   1488185     0     0  178582026     203322     0   18299154     0
   1488197     0     0  178583586     201830     0   18164694     0
   1488577     0     0  178629966     203828     0   18344604     0
   1487709     0     0  178524966     202929     0   18263694     0
   1488501     0     0  178619166     204997     0   18449814     0
   1487926     0     0  178551726     203724     0   18335244     0

All 4 input queues of igb2 are consuming all my 4 cores, preventing netisr from getting enough CPU time: it can output only about 200Kpps.
These 2000 unencrypted flows are multiplexed into one encrypted IPsec flow, so only one output queue of igb3 is used.
Note the netstat input statistics: they correctly display 1.48Mpps, but "errs" and "idrops" are empty.

Then I lowered the packet generator to 600Kpps, which should allow netisr to get more CPU time:

[root@R2]~# top -HSb
last pid:  1171;  load averages:  2.91,  2.12,  1.04  up 0+00:08:17    07:38:29
109 processes: 6 running, 85 sleeping, 18 waiting

Mem: 10M Active, 704K Inact, 216M Wired, 15M Buf, 15G Free
Swap:


  PID USERNAME   PRI NICE   SIZE    RES STATE   C   TIME    WCPU COMMAND
   12 root       -72    -     0K   304K CPU2    2   3:00  76.37% intr{swi1: netisr 0}
   11 root       155 ki31     0K    64K CPU0    0   4:45  57.18% idle{idle: cpu0}
   11 root       155 ki31     0K    64K RUN     2   4:23  56.40% idle{idle: cpu2}
   11 root       155 ki31     0K    64K RUN     3   4:40  55.27% idle{idle: cpu3}
   11 root       155 ki31     0K    64K CPU1    1   4:43  51.17% idle{idle: cpu1}
    0 root       -76    -     0K   240K -       1   3:10  30.66% kernel{softirq_1}
    0 root       -76    -     0K   240K -       3   3:09  30.57% kernel{softirq_3}
    0 root       -76    -     0K   240K -       2   3:10  30.47% kernel{softirq_2}
    0 root       -76    -     0K   240K -       0   3:08  30.27% kernel{softirq_0}
    0 root       -16    -     0K   240K swapin  0   3:38   0.00% kernel{swapper}
    4 root       -16    -     0K    32K -       3   0:04   0.00% cam{scanner}
   12 root       -60    -     0K   304K WAIT    0   0:01   0.00% intr{swi4: clock (0)}
    0 root         8    -     0K   240K -       2   0:01   0.00% kernel{thread taskq}
   12 root       -88    -     0K   304K WAIT    2   0:01   0.00% intr{irq16: atapci0}
    4 root       -16    -     0K    32K -       3   0:00   0.00% cam{doneq0}
   14 root       -68    -     0K   576K -       0   0:00   0.00% usb{usbus3}
   14 root       -68    -     0K   576K -       0   0:00   0.00% usb{usbus0}
   14 root       -68    -     0K   576K -       0   0:00   0.00% usb{usbus2}
	
[root@R2]~# netstat -iw 1
            input        (Total)           output
   packets  errs idrops      bytes    packets  errs      bytes colls
    616416     0     0   73969866     316102     0   28449312     0
    600070     0     0   72008346     308140     0   27732774     0
    599587     0     0   71950386     308187     0   27736824     0
    584671     0     0   70160466     299747     0   26977314     0
    600065     0     0   72007746     308091     0   27728274     0
    615425     0     0   73850946     316062     0   28445664     0
    600065     0     0   72007746     308358     0   27752304     0
    600065     0     0   72007746     308221     0   27739974     0

Now that netisr has more CPU time, output performance increases from 200Kpps to 300Kpps (FYI, it was about 550Kpps without netisr involved).

Then, following your advice, I bumped net.isr.maxthreads and bound the threads:

  • net.isr.bindthreads=1
  • net.isr.maxthreads=4 (because I've got only 4 cores here)
[root@R2]~# top -HSb
last pid:  1188;  load averages:  1.84,  0.68,  0.28  up 0+00:24:49    08:08:13
112 processes: 6 running, 85 sleeping, 21 waiting

Mem: 10M Active, 688K Inact, 216M Wired, 15M Buf, 15G Free
Swap:


  PID USERNAME   PRI NICE   SIZE    RES STATE   C   TIME    WCPU COMMAND
   12 root       -72    -     0K   352K CPU1    1   1:14  99.76% intr{swi1: netisr 1}
   11 root       155 ki31     0K    64K CPU0    0  24:47  77.10% idle{idle: cpu0}
   11 root       155 ki31     0K    64K CPU2    2  24:21  76.86% idle{idle: cpu2}
   11 root       155 ki31     0K    64K RUN     3  24:48  76.27% idle{idle: cpu3}
    0 root       -76    -     0K   240K -       2   0:22  27.98% kernel{softirq_2}
    0 root       -76    -     0K   240K -       0   0:22  27.88% kernel{softirq_0}
    0 root       -76    -     0K   240K -       3   0:22  27.59% kernel{softirq_3}
   11 root       155 ki31     0K    64K RUN     1  23:53   0.00% idle{idle: cpu1}
    0 root       -16    -     0K   240K swapin  0   3:39   0.00% kernel{swapper}
    0 root       -76    -     0K   240K -       1   0:04   0.00% kernel{softirq_1}
    4 root       -16    -     0K    32K -       2   0:04   0.00% cam{scanner}
    0 root         8    -     0K   240K -       3   0:01   0.00% kernel{thread taskq}
   14 root       -68    -     0K   576K -       0   0:01   0.00% usb{usbus3}
   12 root       -60    -     0K   352K WAIT    2   0:01   0.00% intr{swi4: clock (0)}
    4 root       -16    -     0K    32K -       0   0:00   0.00% cam{doneq0}
    6 root       -16    -     0K    16K idle    3   0:00   0.00% mpt_raid0
   14 root       -68    -     0K   576K -       3   0:00   0.00% usb{usbus1}
   14 root       -68    -     0K   576K -       0   0:00   0.00% usb{usbus4}
   
[root@R2]~# netstat -iw 1
            input        (Total)           output
   packets  errs idrops      bytes    packets  errs      bytes colls
    448772     0     0   53852586     444061     0   39965706     0
    449147     0     0   53897586     444188     0   39977004     0
    449146     0     0   53897466     444030     0   39962784     0
    449146     0     0   53897466     444192     0   39977364     0
    449148     0     0   53897706     444040     0   39963684     0
    449150     0     0   53897946     444049     0   39964584     0
    449140     0     0   53896746     444160     0   39974394     0
    449159     0     0   53899026     444154     0   39973944     0
    448772     0     0   53852586     444016     0   39961524     0
    449147     0     0   53897586     443997     0   39959814     0
    449146     0     0   53897466     444210     0   39978984     0
    449146     0     0   53897466     444093     0   39968454     0
    449148     0     0   53897706     444085     0   39967824     0
    449147     0     0   53897586     444145     0   39973044     0
    449112     0     0   53893506     444050     0   39964584     0
    449190     0     0   53902626     443928     0   39953604     0
    448772     0     0   53852586     444041     0   39963774     0

and performance increases from 290Kpps to about 440Kpps (still not the 550Kpps without netisr, but a lot better).

Notice the "netstat -iw 1" behavior change since net.isr.maxthreads=4: the packet generator still generates 600Kpps, but netstat shows only 449Kpps as input (!?)

But do we agree that on the amd64 (KSTACK_PAGES=4) architecture I don't need to enable netisr (net.inet.ipsec.use_netisr=1)?
Perhaps it would be better to find a 32-bit platform for benchmarking this patch?

About the vmstat output:

[root@R2]~# vmstat -z
ITEM                   SIZE  LIMIT     USED     FREE      REQ FAIL SLEEP

UMA Kegs:               384,      0,     206,       4,     206,   0,   0
UMA Zones:             1152,      0,     207,       0,     207,   0,   0
UMA Slabs:               80,      0,    6529,      21,    7099,   0,   0
UMA Hash:               256,      0,      10,       5,      15,   0,   0
4 Bucket:                32,      0,      89,    1036,    1169,   0,   0
6 Bucket:                48,      0,       9,    1070,     157,   0,   0
8 Bucket:                64,      0,      97,    1205,    8061,  11,   0
12 Bucket:               96,      0,      21,     512,   51790,   0,   0
16 Bucket:              128,      0,      83,     444,   32953,   0,   0
32 Bucket:              256,      0,      72,     303,    8760, 137,   0
64 Bucket:              512,      0,     111,     121,     440, 137,   0
128 Bucket:            1024,      0,      93,     119,     897,   0,   0
256 Bucket:            2048,      0,     214,     118, 1140785,  56,   0
vmem btag:               56,      0,    6887,     639,    6887,  54,   0
(...)
mbuf_packet:            256, 6505365,    2040,    1502,    2921,   0,   0
mbuf:                   256, 6505365,   13507,    5721,468037402,   0,   0
mbuf_cluster:          2048, 1016464,    8150,      98,    9175,   0,   0
mbuf_jumbo_page:       4096, 508231,       0,       3,       7,   0,   0
mbuf_jumbo_9k:         9216, 150587,       0,       0,       0,   0,   0
mbuf_jumbo_16k:       16384,  84705,       0,       0,       0,   0,   0
DMAR_MAP_ENTRY:         120,      0,       0,       0,       0,   0,   0
g_bio:                  376,      0,       0,     290,   20672,   0,   0
ttyinq:                 160,      0,     195,     130,     555,   0,   0
ttyoutq:                256,      0,     101,      94,     287,   0,   0
nvme_request:           128,      0,       0,       0,       0,   0,   0
cryptop:                 88,      0,       0,     135,147266110,   0,   0
cryptodesc:              72,      0,       1,     164,147266110,   0,   0
vtnet_tx_hdr:            24,      0,       0,       0,       0,   0,   0
ae added a comment.Mar 5 2017, 9:32 PM
In D9721#204350, @eri wrote:

Actually that is not true it will run in parallell in the netisr thread.

Yes, I was not fully correct. The crypto(9) worker thread does its work only when direct crypto dispatch fails.

Though two things that are problematic:

  • Mbuf tag

I don't see another way to pass the several needed arguments through netisr_queue().

  • bouncing CPU by keeping the flow by SPI, most probably you should tell netisr to launch on same cpu there is some way for that IIRC

As I said, the main goal of the patch is to reduce kernel stack usage. This is solved by queueing and deferred processing. With direct processing we would still use the same thread context, and thus the stack overflow would not be solved.

ae added a comment.Mar 5 2017, 9:41 PM

Interesting. Ok. Well the existing ipsec code for transmit just runs to completion to ip output right?

Yes.

So there's some natural transmit flow control that doesn't exist here. The sending side now just keeps submitting quickly to the transmit queue and quickly fails rather than taking time to process into the nic queue.

When ip_output() decides that a packet needs IPsec processing, it calls the IPSEC_OUTPUT method. When use_netisr is disabled, IPsec does the crypto processing and invokes ip_output() again. In case you have a bundled security policy, this happens several times, all in the same thread context. There is a comment that describes this:
https://svnweb.freebsd.org/base/head/sys/netipsec/ipsec_output.c?view=annotate#l756

Also note that the IPsec code can invoke pfil processing, which also has its own stack usage.

Ill have another look in a couple days. Those drops mean a lot of mbuf allocator churn right? We are allocating tags and an mbuf then freeing it upon netisr queue fail. Can you verify that theory via vmstat?

I don't think so; the drops occur when the netisr queue overflows, just after ip_output() decides that the packet needs IPsec processing.
The only allocation there is the mbuf tag allocation used to pass additional arguments to the deferred processing handler.

ae added a comment.Mar 5 2017, 9:51 PM

I think we have three options:

  • do not change anything and require kern.kstack_pages=4 for IPsec to work without panics;
  • apply this patch to work slowly, but without panics, when kern.kstack_pages is below 4;
  • rework the patch to use a different solution.
In D9721#204542, @ae wrote:

Interesting. Ok. Well the existing ipsec code for transmit just runs to completion to ip output right?

Yes.

So there's some natural transmit flow control that doesn't exist here. The sending side now just keeps submitting quickly to the transmit queue and quickly fails rather than taking time to process into the nic queue.

When ip_output decides that we need IPsec processing, it calls IPSEC_OUTPUT method. When use_netisr is disabled, IPsec does crypto processing and invokes ip_output() again. In case when you have bundled security policy, you does this several times. This is all called in the same thread context. There is comment that describes this:
https://svnweb.freebsd.org/base/head/sys/netipsec/ipsec_output.c?view=annotate#l756

Also note, that IPsec code can invoke pfil processing, that also have own stack usage.

Right.

So the latest experiments seem to confirm the theory - the softirq threads are getting more CPU time than the netisr threads. There's no multiplexing into multiple destination netisr threads either.

Ill have another look in a couple days. Those drops mean a lot of mbuf allocator churn right? We are allocating tags and an mbuf then freeing it upon netisr queue fail. Can you verify that theory via vmstat?

I don't think so; the drops occur when the netisr queue overflows, just after ip_output() decides that the packet needs IPsec processing.
The only allocation there is the mbuf tag allocation used to pass additional arguments to the deferred processing handler.

Well, it's allocating an mbuf, adding mtags, then freeing it if it fails to queue. A lot of our network stack does this because of, well, historical reasons.

Ok, so on the output path once the netisr path is called, how much of that work is done without a global lock (eg crypto?) I wonder how much of this change in throughput is because the netisr thread != the NIC input CPU path, and RX is saturating (more) cores. Versus, say, some parallel processing.

Because my guess is that the RX -> IP stuff -> IP TX stuff -> ipsec encrypt -> NIC if_transmit() path is happening in one complete thread, to completion each time, so there's backpressure (the NIC for example will drop frames early if the CPU is saturated for example, and/or send pause frames, etc.)

Can we say, run the netisr thread at a higher priority than the NIC softirq, instead of the other way around? That way netisr isn't preempted by the NIC RX softirq?

ae added a comment.Mar 6 2017, 1:06 AM

Well, it's allocating an mbuf, adding mtags, then freeing it if it fails to queue. A lot of our network stack does this because of well, history reasons.

Ok, so on the output path once the netisr path is called, how much of that work is done without a global lock (eg crypto?) I wonder how much of this change in throughput is because the netisr thread != the NIC input CPU path, and RX is saturating (more) cores. Versus, say, some parallel processing.

Because my guess is that the RX -> IP stuff -> IP TX stuff -> ipsec encrypt -> NIC if_transmit() path is happening in one complete thread, to completion each time, so there's backpressure (the NIC for example will drop frames early if the CPU is saturated for example, and/or send pause frames, etc.)

After the IPsec encryption there is yet another "IP TX stuff" step and then if_transmit :)
You can easily see the same behavior with dropped inbound frames with pfil.
When a firewall has too many unoptimized rules and takes too much CPU time, such direct run-to-completion processing will produce drops on RX. And Olivier has graphs that show this.

Can we say, run the netisr thread at a higher priority than the NIC softirq, instead of the other way around? That way netisr isn't preempted by the NIC RX softirq?

I'm not sure; we usually use direct dispatch, and it seems this can only affect the hybrid mode.

adrian added a comment.Mar 6 2017, 1:15 AM

Yeah, but here's the rub:

  • if you do 100% direct dispatch, the NIC RX is (somewhat) paced by the drain rate of the rest of the stack handling. Now, some NIC drivers will do batched RX, but most will just call if_input() on each packet without any refill or queuing
  • then there's no netisr requeuing going on; 99% of the time it'll just be direct dispatched up
  • so if the rest of the stack handling is taking up a lot of CPU time or latency (e.g. lock contention, context switching, etc.), it'll end up causing the NIC to run out of RX slots, and that'll cause backpressure.

If that's going on then hopefully (!) we can see it with, say, an increase in tx'ed pause frames on the RX side, and/or if it counts "packets received with no descriptors", that'll help. Olivier, can you do a sysctl dump before/after these tests so we can see?

But on the netisr side, you're immediately finishing the enqueue and letting the NIC driver continue to run RX, so it has more opportunity to just suck in more packets that'll get discarded by the netisr side of things. There's no backpressure telling the RX side "hey, stop trying to RX stuff into this queue, I'm just having to toss it in later processing." The TX side has somewhat the same problem too.

Olivier - interestingly in your test, netisr chews a whole core and doesn't allow any other NIC processing to go on on that CPU. Hm! That may be why the input pps is lower - we're dropping 1/4 of the flows really early. Can you post the igb sysctl output? That is from the hardware.

> Olivier - interestingly in your test, netisr chews a whole core and doesn't allow any other NIC processing to go on on that CPU. Hm! That may be why the input pps is lower - we're dropping 1/4 of the flows really early. Can you post the igb sysctl output? That is from the hardware.

The problem is the igb sysctl counters seem useless (I don't see any rx "drop" counters); here is the output during a bench (with 4 netisr threads, pkt-gen generating 600Kpps):

[root@R2]~# netstat -iw 1
            input        (Total)           output
   packets  errs idrops      bytes    packets  errs      bytes colls
    450735     0     0   54088146     442452     0   39820812     0
    449970     0     0   53996346     441985     0   39778734     0
    450342     0     0   54040986     442182     0   39796464     0
    450351     0     0   54042066     442190     0   39797184     0
    450347     0     0   54041586     442021     0   39781974     0
    450350     0     0   54041946     442082     0   39787464     0
    450349     0     0   54041826     442024     0   39782334     0
^C
[root@R2]~# echo "Only 450Kpps received, where are the missing 150Kpps (pkt-gen generate 600Kpps) ?"
Only 450Kpps received, where are the missing 150Kpps (pkt-gen generate 600Kpps) ?
[root@R2]~# sysctl dev.igb.2.
dev.igb.2.interrupts.rx_overrun: 0
dev.igb.2.interrupts.rx_desc_min_thresh: 0
dev.igb.2.interrupts.tx_queue_min_thresh: 0
dev.igb.2.interrupts.tx_queue_empty: 0
dev.igb.2.interrupts.tx_abs_timer: 0
dev.igb.2.interrupts.tx_pkt_timer: 0
dev.igb.2.interrupts.rx_abs_timer: 0
dev.igb.2.interrupts.rx_pkt_timer: 53504000
dev.igb.2.interrupts.asserts: 1225972
dev.igb.2.mac_stats.tso_ctx_fail: 0
dev.igb.2.mac_stats.tso_txd: 0
dev.igb.2.mac_stats.tx_frames_1024_1522: 0
dev.igb.2.mac_stats.tx_frames_512_1023: 0
dev.igb.2.mac_stats.tx_frames_256_511: 0
dev.igb.2.mac_stats.tx_frames_128_255: 0
dev.igb.2.mac_stats.tx_frames_65_127: 0
dev.igb.2.mac_stats.tx_frames_64: 0
dev.igb.2.mac_stats.mcast_pkts_txd: 0
dev.igb.2.mac_stats.bcast_pkts_txd: 0
dev.igb.2.mac_stats.good_pkts_txd: 0
dev.igb.2.mac_stats.total_pkts_txd: 0
dev.igb.2.mac_stats.good_octets_txd: 0
dev.igb.2.mac_stats.good_octets_recvd: 3424256000
dev.igb.2.mac_stats.rx_frames_1024_1522: 0
dev.igb.2.mac_stats.rx_frames_512_1023: 0
dev.igb.2.mac_stats.rx_frames_256_511: 0
dev.igb.2.mac_stats.rx_frames_128_255: 0
dev.igb.2.mac_stats.rx_frames_65_127: 0
dev.igb.2.mac_stats.rx_frames_64: 53504000
dev.igb.2.mac_stats.mcast_pkts_recvd: 0
dev.igb.2.mac_stats.bcast_pkts_recvd: 0
dev.igb.2.mac_stats.good_pkts_recvd: 53504000
dev.igb.2.mac_stats.total_pkts_recvd: 53504000
dev.igb.2.mac_stats.xoff_txd: 0
dev.igb.2.mac_stats.xoff_recvd: 0
dev.igb.2.mac_stats.xon_txd: 0
dev.igb.2.mac_stats.xon_recvd: 0
dev.igb.2.mac_stats.coll_ext_errs: 0
dev.igb.2.mac_stats.alignment_errs: 0
dev.igb.2.mac_stats.crc_errs: 0
dev.igb.2.mac_stats.recv_errs: 0
dev.igb.2.mac_stats.recv_jabber: 0
dev.igb.2.mac_stats.recv_oversize: 0
dev.igb.2.mac_stats.recv_fragmented: 0
dev.igb.2.mac_stats.recv_undersize: 0
dev.igb.2.mac_stats.recv_no_buff: 0
dev.igb.2.mac_stats.missed_packets: 0
dev.igb.2.mac_stats.defer_count: 0
dev.igb.2.mac_stats.sequence_errors: 0
dev.igb.2.mac_stats.symbol_errors: 0
dev.igb.2.mac_stats.collision_count: 0
dev.igb.2.mac_stats.late_coll: 0
dev.igb.2.mac_stats.multiple_coll: 0
dev.igb.2.mac_stats.single_coll: 0
dev.igb.2.mac_stats.excess_coll: 0
dev.igb.2.queue_rx_3.rx_irq: 0
dev.igb.2.queue_rx_3.rxd_tail: 951
dev.igb.2.queue_rx_3.rxd_head: 953
dev.igb.2.queue_rx_2.rx_irq: 0
dev.igb.2.queue_rx_2.rxd_tail: 325
dev.igb.2.queue_rx_2.rxd_head: 327
dev.igb.2.queue_rx_1.rx_irq: 0
dev.igb.2.queue_rx_0.rx_irq: 0
dev.igb.2.queue_rx_0.rxd_tail: 324
dev.igb.2.queue_rx_0.rxd_head: 326
dev.igb.2.queue_tx_3.no_desc_avail: 0
dev.igb.2.queue_tx_3.tx_irq: 0
dev.igb.2.queue_tx_3.txd_tail: 0
dev.igb.2.queue_tx_3.txd_head: 0
dev.igb.2.queue_tx_2.no_desc_avail: 0
dev.igb.2.queue_tx_2.tx_irq: 0
dev.igb.2.queue_tx_2.txd_tail: 0
dev.igb.2.queue_tx_2.txd_head: 0
dev.igb.2.queue_tx_1.no_desc_avail: 0
dev.igb.2.queue_tx_1.tx_irq: 0
dev.igb.2.queue_tx_1.txd_tail: 0
dev.igb.2.queue_tx_1.txd_head: 0
dev.igb.2.queue_tx_0.no_desc_avail: 0
dev.igb.2.queue_tx_0.tx_irq: 0
dev.igb.2.queue_tx_0.txd_tail: 0
dev.igb.2.queue_tx_0.txd_head: 0
dev.igb.2.fc_low_water: 34800
dev.igb.2.fc_high_water: 34816
dev.igb.2.rx_control: 67141662
dev.igb.2.device_control: 13632065
dev.igb.2.watchdog_timeouts: 0
dev.igb.2.rx_overruns: 0
dev.igb.2.tx_dma_fail: 0
dev.igb.2.mbuf_defrag_fail: 0
dev.igb.2.link_irq: 3
dev.igb.2.dropped: 0
dev.igb.2.eee_control: 1
dev.igb.2.rx_processing_limit: -1
dev.igb.2.itr: 488
dev.igb.2.tx_abs_int_delay: 66
dev.igb.2.rx_abs_int_delay: 66
dev.igb.2.tx_int_delay: 66
dev.igb.2.rx_int_delay: 0
dev.igb.2.reg_dump: General Registers
        CTRL     00d00241
        STATUS   0008078b
        CTRL_EXT         00100c80

Interrupt Registers
        ICR      00000000

RX Registers
        RCTL     0400801e
        RDLEN    00004000
        RDH      000001c7
        RDT      000001c5
        RXDCTL   02040808
        RDBAL    7d3b4000
        RDBAH    00000000

TX Registers
        TCTL     b50400fa
        TDBAL    7d3a4000
        TDBAH    00000000
        TDLEN    00004000
        TDH      00000000
        TDT      00000000
        TXDCTL   0341011f
        TDFH     00000001
        TDFT     00000000
        TDFHS    00000000
        TDFPC    00000000
dev.igb.2.fc: 0
dev.igb.2.debug: -1
dev.igb.2.nvm: -1
dev.igb.2.iflib.rxq3.rxq_fl0.credits: 1023
dev.igb.2.iflib.rxq3.rxq_fl0.cidx: 56
dev.igb.2.iflib.rxq3.rxq_fl0.pidx: 55
dev.igb.2.iflib.rxq2.rxq_fl0.credits: 1023
dev.igb.2.iflib.rxq2.rxq_fl0.cidx: 456
dev.igb.2.iflib.rxq2.rxq_fl0.pidx: 455
dev.igb.2.iflib.rxq1.rxq_fl0.credits: 1023
dev.igb.2.iflib.rxq1.rxq_fl0.cidx: 100
dev.igb.2.iflib.rxq1.rxq_fl0.pidx: 99
dev.igb.2.iflib.rxq0.rxq_fl0.credits: 1023
dev.igb.2.iflib.rxq0.rxq_fl0.cidx: 455
dev.igb.2.iflib.rxq0.rxq_fl0.pidx: 454
(...txq stats skipped...)
dev.igb.2.iflib.override_nrxds: 0
dev.igb.2.iflib.override_ntxds: 0
dev.igb.2.iflib.override_qs_enable: 0
dev.igb.2.iflib.override_nrxqs: 0
dev.igb.2.iflib.override_ntxqs: 0
dev.igb.2.iflib.driver_version: 7.6.1-k
(...)

Can you compare the sender counters to the receiver counters? ie, is the sender actually queuing and TX'ing 600kpps, or is it something stupid like pause frames causing it to be dropped on output?

> Can you compare the sender counters to the receiver counters? ie, is the sender actually queuing and TX'ing 600kpps, or is it something stupid like pause frames causing it to be dropped on output?

Flow control (pause frames) is totally disabled (rx and tx):

dev.igb.2.fc: 0

And the switch port statistics confirm that:

  • Ethernet flow control correctly disabled
  • it sends about 600Kpps to the server
BSDRP-bench#sh int Gi1/0/15
GigabitEthernet1/0/15 is up, line protocol is up (connected)
  Hardware is Gigabit Ethernet, address is b0fa.eb14.d20f (bia b0fa.eb14.d20f)
  Description: BSDRP2-igb2
  MTU 1500 bytes, BW 1000000 Kbit/sec, DLY 10 usec,
     reliability 255/255, txload 78/255, rxload 1/255
  Encapsulation ARPA, loopback not set
  Keepalive not set
  Full-duplex, 1000Mb/s, media type is 10/100/1000BaseTX
  input flow-control is off, output flow-control is unsupported
  ARP type: ARPA, ARP Timeout 04:00:00
  Last input never, output 18w5d, output hang never
  Last clearing of "show interface" counters 00:00:25
  Input queue: 0/75/0/0 (size/max/drops/flushes); Total output drops: 0
  Queueing strategy: fifo
  Output queue: 0/40 (size/max)
  30 second input rate 0 bits/sec, 0 packets/sec
  30 second output rate 307182000 bits/sec, 599967 packets/sec
     0 packets input, 0 bytes, 0 no buffer
     Received 0 broadcasts (0 multicasts)
     0 runts, 0 giants, 0 throttles
     0 input errors, 0 CRC, 0 frame, 0 overrun, 0 ignored
     0 watchdog, 0 multicast, 0 pause input
     0 input packets with dribble condition detected
     14993035 packets output, 959552000 bytes, 0 underruns
     0 output errors, 0 collisions, 0 interface resets
     0 unknown protocol drops
     0 babbles, 0 late collision, 0 deferred
     0 lost carrier, 0 no carrier, 0 pause output
     0 output buffer failures, 0 output buffers swapped out

But I will create a different PR for this problem: I don't want to pollute this review ;-)

adrian added a comment.Mar 9 2017, 6:03 PM

> Can you compare the sender counters to the receiver counters? ie, is the sender actually queuing and TX'ing 600kpps, or is it something stupid like pause frames causing it to be dropped on output?
>
> Flow control (pause frames) is totally disabled (rx and tx):
>
> dev.igb.2.fc: 0
>
> And the switch port statistics confirm that:
>
>   • Ethernet flow control correctly disabled
>   • it sends about 600Kpps to the server

ok. Yes, create a different PR to figure out whether there are any missing counters in igb / iflib+igb. That'll make diagnosing/fixing issues with the netisr work easier.

As for the netisr work and the performance delta - I think it's okay to commit as-is for now, as this is a more generic netisr problem, not specific to how ipsec is doing it. The fact we don't have a nice way to signal backpressure up the stack (because, as I said, lots of things expect to just see packets drop at the queue point rather than signal up to the original producer that there's an issue) is a general problem, and we're not going to solve it here.

adrian accepted this revision.Mar 19 2017, 2:49 AM

btw - if it doesn't introduce a perf regression when disabled on 32-bit platforms, I'm ok with this landing if it hasn't yet.

This revision is now accepted and ready to land.Mar 19 2017, 2:49 AM