
pfsync: Performance improvement
Closed, Public

Authored by kp on Nov 28 2018, 10:36 PM.

Details

Summary

The pfsync code is called for every new state, state update and state
deletion in pf. While pf itself can operate on multiple states at the
same time (on different cores, assuming the states hash to different
hash rows), pfsync had only a single lock.
This greatly reduced throughput on multicore systems.

Address this by splitting the pfsync queues into buckets, based on the
state id. This ensures that updates for a given connection always end up
in the same bucket, which allows pfsync to still collapse multiple
updates into one, while allowing multiple cores to proceed at the same
time.

The number of buckets is tunable, but defaults to 2 x the number of CPUs.
Initial benchmarking shows roughly a 2x improvement over the previous
pfsync code.
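In rough terms the mapping looks like the sketch below (a minimal sketch; the structure and field names are illustrative assumptions, not necessarily those used in the actual diff):

/*
 * Hypothetical sketch of the bucket selection: hashing on the 64-bit
 * state id means every update for a given connection lands in the same
 * bucket, so consecutive updates can still be collapsed into one
 * message, while states with different ids can be handled in parallel.
 */
static struct pfsync_bucket *
pfsync_bucket_for_state(struct pfsync_softc *sc, struct pf_state *st)
{
        /* sc_buckets_cnt would default to 2 * mp_ncpus but be tunable. */
        return (&sc->sc_buckets[st->id % sc->sc_buckets_cnt]);
}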

Diff Detail

Repository: rS FreeBSD src repository - subversion
Lint: Skipped
Unit Tests: Skipped
Build Status: Buildable 21268

Event Timeline

This is a slightly rough first version. I'm sure there are some remaining style issues (and there might be cleanup issues too).

once more i don't have a full picture so can't give a proper review.

do i read it right it creates a dedicated swi for each bucket? that's probably an overkill and it may cause anomalies. i strongly suspect you would be fine with just one gathering stuff from buckets. general scheme is that you lock the target bucket and transfer entries to a new list. there are _CONCAT macros which can do it. Then you would do the send without having any of the buckets locked.

the general direction is definitely sound

In D18373#390595, @mjg wrote:

once more i don't have a full picture so can't give a proper review.

Well, the remarks you've had have been pretty helpful so far.

do i read it right it creates a dedicated swi for each bucket? that's probably an overkill and it may cause anomalies.

What sort of anomalies?

i strongly suspect you would be fine with just one gathering stuff from buckets. general scheme is that you lock the target bucket and transfer entries to a new list. there are _CONCAT macros which can do it. Then you would do the send without having any of the buckets locked.

I can certainly give that a try.
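For what it's worth, a minimal sketch of the pattern being suggested (lock and field names are assumptions for illustration): a single handler detaches each bucket's pending packets under that bucket's lock and only then transmits them with no bucket lock held. A TAILQ-based queue could be handed over the same way with TAILQ_CONCAT.

static void
pfsyncintr(void *arg)
{
        struct pfsync_softc *sc = arg;
        struct mbuf *m, *n;
        int c;

        for (c = 0; c < sc->sc_buckets_cnt; c++) {
                struct pfsync_bucket *b = &sc->sc_buckets[c];

                /* Hand the whole queued chain over in one step. */
                PFSYNC_BUCKET_LOCK(b);
                _IF_DEQUEUE_ALL(&b->b_snd, m);
                PFSYNC_BUCKET_UNLOCK(b);

                /* Transmit without holding any bucket lock. */
                for (; m != NULL; m = n) {
                        n = m->m_nextpkt;
                        m->m_nextpkt = NULL;
                        ip_output(m, NULL, NULL, IP_RAWOUTPUT,
                            &sc->sc_imo, NULL);
                }
        }
}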

Single pfsyncintr swi, rather than one per bucket.

Minor improvements, style and such.
Small man page addition.

Absent any remarks or objections I intend to commit this in a couple of days.

Can you run "mandoc -Tlint" and textproc/igor on your manpage change? It'll probably tell you a few things (no, not next week's lottery numbers).

I look forward to these improvements. Thanks.

share/man/man4/pfsync.4
136 (On Diff #51500)

You need to break lines after each sentence stop.

Address man page remark.

kp marked an inline comment as done. Dec 3 2018, 8:36 AM

OK from manpages. Don't forget to bump the .Dd at the beginning of the man page to the date of the commit as this is a content change.

This revision is now accepted and ready to land. Dec 3 2018, 8:54 AM

Fix build with ! VIMAGE

This revision now requires review to proceed. Dec 3 2018, 11:02 AM

Benchmark results using denial-of-service mode (line rate at maximum pps) on small hardware with pf+pfsync (using the loopback interface):
PC Engines APU2 (4-core AMD GX, 1 GHz):

x FreeBSD 13 r341401: inet4 packets-per-second
+ FreeBSD 13 r341401 with D18373v51500: inet4 packets-per-second
+--------------------------------------------------------------------------+
|                                                                +         |
|xxxxx                                                          ++ +      +|
||_A_|                                                                     |
|                                                              |_M_A___|   |
+--------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x   5        269035        272858        271286      271098.8     1664.0444
+   5        333694        343492        333841      336323.6     4212.6452
Difference at 95.0% confidence
        65224.8 +/- 4671.05
        24.0594% +/- 1.78459%
        (Student's t, pooled s = 3202.77)

+24% improvement here.

On a 4-core Atom:

x FreeBSD 13 r341401: inet4 packets-per-second
+ FreeBSD 13 r341401 with D18373v51500: inet4 packets-per-second
+--------------------------------------------------------------------------+
|xx                                                                     +  |
|xx                                                             +      +++ |
||A                                                                        |
|                                                                  |__A_M_||
+--------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x   5        312281      314103.5        313678      313292.4     782.45155
+   5        402306        415098        413302      411443.9     5175.6884
Difference at 95.0% confidence
        98151.5 +/- 5398.21
        31.329% +/- 1.73695%
        (Student's t, pooled s = 3701.35)

+31% improvement here.

can you grab a flamegraph from such a test? also, can you compare this against https://reviews.freebsd.org/D17992 ?

there are probably some % to get in a decently easy manner

Can you please measure the latency of syncing states with this change against the previous latency?

Not a blocker but:

  • It would also be nice to measure whether the other side can keep up with the blast of state updates now.
  • Even better, provide the same bucket mechanism on reception so it can be distributed across the various cores.
sys/netpfil/pf/if_pfsync.c
1352

Is this respecting the ifqmaxlen that is set when the interface is created at all?

1434

Same here, is the ifqmaxlen being honored?

In D18373#392610, @eri wrote:

Can you please measure the latency of syncing states with this change against the previous latency?

I'm not sure how best to measure this, but there's not really any reason for there to be significant additional latency. We still have the same 1-second timeout on updates: if we don't fill a packet within 1 second, whatever updates we do have will then get sent out. This is now per-bucket, so it's possible (mostly in low-load situations) that more updates will fall into that delay than before, but that's the only difference I can think of.
In high-load situations there shouldn't be a measurable difference.
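For illustration, the flush timeout in question is conceptually a per-bucket one-second callout; a sketch under assumed names (the bucket structure, lock-assert macro and send helper are hypothetical):

static void
pfsync_bucket_timeout(void *arg)
{
        struct pfsync_bucket *b = arg;

        /* Whatever updates this bucket has accumulated go out now,
         * even if the packet is not full yet. */
        pfsync_sendout_bucket(b);
}

static void
pfsync_bucket_schedule_flush(struct pfsync_bucket *b)
{
        PFSYNC_BUCKET_LOCK_ASSERT(b);

        /* Arm the bucket's flush timer if it is not already running;
         * hz ticks corresponds to the one-second deadline. */
        if (!callout_pending(&b->b_tmo))
                callout_reset(&b->b_tmo, hz, pfsync_bucket_timeout, b);
}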

kp marked 3 inline comments as done. Dec 5 2018, 7:33 PM
In D18373#392613, @eri wrote:

Not a blocker but:

  • It would also be nice to measure whether the other side can keep up with the blast of state updates now.
  • Even better, provide the same bucket mechanism on reception so it can be distributed across the various cores.

You can switch to defer mode, in which case data-path packets (that would create new state) will be held back until the peer pfsync confirms it has received the state. I'm not sure that's a good idea though. Certainly not for performance.
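Conceptually, defer mode parks the state-creating packet until the peer acknowledges the state or a short timeout fires; a sketch of the bookkeeping involved (the field names are illustrative, not a claim about the actual structure):

/*
 * One pending deferral: the data packet is held back and re-injected
 * into the stack once the peer acks the state or the timer expires.
 */
struct pfsync_deferral {
        struct pf_state         *pd_st;   /* the freshly created state */
        struct mbuf             *pd_m;    /* the held-back data packet */
        struct callout           pd_tmo;  /* give-up timer */
        TAILQ_ENTRY(pfsync_deferral) pd_entry;
};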

As for buckets on reception, that's already the case. In pfsync_input() we never take the pfsync lock, only the pf rules read lock and the appropriate PF_HASHROW_LOCK().
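A sketch of that receive-side locking pattern, using the existing pf primitives (the surrounding code is elided; the fragment is illustrative, not the actual diff):

        PF_RULES_RLOCK_TRACKER;
        struct pf_idhash *ih;

        PF_RULES_RLOCK();
        ih = &V_pf_idhash[PF_IDHASH(st)];
        PF_HASHROW_LOCK(ih);
        /* ... apply the update carried in the pfsync message to st ... */
        PF_HASHROW_UNLOCK(ih);
        PF_RULES_RUNLOCK();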

sys/netpfil/pf/if_pfsync.c
1352

Yes, in pfsync_sendout() there's a check (_IF_QFULL). There's no change here, other than that the queue is now per-bucket (but so is the maxlen).
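Illustratively, the check amounts to the stock ifqueue limit test applied to the bucket's send queue before the assembled packet is enqueued (the bucket field names are assumptions):

        if (_IF_QFULL(&b->b_snd)) {
                /* Queue already holds ifqmaxlen packets: drop, as before. */
                if_inc_counter(sc->sc_ifp, IFCOUNTER_OQDROPS, 1);
                m_freem(m);
                return;
        }
        _IF_ENQUEUE(&b->b_snd, m);
        /* ... then schedule the software interrupt to transmit ... */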

This revision was not accepted when it landed; it landed in state Needs Review. Dec 6 2018, 7:27 PM
Closed by commit rS341646: pfsync: Performance improvement (authored by kp).
This revision was automatically updated to reflect the committed changes.