
pfsync: Performance improvement
ClosedPublic

Authored by kristof on Wed, Nov 28, 10:36 PM.

Details

Summary

The pfsync code is called for every state creation, state update and state
deletion in pf. While pf itself can operate on multiple states at the
same time (on different cores, assuming the states hash to different
hashrows), pfsync had only a single lock.
This greatly reduced throughput on multicore systems.

Address this by splitting the pfsync queues into buckets, based on the
state id. This ensures that updates for a given connection always end up
in the same bucket, which allows pfsync to still collapse multiple
updates into one, while allowing multiple cores to proceed at the same
time.

The number of buckets is tunable, but defaults to 2 x the number of CPUs.
Initial benchmarking shows a roughly 2x improvement over the previous
pfsync code.

Diff Detail

Repository
rS FreeBSD src repository
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

kristof created this revision. Wed, Nov 28, 10:36 PM
kristof set the repository for this revision to rS FreeBSD src repository.

This is a slightly rough first version. I'm sure there are some remaining style issues (and there might be cleanup issues too).

mjg added a comment (edited). Wed, Nov 28, 10:59 PM

Once more I don't have the full picture, so I can't give a proper review.

Do I read it right that it creates a dedicated swi for each bucket? That's probably overkill, and it may cause anomalies. I strongly suspect you would be fine with just one, gathering stuff from the buckets. The general scheme is that you lock the target bucket and transfer its entries to a new list; there are _CONCAT macros which can do that. Then you do the send without holding any of the bucket locks.

The general direction is definitely sound.

In D18373#390595, @mjg wrote:

Once more I don't have the full picture, so I can't give a proper review.

Well, the remarks you've made so far have been pretty helpful.

Do I read it right that it creates a dedicated swi for each bucket? That's probably overkill, and it may cause anomalies.

What sort of anomalies?

I strongly suspect you would be fine with just one, gathering stuff from the buckets. The general scheme is that you lock the target bucket and transfer its entries to a new list; there are _CONCAT macros which can do that. Then you do the send without holding any of the bucket locks.

I can certainly give that a try.

kristof updated this revision to Diff 51368. Thu, Nov 29, 9:14 PM

Single pfsyncintr swi, rather than one per bucket.

kristof updated this revision to Diff 51500. Sun, Dec 2, 5:19 PM

Minor improvements, style and such.
Small man page addition.

Absent any remarks or objections I intend to commit this in a couple of days.

bcr added a subscriber: bcr. Mon, Dec 3, 7:56 AM

Can you run "mandoc -Tlint" and textproc/igor on your manpage change? It'll probably tell you a few things (no, not next week's lottery numbers).

I look forward to these improvements. Thanks.

share/man/man4/pfsync.4
136 ↗(On Diff #51500)

You need to break lines after each sentence stop.

kristof updated this revision to Diff 51519. Mon, Dec 3, 8:36 AM

Address man page remark.

kristof marked an inline comment as done. Mon, Dec 3, 8:36 AM
bcr accepted this revision. Mon, Dec 3, 8:54 AM

OK from manpages. Don't forget to bump the .Dd at the beginning of the man page to the date of the commit as this is a content change.

This revision is now accepted and ready to land. Mon, Dec 3, 8:54 AM
kristof updated this revision to Diff 51525. Mon, Dec 3, 11:02 AM

Fix build with !VIMAGE.

This revision now requires review to proceed. Mon, Dec 3, 11:02 AM
olivier added a subscriber: olivier. Mon, Dec 3, 4:25 PM

Benchmark results using Denial-of-Service mode (line rate at maximum pps) on small hardware with pf+pfsync (using the loopback interface):
PC Engines APU 2 (4-core AMD GX, 1 GHz):

x FreeBSD 13 r341401: inet4 packets-per-second
+ FreeBSD 13 r341401 with D18373v51500: inet4 packets-per-second
+--------------------------------------------------------------------------+
|                                                                +         |
|xxxxx                                                          ++ +      +|
||_A_|                                                                     |
|                                                              |_M_A___|   |
+--------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x   5        269035        272858        271286      271098.8     1664.0444
+   5        333694        343492        333841      336323.6     4212.6452
Difference at 95.0% confidence
        65224.8 +/- 4671.05
        24.0594% +/- 1.78459%
        (Student's t, pooled s = 3202.77)

+24% improvement here.

On a 4-core Atom:

x FreeBSD 13 r341401: inet4 packets-per-second
+ FreeBSD 13 r341401 with D18373v51500: inet4 packets-per-second
+--------------------------------------------------------------------------+
|xx                                                                     +  |
|xx                                                             +      +++ |
||A                                                                        |
|                                                                  |__A_M_||
+--------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x   5        312281      314103.5        313678      313292.4     782.45155
+   5        402306        415098        413302      411443.9     5175.6884
Difference at 95.0% confidence
        98151.5 +/- 5398.21
        31.329% +/- 1.73695%
        (Student's t, pooled s = 3701.35)

+31% improvement here.

mjg added a comment. Wed, Dec 5, 10:46 AM

Can you grab a flamegraph from such a test? Also, can you compare this against https://reviews.freebsd.org/D17992?

There are probably some percentage points to be had in a fairly easy manner.

eri added a subscriber: eri. Wed, Dec 5, 5:03 PM

Can you please measure the latency of syncing states with this change, compared to the previous latency?

eri added a comment. Wed, Dec 5, 5:08 PM

Not a blocker but:

  • It would also be nice to measure whether the other side can keep up with the blast of state updates now.
  • Even better, provide the same bucket mechanism on reception, so it can be distributed across the various cores.
sys/netpfil/pf/if_pfsync.c
1352 ↗(On Diff #51525)

Does this respect the ifqmaxlen that is set when the interface is created?

1434 ↗(On Diff #51525)

Same here, is the ifqmaxlen being honored?

In D18373#392610, @eri wrote:

Can you please measure the latency of syncing states with this change, compared to the previous latency?

I'm not sure how best to measure this, but there's no real reason for significant additional latency. We still have the same one-second timeout on updates: if we don't fill a packet within one second, whatever updates we do have are sent out anyway. That timeout is now per-bucket, so it's possible (mostly under low load) that more updates will be caught by that delay than before, but that's the only difference I can think of.
Under high load there shouldn't be a measurable difference.

kristof marked 3 inline comments as done. Wed, Dec 5, 7:33 PM
In D18373#392613, @eri wrote:

Not a blocker but:

  • It would also be nice to measure whether the other side can keep up with the blast of state updates now.
  • Even better, provide the same bucket mechanism on reception, so it can be distributed across the various cores.

You can switch to defer mode, in which case data-path packets (those that would create new state) are held back until the peer pfsync confirms it has received the state. I'm not sure that's a good idea, though. Certainly not for performance.

As for buckets on reception, that's already the case. On pfsync_input() we never take the pfsync lock, only the pf rules read lock and the appropriate PF_HASHROW_LOCK().

sys/netpfil/pf/if_pfsync.c
1352 ↗(On Diff #51525)

Yes, in pfsync_sendout() there's a check (_IF_QFULL). There's no change here, other than that the queue is now per-bucket (but so is the maxlen).

This revision was not accepted when it landed; it landed in state Needs Review. Thu, Dec 6, 7:27 PM
Closed by commit rS341646: pfsync: Performance improvement (authored by kp, committed by ).
This revision was automatically updated to reflect the committed changes.