pfsync: Reduce contention on PFSYNC_LOCK()
AbandonedPublic
Actions

Authored by kp on Nov 14 2018, 9:25 PM.

Details

Reviewers

None

Group Reviewers

Summary

pfsync code is called for every new state, state update and state
deletion in pf. While pf itself can operate on multiple states at the
same time (on different cores, assuming the states hash to a different
hashrow), pfsync only had a single lock.
This greatly reduced throughput on multicore systems.

Address this by enqueuing updates, rather than executing them
immediately. Using the lockless ck_ring messages can be created and
enqueued without blocking other threads.
These are then handled in a software interrupt, performing the same
operations as before.

States are references as they're enqueued, so they cannot be cleared
until they've been processed by pfsync.

If pfsync cannot allocate or enqueue a new message it will tell pf to
drop the packet, ensuring that state updates are not lost.

The ring size can be tuned through 'net.pfsync.pfsync_max_msgs'. Note
that the value will always be rounded to a power of two.

Diff Detail

Repository

rS FreeBSD src repository - subversion

Lint

Lint Skipped

Unit

Tests Skipped

Build Status

Buildable 20814

Event Timeline

kp created this revision.Nov 14 2018, 9:25 PM

Herald added subscribers: farrokhi, ae, imp. · View Herald TranscriptNov 14 2018, 9:25 PM

Harbormaster completed remote builds in B20814: Diff 50430.Nov 14 2018, 9:25 PM

I think the approach taken here is iffy. Basic problem with this is that even if there is no lock contention anymore, you are still suffering from bouncing cache lines. Also swi_sched probably does not appreciate being called very often.

Since it does not look like the order of messages is of any significance and you are sending them out periodically, it will probably be much faster to have per-cpu queues which you collect to send out instead. I assume pfsync can't do unrelated streams.

In D17992#384518, @mjg wrote:

I think the approach taken here is iffy. Basic problem with this is that even if there is no lock contention anymore, you are still suffering from bouncing cache lines. Also swi_sched probably does not appreciate being called very often.

My performance tests show an improvement of 80-90% with this change. The swi_sched is costly, yes, but it's a pretty significant improvement in throughput.

Since it does not look like the order of messages is of any significance and you are sending them out periodically, it will probably be much faster to have per-cpu queues which you collect to send out instead. I assume pfsync can't do unrelated streams.

The order of updates is relevant, at least within a single state. I.e. we can't have an update arriving before the state insertion message. It's fine for two state insertion messages for different states to be reordered, but not within one state. Per-CPU queues wouldn't work.

I have no doubt there is an improvement, just saying it is still slower than it can be and unless this uncovered a new major bottleneck, the ring manipulation is the new hotspot.

As noted earlier I don't know the constraints here, even if some ordering is required perhaps there is a way to just sort or have a finer-grained locking instead. Perhaps once a state is in, you can add updates without colliding with anyone updating different states.

Do you have a flamegraph or some other cpu profile of the entire thing? just curious, I have no stake in this code

I've been thinking about what we could do to split up the lock, rather than take this approach, and I'm not sure how.

(This is the current state, I'm ignoring this proposed patch here)
Basically, the relevant bits (i.e. where performance matters) are the state insert/update/delete flows. There's a couple other things (like update requests and deferred packets, but we can ignore those for now).
Those currently happen fully in the pf data path, that is, while pf is processing packets, and we hold a pf state lock already.
pf states are locked per hash row, so there's lots of locks to go around there (and not contend). There's only a single pfsync lock, so that's where the bottleneck is now.

The pfsync code doesn't immediately send out a packet. Updates are kept on an internal queue (sc->sc_qs) first so that updates can be coalesced (i.e. a delete means none of the updates before it matter), and so there's more than one update in a packet. That's tracked in a separate variable (sc->sc_len). When there's enough data (to fill an MTU sized packet) we send it out, or if we've had data pending for a second.

It's fairly straightforward to split up that queue. We don't care about the order of event between different connections, only within one connection. We can just hash the state and get as many buckets as we like. We could even (ab)use the pf state lock to protect our buckets.
The issue I have with that is: how do we make sure that we don't wait for a full second to send out the updates, while we can't have a single lock protecting the sc->sc_len.
We could have per-bucket callouts, but we'd probably want as many buckets in pfsync as there are in pf itself, and that starts at 128k and goes up from there. (If we have fewer we're not getting rid of all of the contention in pfsync, we'd very likely want more than a few thousand at least.)

Maybe you've got ideas?

What needs to be considered here is the assumption of pfsync is that a state created in pf will be synched at shortest possible cycle to the cluster member.
By defering that assumption is relaxed so figuring out baselines of before and after this change would make this more easy to reason about.

Here are flame graphs for my test setup:

plain forwarding
pf only
unpatched pfsync
pfsync with this (and the next) patch

https://people.freebsd.org/~kp/pfsync/

Rough performance results (in Mpps):

No pf

8.400

No pfsync

3.877

Unmodified pfsync

1.123

Patches pfsync (on top of b35...)

2.177

In D17992#385087, @eri wrote:

What needs to be considered here is the assumption of pfsync is that a state created in pf will be synched at shortest possible cycle to the cluster member.
By defering that assumption is relaxed so figuring out baselines of before and after this change would make this more easy to reason about.

I'm not sure I understand what you're getting at here. Are you talking about the 'defer' feature in pfsync (which does sort of the opposite. It defers the data packet until the pfsync update has been acked by the peer), or suggesting that we might try less hard to send pfsync updates early to gain throughput?

That last one might be a good path forward. If we ensure that there's at least always a 1 second delay callout, perhaps combined with a counter or something to trigger an earlier sendout if we've added X updates to the queues, we might get at a reasonable compromise between too many packets and too much delay, while still not having to take extra locks.

So both ring and swi kicking code are significant players. I think a simple and probably good enough solution would just add more rings, perhaps based on the number of hardware threads. Assuming the traffic is hashed to distribute among them, the rings could mostly remain unshared with unrelated threads. Sending out of the traffic would just combine data from all rings. Kicking can also be avoided in a simple manner. You can add a var signifying the frequency of wakeups. The increase the frequency based on the traffic and past certain threshold you stop kicking swi. It has to decay so that if there is no traffic, the code goes back to wakeups once a second (or whatever).

In D17992#386171, @mjg wrote:

So both ring and swi kicking code are significant players. I think a simple and probably good enough solution would just add more rings, perhaps based on the number of hardware threads. Assuming the traffic is hashed to distribute among them, the rings could mostly remain unshared with unrelated threads. Sending out of the traffic would just combine data from all rings.

I'm not sure adding rings would help here. One change here is that pfsync can now cause pf to drop packets. Before this change pfsync caused contention, which would cause us to just not keep up with incoming packets, and the NIC would drop them. Now we don't contend (as much) in pfsync, but if the ring is full we tell pf to drop the packet. If this happens enough you'll see a lot of time in pfsync_msg_enqueue(), because we enter, try to insert into the queue, fail and exit. That's certainly the case here, because in the test I inject 25Mpps of traffic, and even without pf our stack tops out at ~8Mpps.

Adding more rings won't really help any more than making this one ring larger. That merely increases the queue length between the multiple pf threads, and the pfsync processing code (which is still single-threaded) in pfsync_msg_intr() and pfsyncintr().

Kicking can also be avoided in a simple manner. You can add a var signifying the frequency of wakeups. The increase the frequency based on the traffic and past certain threshold you stop kicking swi. It has to decay so that if there is no traffic, the code goes back to wakeups once a second (or whatever).

That would probably be helpful, because there's little point in triggering the pfsync_msg_intr() swi 4000 times.
I'm not quite sure how to best approach this though. I don't really want to have (e.g.) a 1Hz callout kicking the swi, but I can't quite convince myself that mitigations in pfsync_msg_enqueue() (e.g. don't swi_sched() if the ring is more than half-full) don't carry a slight risk that we won't schedule at all.

In D17992#387922, @kristof wrote:

Adding more rings won't really help any more than making this one ring larger. That merely increases the queue length between the multiple pf threads, and the pfsync processing code (which is still single-threaded) in pfsync_msg_intr() and pfsyncintr().

Adding more rings with proper hashing will significantly reduce ping ponging, giving smaller CPU use and potential for higher throughput provided other problems get sorted out.

Kicking can also be avoided in a simple manner. You can add a var signifying the frequency of wakeups. The increase the frequency based on the traffic and past certain threshold you stop kicking swi. It has to decay so that if there is no traffic, the code goes back to wakeups once a second (or whatever).

That would probably be helpful, because there's little point in triggering the pfsync_msg_intr() swi 4000 times.
I'm not quite sure how to best approach this though. I don't really want to have (e.g.) a 1Hz callout kicking the swi, but I can't quite convince myself that mitigations in pfsync_msg_enqueue() (e.g. don't swi_sched() if the ring is more than half-full) don't carry a slight risk that we won't schedule at all.

There is no risk. You take a look at wakeup rate. If you don't decide to kick, that's roughly the upper bound for the delay + the time to get thread on cpu. If you do kick, it is the overhead of kicking + the time to get thread on cpu. You can decide to always kick below a given threshold, including when the thread does not get on CPU at all. The thread regulates wakeup interval based on the amount of traffic it sees.

Single-threaded processing is definitely a bottle neck. Perhaps you can have more than one thread pushing stuff and/or some of it can be done by submitting threads instead.

In D17992#387985, @mjg wrote:

Single-threaded processing is definitely a bottle neck. Perhaps you can have more than one thread pushing stuff and/or some of it can be done by submitting threads instead.

It might be possible to separate the currently single-threaded processing work out into a number of different concurrently runnable blocks.
Say we create ncpu different queues (well, ck_rings, as well as the existing sc_qs, and sc_len), and hash into those based on the state id (basically PF_IDHASH(state) % ncpu), that should let us process these in parallel.
We'd have a timeout per queue, and presumably also swi cookies per queue, but as we'd only have ncpu of those, that should be okay.

I'm going to experiment with that.

I have an alternative version in https://reviews.freebsd.org/D18373.
It splits pfsync up into buckets, with their own lock. Performance is slightly better than this, and it's in many ways a simpler change.