
if_epair: implement fanout
ClosedPublic

Authored by kp on Jan 3 2022, 8:45 PM.

Details

Summary

Allow multiple cores to be used to process if_epair traffic. We do this
(if RSS is enabled) based on the RSS hash of the incoming packet. This
allows us to distribute the load over multiple cores, rather than
sending everything to the same one.

We also switch from swi_sched() to taskqueues, which further improves
throughput.
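
For illustration, the fanout boils down to hashing each packet to a queue (and thus a CPU). A minimal sketch, assuming a hypothetical helper; this is not the committed code:

{{{
/* Needs <sys/param.h> and <sys/mbuf.h>. */
static u_int
epair_pick_queue(const struct mbuf *m, u_int num_queues)
{
	uint32_t hash = 0;

	/* Reuse the flow hash if the stack already computed one. */
	if (M_HASHTYPE_GET(m) != M_HASHTYPE_NONE)
		hash = m->m_pkthdr.flowid;

	/* Without a hash, everything falls back to queue 0. */
	return (hash % num_queues);
}
}}}

Because the hash is per flow, packets of one connection stay on one core and in order, while distinct flows spread across cores.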

Benchmark results:
With net.isr.maxthreads=-1

Setup A: (cc0 - bridge0 - epair0a) (epair0b - bridge1 - cc1)

Before 627 Kpps
After (no RSS) 1.198 Mpps
After (RSS) 3.148 Mpps

Setup B: (cc0 - bridge0 - epair0a) (epair0b - vnet jail - epair1a) (epair1b - bridge1 - cc1)

Before 7.705 Kpps
After (no RSS) 1.017 Mpps
After (RSS) 2.083 Mpps

MFC after: 3 weeks
Sponsored by: Orange Business Services

Diff Detail

Repository
rG FreeBSD src repository

Event Timeline

kp requested review of this revision. Jan 3 2022, 8:45 PM

Where do your Setup B, After (RSS) numbers come from? Why are they worse than no RSS? That smells like something is bouncing around rather than sticking.

sys/net/if_epair.c
116

Do you still need this now?

sys/net/if_epair.c
551

This doesn't scale if you have 500 epairs on a system.

You probably really want to have a global set of tasks, bound to CPUs, and to balance all epairs over those?

(a scenario your test cases do not consider at all)

573

Likewise...

In D33731#762671, @bz wrote:

Where do your Setup B, After (RSS) numbers come from? Why are they worse than no RSS? That smells like something is bouncing around rather than sticking.

That's the observed result. It's not clear to me why it's significantly worse than the simple setup, but your suspicion seems plausible.

sys/net/if_epair.c
116

No, that can go.

551

Right, so we'd create num_queues taskqueue threads, presumably from MOD_LOAD, and then assign tasks to them from individual epairs. We pass the epair_queue, so all relevant information should be present. I'll see if I can work up a patch to do that, and see how well it works.
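
Roughly along these lines (a sketch only; the names epair_tqs and epair_mod_load are made up, and error handling is omitted):

{{{
/* Global per-CPU taskqueues shared by all epairs, created at MOD_LOAD. */
static struct taskqueue *epair_tqs[MAXCPU];

static void
epair_mod_load(void)
{
	for (int i = 0; i < mp_ncpus; i++) {
		epair_tqs[i] = taskqueue_create("epairq", M_WAITOK,
		    taskqueue_thread_enqueue, &epair_tqs[i]);
		taskqueue_start_threads(&epair_tqs[i], 1, PI_NET,
		    "epairq%d", i);
	}
}
}}}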

sys/net/if_epair.c
551

For simplicity I'd just create MIN(maxncpu, num_queues) threads, but for even more simplicity I'd do either 1 or maxncpu threads and have an epair queue per CPU; that means 500 queues may eventually fight for one task, but a CPU can only do so much, and assuming RSS is working well the load should balance out over all CPUs.

The former will get you back to where we have been, and might be a sane default for people who run one vnet (or very few) with low traffic on a low-end system. On other systems, for crypto and all kinds of other things, we do create a full set of threads per CPU, which with 256(+) threads/cores/CPUs will eventually be interesting to see scale; but I assume that will be someone else's problem to solve in general.
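
If we do go per CPU, the threads could also be pinned; a sketch reusing the hypothetical epair_tqs array from above:

{{{
/* Pin the thread for queue 'cpu' to that CPU. */
cpuset_t mask;

CPU_ZERO(&mask);
CPU_SET(cpu, &mask);
taskqueue_start_threads_cpuset(&epair_tqs[cpu], 1, PI_NET, &mask,
    "epairq%d", cpu);
}}}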

Single set of taskqueues

In D33731#763409, @kp wrote:

Single set of taskqueues

Something like this?

This still has at least some issues, because enabling RSS doesn't improve (and may actually reduce) throughput. I'm not quite clear on why that would be though.

mjg added inline comments.
sys/net/if_epair.c
176–177

should be fcmpset and the atomic load hoisted out of the loop
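
i.e. roughly this pattern (a generic sketch; the state field and flag name are hypothetical):

{{{
uint32_t old, new;

/* Load once; on failure, fcmpset refreshes 'old' for us. */
old = atomic_load_32(&q->state);
do {
	new = old | EPAIR_QUEUE_RUNNING;
} while (atomic_fcmpset_32(&q->state, &old, new) == 0);
}}}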

808

You should walk all CPUs in order to obtain correct memory locality with NUMA. I'm not aware of any sleep-friendly way to execute a callback on all CPUs, but you can just bind yourself, like quiesce_cpus does.
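
Concretely, something like the quiesce_cpus pattern (a sketch; assumes a sleepable context):

{{{
int cpu;

/*
 * Bind to each CPU in turn so per-CPU allocations land in the
 * right NUMA domain.
 */
CPU_FOREACH(cpu) {
	thread_lock(curthread);
	sched_bind(curthread, cpu);
	thread_unlock(curthread);
	/* ... allocate this CPU's queue state while running here ... */
}
thread_lock(curthread);
sched_unbind(curthread);
thread_unlock(curthread);
}}}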

Thanks for the advice! I'm going to experiment with that.

In the meantime I've posted a slightly simplified version of this (basically this patch without RSS), because it somewhat improves the basic case and vastly improves the pathological case. See D33853

I'll eventually rebase this patch on top of that, but need to work on these suggestions first.

.. if you're distributing workload with RSS and you set the number of RSS netisr contexts, that's where you can run your parallelism. The whole point here is to whack all the net processing into netisr contexts rather than keep adding taskqueues and inventing new/fun ways to map CPUs and figure out how many.

we still don't suitably autoconfigure netisr contexts based on how many cpus we have ... :P maybe we should finally fix that.

epair in its initial incarnation was able to use the netisr contexts and bind epairs to CPUs (still a problem on multi-socket). But the problem was that netisr wasn't keeping up; if you created an epair per CPU the system was long toast. Now the dimensions (the bindings) have changed, given that Kristof has implemented a fanout over all CPUs, so that even a single epair can use all of them and the balancing is different; but the end effect even with
{{{
net.isr.bindthreads=1
net.isr.maxthreads=256 # net.isr will always reduce it to mp_cpus
}}}
will be the same, I bet.

kp edited the summary of this revision. Jan 17 2022, 9:47 PM

The netisr pointer was useful: with net.isr.maxthreads=-1 we get a bit more benefit from enabling RSS and attempting to spread the load.
I'm still not thrilled about these numbers, but they are significantly better than they used to be (spectacularly so for setup B), and both scenarios now benefit from RSS.

I've updated the commit message with my latest test results.

This revision was not accepted when it landed; it landed in state Needs Review. Feb 15 2022, 8:04 AM
Closed by commit rG24f0bfbad57b: if_epair: implement fanout (authored by kp).
This revision was automatically updated to reflect the committed changes.