Page MenuHomeFreeBSD

x86: Add dynamic interrupt rebalancing
ClosedPublic

Authored by cem on Apr 20 2017, 3:53 AM.
Tags
None
Referenced Files
Unknown Object (File)
Sun, Jan 19, 9:26 PM
Unknown Object (File)
Sun, Jan 19, 7:34 PM
Unknown Object (File)
Sat, Jan 18, 9:36 AM
Unknown Object (File)
Mon, Jan 13, 5:01 PM
Unknown Object (File)
Sun, Jan 5, 8:47 AM
Unknown Object (File)
Fri, Jan 3, 7:57 AM
Unknown Object (File)
Fri, Jan 3, 5:56 AM
Unknown Object (File)
Thu, Jan 2, 4:13 PM

Details

Summary

Add an option to dynamically rebalance interrupts across cores
(hw.intrbalance); off by default.

The goal is to minimize preemption. By placing interrupt sources on
distinct CPUs, ithreads get preferentially scheduled on distinct CPUs.
Overall preemption is reduced and latency is reduced. In our workflow
it reduced "fighting" between two high-frequency interrupt sources.
Reduced latency was proven by, e.g., SPEC2008.

Test Plan

Weighting and algorithm are very dumb. Just take the interrupt counts and
round-robin across cores in descending order.

Diff Detail

Lint
Lint Passed
Unit
No Test Coverage
Build Status
Buildable 9346
Build 9809: arc lint + arc unit

Event Timeline

x86/x86/intr_machdep.c
337 ↗(On Diff #27567)

Shouldn't this assignment done under the lock ?

x86/x86/intr_machdep.c
337 ↗(On Diff #27567)

Not sure. In our branch (written against a ~BSD10.0 branch) it is not, but that may be wrong.

Any thoughts? Anything commit blocking? Should this be opened to a wider audience for review? Thanks.

I'm not sure I fully understand this, but my reading is that it moves intr threads that are typically round-robin distributed once during boot/attach around at runtime at 1Hz? I do have some concerns if that is the case, particularly for networking. We rely heavily on prefetching data, so intr movement and intr preemption are both bad IMO (but to what extent is actual research, and should include generic scheduling policy). An overarching concern of mine is keeping things (locks and data) vertically aligned on a core between layers.. from tx/rx intrs up through transport layer and application (via RSS). There are things in motion that seem to move away rather than toward that goal.

Hi Kevin,

In D10435#217911, @kevin.bowling_kev009.com wrote:

I'm not sure I fully understand this, but my reading is that it moves intr threads that are typically round-robin distributed once during boot/attach around at runtime at 1Hz?

It is configurable (frequency will be 1/sysctl hw.intrbalance Hz) and off by default. Suggested hw.intrbalance value to enable the feature is 60, i.e., it would update once every minute.

I do have some concerns if that is the case, particularly for networking. We rely heavily on prefetching data, so intr movement and intr preemption are both bad IMO (but to what extent is actual research, and should include generic scheduling policy).

Yeah, we actually created this for heavy networking workloads. Initial CPU interrupt assignment placed two high-traffic interrupt sources (a NIC and Intel's I/OAT) on the same core. This resulted in a lot of preemption. Until runtime, the system can't really know what sources will have a lot of interrupt activity. So, I think dynamically rebalancing at runtime makes some sense.

An overarching concern of mine is keeping things (locks and data) vertically aligned on a core between layers.. from tx/rx intrs up through transport layer and application (via RSS). There are things in motion that seem to move away rather than toward that goal.

Not sure if you're talking about this or not. Note that this feature is off by default — i.e., no change in behavior from today. And enabling it at a frequency of 1/60 Hz results in a very low interrupt movement rate.

Does that answer your questions? I'm happy to answer more questions / help address concerns. Thanks for taking a look.

I have no specific objections, but I agree with your thoughts how it could be improved on different fronts.

Plus I think better handling of wired interrupts could be good. Now they are just skipped from migration, but that completely excludes them from the optimization logic, and may cause unneeded collisions. Some weighted packing could help with this too.

I don't have the credibility to slow anything down, I'll table the broader discussion for BSDCan dev summit.

One idea that comes to mind is a kevent when rebalancing, so things like software TCP pacers (kernel threads) and RSS aware user threads can re-affine themselves. But since this defaults to off, it might just be a todo along with the other notes.

In D10435#217965, @mav wrote:

I have no specific objections, but I agree with your thoughts how it could be improved on different fronts.

Plus I think better handling of wired interrupts could be good. Now they are just skipped from migration, but that completely excludes them from the optimization logic, and may cause unneeded collisions.

Yep. I'd like to wait on such improvements to a future commit, if that's ok.

Some weighted packing could help with this too.

Do you have an inexpensive algorithm in mind?

In D10435#217974, @kevin.bowling_kev009.com wrote:

I don't have the credibility to slow anything down, I'll table the broader discussion for BSDCan dev summit.

I won't be at BSDCan, unfortunately. Not sure if this will see any discussion there or not :-).

One idea that comes to mind is a kevent when rebalancing, so things like software TCP pacers (kernel threads) and RSS aware user threads can re-affine themselves. But since this defaults to off, it might just be a todo along with the other notes.

Sure. I think that could be a requirement for this were to become on by default and escape the confinement of x86 machine-dependent code.

This could be done in userland, right? By using cpuset to rewire interrupt CPUs?

This could be done in userland, right? By using cpuset to rewire interrupt CPUs?

I don't know enough about cpuset to answer that.

This could be done in userland, right? By using cpuset to rewire interrupt CPUs?

Looks like no:

982 int
983 kern_cpuset_setid(struct thread *td, cpuwhich_t which,
984     id_t id, cpusetid_t setid)
985 {
986         struct cpuset *set;
987         int error;
988
989         /*
990          * Presently we only support per-process sets.
991          */
992         if (which != CPU_WHICH_PID)
993                 return (EINVAL);

Hm, what's cpuset -x <irq> doing then?

Hm, what's cpuset -x <irq> doing then?

Looking at usr.bin/cpuset, 'x' sets xflag and mostly causes usage() to be invoked. It looks like it may do a cpuset_setaffinity() with the -l flag as well; otherwise it just does nothing.

With -x -l, we land in kern_cpuset_setaffinity():

1224         switch (level) {
...
1254         case CPU_LEVEL_WHICH:
1255                 switch (which) {
...
1270                 case CPU_WHICH_IRQ:
1271                         error = intr_setaffinity(id, mask);
1272                         break;

That binds the interrupts to a given CPU, which is slightly different from what we do here. Not sure if that would be acceptable or not.

hi,

It looks like you're setting the PIC<->CPU wiring, for the hardware IRQ routing, right? The cpuset -x is setting where the ithread runs?

It may be nice to expose being able to control the hardware IRQ routing per CPU from userland and then we can implement various schedulers to rebalance this. I'd even really like that for NUMA for reasons.

It looks like you're setting the PIC<->CPU wiring, for the hardware IRQ routing, right? The cpuset -x is setting where the ithread runs?

It looks like cpuset -x is setting the PIC as well, via intr_setaffinity -> intr_event_bind -> ie_assign_cpu -> intr_assign_cpu (if I'm following the indirection correctly). The only difference is this patch doesn't bind the ithread to that CPU.

The other remaining question is: Is the interrupt data we need available to userspace?

It may be nice to expose being able to control the hardware IRQ routing per CPU from userland and then we can implement various schedulers to rebalance this. I'd even really like that for NUMA for reasons.

It looks like we do that already! :-)

hi!

ok, so are you looking to balance the ithreads too, or just the PIC routing?

It may be useful to have an API that just modifies the PIC routing at runtime and another that just modifies the ithread pinning at runtime .. that should do what we need to then do this in userland?

-adrian

-adrian

hi!

ok, so are you looking to balance the ithreads too, or just the PIC routing?

Hi Adrian,

I believe we ($WORK) just want to balance the PIC routing.

(Jeff was under the impression the scheduler would try and place ithreads on the same core as the IRQ/filter, but I believe we ($WORK) are happy to run the ithread on any core rather than wait for a busy IRQ-local core.)

It may be useful to have an API that just modifies the PIC routing at runtime and another that just modifies the ithread pinning at runtime .. that should do what we need to then do this in userland?

Absolutely. I can sketch up something on top of cpuset, if we determine that's the right direction to head. Thanks!

hiya,

yeah. That whole "preempt stuff doing actual work just to handle an interrupt" is unfun. If you could look at the cpuset -x API and maybe create two ones - one for JUST ithread scheduling, one for JUST PIC routing - and then extend cpuset to control both of those (with -x also defaulting to the legacy, do both API) then we can implement all kinds of runtime tweaks of things.

Thanks for digging into it!

-adrian

IIRC scheduler by itself tries to schedule interrupt thread to current CPU, but it may forward it to another one, if this one is busy. I am not sure present cpuset model is the best here, since it hard wires threads to specific CPU cores, which may not be idle, leading to additional preemption.

Sketch out a userspace implementation, adding APIs to cpuset as needed.

cem retitled this revision from x86: Add dynamic interrupt rebalancing to Add irqrebalance(8), a service to distribute IRQ load.May 2 2017, 9:03 PM
cem edited the summary of this revision. (Show Details)
cem edited the test plan for this revision. (Show Details)
This revision is now accepted and ready to land.May 3 2017, 12:48 AM

Why do you move this stuff to userspace ? There is a lot of information which is only avaiable inside the kernel and which might be important to take into the account when making the decision on distribution. E.g., you want both interrupt handler and ithread to execute on a cpu in affinity domain of the device. Or, you might want to base policy decisions based on existing assignments, to not overload specific core with too many high-frequency interrupts, but you would be fine with assigning many rare interrupts to one core etc.

Right now your code only considers interrupt stats, which is fine for the moment and limited goals, but could be problematic in the long run. My point is that all the data is readily available inside the kernel boundaries, but would require non-trivial interfaces to pass that into userspace. I think that Linux runs its re-balancer in the kernel space for this reason.

In D10435#219438, @kib wrote:

Why do you move this stuff to userspace ?

Adrian suggested it. I mean, I'm fine going either way. The kernel way is much simpler.

There is a lot of information which is only avaiable inside the kernel and which might be important to take into the account when making the decision on distribution. E.g., you want both interrupt handler and ithread to execute on a cpu in affinity domain of the device. Or, you might want to base policy decisions based on existing assignments, to not overload specific core with too many high-frequency interrupts, but you would be fine with assigning many rare interrupts to one core etc.

Theoretically this information could be exported to userspace.

Right now your code only considers interrupt stats, which is fine for the moment and limited goals, but could be problematic in the long run. My point is that all the data is readily available inside the kernel boundaries, but would require non-trivial interfaces to pass that into userspace. I think that Linux runs its re-balancer in the kernel space for this reason.

Ok. I can move it back into the kernel (but in MI code instead of x86), if that is what makes sense.

Which statistics aren't exported to the kernel which should be so we can at least prototype this more?

I like the extended cpuset so we can actually tweak this stuff now.

I think that API to set irq and ithread binding to cpu is useful on its own, i.e. the kernel part of the current patch is good and should go in.

But for the (naive) rebalancing, I do not see why should we go with userspace instead of the code that started the discussion.

cem edited edge metadata.
cem edited the summary of this revision. (Show Details)

Moved the unobjectionable API extensions to a separate review: https://reviews.freebsd.org/D10586 .

This revision now requires review to proceed.May 3 2017, 3:48 PM

Go back to kernel implementation of balancing.

cem retitled this revision from Add irqrebalance(8), a service to distribute IRQ load to x86: Add dynamic interrupt rebalancing.May 11 2017, 9:30 PM
cem edited the summary of this revision. (Show Details)
cem removed a subscriber: adrian.
sys/x86/x86/intr_machdep.c
340

This is still done after unlock.

681

Callouts are not allowed to sleep. I suggest to use taskqueue_enqueue_timeout(9).

cem marked 2 inline comments as done.
cem edited the summary of this revision. (Show Details)
  • Move is_cpu update under lock
  • Use taskqueue_enqueue_timeout in place of callout, due to taking sx lock. (Broken in adaptation to CURRENT -- in 10.0 this was a non-sleepable ordinary mutex.)
kib added inline comments.
sys/x86/x86/intr_machdep.c
660

return (0);

This revision is now accepted and ready to land.May 17 2017, 9:01 PM
cem edited edge metadata.

Fix return style

This revision now requires review to proceed.May 17 2017, 9:38 PM
cem marked an inline comment as done.May 17 2017, 9:38 PM
This revision was automatically updated to reflect the committed changes.