Page MenuHomeFreeBSD

Widen EPOCH(9) usage in network drivers (as a pre-step for D23347)
Needs ReviewPublic

Authored by hselasky on Jan 24 2020, 12:16 PM.
Tags
None
Referenced Files
Unknown Object (File)
Sat, Dec 14, 2:21 PM
Unknown Object (File)
Fri, Dec 13, 1:33 AM
Unknown Object (File)
Tue, Dec 3, 6:40 AM
Unknown Object (File)
Thu, Nov 28, 4:37 AM
Unknown Object (File)
Sun, Nov 24, 5:11 PM
Unknown Object (File)
Nov 21 2024, 3:22 AM
Unknown Object (File)
Nov 16 2024, 5:39 PM
Unknown Object (File)
Nov 11 2024, 1:36 PM

Details

Summary

Widen EPOCH(9) usage in network drivers.

Make sure network drivers enter and exit the EPOCH(9) before
forwarding network packets to the network stack. This is part of
widening the EPOCH(9) usage so that entering and exiting the EPOCH(9)
at every packet input to the network stack can be avoided.

This patch address the shortcomings of r357004, that many network
drivers have timers, taskqueues and even own threads that input IP
packets, and not only interrupt threads. Two convenience macros
NET_EPOCH_WRAP() and NET_EPOCH_WRAP_RET() have been added to minimize
the amount of patching needed in each network driver.

Each network driver is now responsible for entering and exiting the
EPOCH(9) before calling the if_input method of struct ifnet. The only
exception is when debugnet is activated. Then EPOCH(9) is not
required. Also netisr_queue(9) can be called outside an EPOCH(9)
section, which is used by sppp_input() for example.

Typically the coalesching parameters or interrupt moderation settings
in existing drivers now indirectly control how frequently the EPOCH(9)
is entered and exited. The more packets that are input at a time, the
more time the epoch_wait(9) function may need which can lead to
temporarily increased memory usage, because of structures waiting to
be freed by an epoch_call(9).

Sponsored by: Mellanox Technologies

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Lint Skipped
Unit
Tests Skipped
Build Status
Buildable 29300

Event Timeline

Remove "inline" keyword which sneakt in.

I like the idea of using a different ether_input() for epoch and non-epoch safe drivers to get the tests out of the critical path. However, this is a minor optimization at best, as branch predictors are really good these days on modern CPUs.

However, I do not see any need for the flag churn. With the flag churn you essentially removed this optimization for all drivers.

This revision now requires changes to proceed.Jan 24 2020, 2:15 PM
glebius requested changes to this revision.Jan 24 2020, 4:54 PM

Minority of drivers need this flag and the future plan is none will need.

@glebius : Do you have the patches ready to cover the remaining drivers? Shouldn't that be a pre-req before adding needs epoch flag?

hselasky retitled this revision from Get epoch checks out of fast path in network drivers to Add missing EPOCH wrappers as a pre-step for D23347.
hselasky edited the summary of this revision. (Show Details)
sys/arm/allwinner/if_awg.c
1026 ↗(On Diff #67354)

This is a style violation.

sys/arm/allwinner/if_emac.c
723 ↗(On Diff #67354)

And same for all other places.

sys/net/netisr.c
940 ↗(On Diff #67354)

Is it safe ? (I am not sure if the netisr rmlock is sleepable).

hselasky edited the summary of this revision. (Show Details)

Added missing use of EPOCH.

sys/arm/allwinner/if_awg.c
1026 ↗(On Diff #67354)

You mean the variable should be declared on top of the function?

sys/net/netisr.c
940 ↗(On Diff #67354)

From what I can read from the code Gleb's patches put this under epoch aswell.

I will try to give this a spin with WITNESS enabled.

Fix style issues mentioned by @kib and address some minor compilation issues.

glebius requested changes to this revision.Jan 27 2020, 7:58 PM

Do you suggest this change as an alternative to D23242? Cause you added epoch in many places where we already are in epoch, for example interrupt handlers and polling handlers. Or do you claim that all these drivers are broken in head and this change is required to fix them?

This revision now requires changes to proceed.Jan 27 2020, 7:58 PM

@glebius: Or do you claim that all these drivers are broken in head and this change is required to fix them?

I claim that many of the drivers I've patched are broken.

Debugnet is broken.

All USB WLAN is broken.

@glebius: Do you suggest this change as an alternative to D23242

This change together with D23347 should replace D23242, yes.

Already many ethernet drivers are almost C&P of each other and putting the EPOCH into the drivers makes sense, because for many older PCI devices, there are many kinds of interrupt sources, like TX completion interrupts, LINK events, and doing the EPOCH for those cases is not needed. Also some drivers use INTR_TYPE_NET which are not network drivers, for example the LinuxKPI.

Also many drivers have a callout to watchdog the RX interrupt, and that is totally missed by D23242 . Are you planning to add the network EPOCH to the callout subsystem too?

mlx4en and mlx5en will have separate patches from Mellanox (which are not yet part of this patchset, because there are some new features coming which use EPOCH inside the driver code).

@glebius: Or do you claim that all these drivers are broken in head and this change is required to fix them?

I claim that many of the drivers I've patched are broken.
Debugnet is broken.
All USB WLAN is broken.

How exactly is debugnet broken? AFAIU, it swaps if_input() method of an interface to its own method. The drivers keeps using the same interrupt handler it setup at attach.

The USB WLANs were fixed back on Friday in r357093.

Already many ethernet drivers are almost C&P of each other and putting the EPOCH into the drivers makes sense, because for many older PCI devices, there are many kinds of interrupt sources, like TX completion interrupts, LINK events, and doing the EPOCH for those cases is not needed. Also some drivers use INTR_TYPE_NET which are not network drivers, for example the LinuxKPI.

Entering epoch is cheap and doesn't put any constraints on already limited interrupt context. So, there is nothing critical in having epoch for link or TX interrupts.

Let me repeat myself about LinuxKPI. The fact that it puts INTR_TYPE_NET on any interrupt is a mistake in LinuxKPI, not in my patch.

Also many drivers have a callout to watchdog the RX interrupt, and that is totally missed by D23242 . Are you planning to add the network EPOCH to the callout subsystem too?

Yes. The plan is that once we put an epoch marked callout onto cc_expireq, the latter is marked. Then softclock() will enter the epoch for duration of processing the whole expireq. This will allow to mark all TCP callouts as network and when we get a bunch of TCP callouts expiring at one tick, they will be executed at the same epoch.

@glebius: Or do you claim that all these drivers are broken in head and this change is required to fix them?

I claim that many of the drivers I've patched are broken.
Debugnet is broken.
All USB WLAN is broken.

How exactly is debugnet broken? AFAIU, it swaps if_input() method of an interface to its own method. The drivers keeps using the same interrupt handler it setup at attach.

Shouldn't EPOCH be applied before all if_input methods? Documentation?

The USB WLANs were fixed back on Friday in r357093.

Two drivers of many USB WLANs were fixed! And you've left the rest broken. This is not good!

Already many ethernet drivers are almost C&P of each other and putting the EPOCH into the drivers makes sense, because for many older PCI devices, there are many kinds of interrupt sources, like TX completion interrupts, LINK events, and doing the EPOCH for those cases is not needed. Also some drivers use INTR_TYPE_NET which are not network drivers, for example the LinuxKPI.

Entering epoch is cheap and doesn't put any constraints on already limited interrupt context. So, there is nothing critical in having epoch for link or TX interrupts.

Let me repeat myself about LinuxKPI. The fact that it puts INTR_TYPE_NET on any interrupt is a mistake in LinuxKPI, not in my patch.

I disagree. INTR_TYPE_NET is only an indicator of interrupt thread priority, and should not be bound to network drivers only!

Also many drivers have a callout to watchdog the RX interrupt, and that is totally missed by D23242 . Are you planning to add the network EPOCH to the callout subsystem too?

Yes. The plan is that once we put an epoch marked callout onto cc_expireq, the latter is marked. Then softclock() will enter the epoch for duration of processing the whole expireq. This will allow to mark all TCP callouts as network and when we get a bunch of TCP callouts expiring at one tick, they will be executed at the same epoch.

I'm not sure if this is a good idea. How can you support per-VNET network EPOCH's then?

You would also need to patch taskqueues and group-taskqueues. I think this is not acceptable unless you allow any EPOCH to be specificed.

> Entering epoch is cheap and doesn't put any constraints on already limited interrupt context. So, there is nothing critical in having epoch for link or TX interrupts.

You are right that the reader is cheap, but it is not cheap for the waiter ! Please consider that aswell. Also we have some spin-loops in the epoch wait mechanism and by enlarging the range of the EPOCH's basically there will be more task-switching trying to chase down the EPOCH's during wait and that cause more CPU usage.

sys/net/netisr.c
940 ↗(On Diff #67354)

Currently NETISR_LOCKING is not defined, so the read lock is inactive.

/usr/src/sys/net/netisr.c:/* #define NETISR_LOCKING */

In general I think it is more clever to apply the EPOCH() after any read locks and mutexes, so I'm pushing an update here.

Define a pair of macro functions to nicely wrap packet receive functions into using epoch.

hselasky marked an inline comment as done.

Split function and argument, similar to KASSERT().
Slightly rename function macros.

Use NET_EPOCH_WRAP when possible.

There have been multiple discussions on the certain aspects on the recent network-epoch changes:

All have different subsets of people involved.
Having a discussion is good, but it is a bit hard to grasp a complete picture, as the proposals / comments / concerns are scattered.
Maybe it is worth spinning a separate thread in -arch on the direction&implementation for the network epoch?

Meanwhile, let me try to summarise the discussion(s), at least for my understanding. Please correct me if I’m wrong or providing not accurate representation.

There are no pushback on the net-epoch direction and the existing concerns can be grouped in 2 parts:

  1. Layering violation concern:

The currently-implemented approach has the actual epoch-related business logic implemented in one place instead of hundreds.
However, it does so by adding quite a lot of network-specific bits to the generic subsystems. This raises a concern introducing network-specific calls in the generic interrupt handlers and the future work on the callout / taskqueue. (ian@, kib@, rstone@).

There is an alternative proposal, suggesting explicitly modififying each and every network driver interrupt/callout/taskq part to enter network epoch.
It has the nice property of keeping network-specific logic inside the network-related code. However, it does introduce more boilerplate code in each network driver.

  1. Generic concerns over worst-case epoch duration, which (a) might slowdown epoch_wait() customers (hselasky@), (b) change the access pattern for the memory allocator and (c) result in the increased peak memory consumption for some workloads (jeff@, melifaro@)

My 2 cents on these concerns:
For (1): I like the decision on having the epoch logic in the limited amount of places. Though, It would be ideal to see a non-layering-violating implementation for the interrupt handler (pre-post hooks as kib@ suggested?).
For (2a): the main (current) customers of epoch_wait() are the different control planes (iface, vnet, ipsec). From my networking experience, people typically have ether feature-rich device (bras, ipsec terminator) OR traffic-heavy device, but very rarely both. This is certainly a use case that has to be considered, but I don’t think it should be a primary concern.
For (2) in general: I’d advocate for solving one thing at a time: let’s stabilise the data plane first, so we can have the needed performance and then focus on improving the control plane.

I have a follow-up question to all this stuff which I do not understand yet:

@glebius mentioned (I think in his original commit) that this was a "bandaid" that we want to get rid off before 13.
Now we (are having to) add it to a gazillions of drivers.

What's the long-term plan then? How would we get rid of this again? What prevents us from doing this now instead of this?

Let me repeat myself about LinuxKPI. The fact that it puts INTR_TYPE_NET on any interrupt is a mistake in LinuxKPI, not in my patch.

I disagree. INTR_TYPE_NET is only an indicator of interrupt thread priority, and should not be bound to network drivers only!

No, I disagree. INTR_TYPE_NET says TYPE NET. Sorry, I can't make this letters any bigger, they already are. They say TYPE NET, they don't say INTERRUPT THREAD PRIORITY.

Also many drivers have a callout to watchdog the RX interrupt, and that is totally missed by D23242 . Are you planning to add the network EPOCH to the callout subsystem too?

Yes. The plan is that once we put an epoch marked callout onto cc_expireq, the latter is marked. Then softclock() will enter the epoch for duration of processing the whole expireq. This will allow to mark all TCP callouts as network and when we get a bunch of TCP callouts expiring at one tick, they will be executed at the same epoch.

I'm not sure if this is a good idea. How can you support per-VNET network EPOCH's then?

There is no per-VNET epoch, and there shouldn't be any.

You would also need to patch taskqueues and group-taskqueues. I think this is not acceptable unless you allow any EPOCH to be specificed.

I'll post review for taskqueues in next few minutes.

Two drivers of many USB WLANs were fixed! And you've left the rest broken. This is not good!

Can you please point me to broken USB WLANs?

Let me repeat myself about LinuxKPI. The fact that it puts INTR_TYPE_NET on any interrupt is a mistake in LinuxKPI, not in my patch.

I disagree. INTR_TYPE_NET is only an indicator of interrupt thread priority, and should not be bound to network drivers only!

No, I disagree. INTR_TYPE_NET says TYPE NET. Sorry, I can't make this letters any bigger, they already are. They say TYPE NET, they don't say INTERRUPT THREAD PRIORITY.

If you look at existing code, INTR_TYPE_XXX is a priority indicator, nothing else!

/* Map an interrupt type to an ithread priority. */
u_char
intr_priority(enum intr_type flags)
{

Also many drivers have a callout to watchdog the RX interrupt, and that is totally missed by D23242 . Are you planning to add the network EPOCH to the callout subsystem too?

Yes. The plan is that once we put an epoch marked callout onto cc_expireq, the latter is marked. Then softclock() will enter the epoch for duration of processing the whole expireq. This will allow to mark all TCP callouts as network and when we get a bunch of TCP callouts expiring at one tick, they will be executed at the same epoch.

I'm not sure if this is a good idea. How can you support per-VNET network EPOCH's then?

There is no per-VNET epoch, and there shouldn't be any.

Why can't epochs be global to a VNET?

You would also need to patch taskqueues and group-taskqueues. I think this is not acceptable unless you allow any EPOCH to be specificed.

I'll post review for taskqueues in next few minutes.

I think it is better that drivers are allowed to optimize the use of EPOCH, than forcing it down on every network driver interrupt. For example Mellanox NICs support both infiniband and ethernet on the same interrupt vectors. Forcing infiniband completion callbacks under EPOCH is then not needed for example.

Similarly some network drivers process one IP packet per interrupt while others process multiple IP packets per interrupt. The interrupt handle has zero knowledge about this, so should not be responsible for batching of interrupts.

Two drivers of many USB WLANs were fixed! And you've left the rest broken. This is not good!

Can you please point me to broken USB WLANs?

Please read through this patch and you'll see. USB transfers are executed from an own thread actually and not from the context of the interrupt handler. This has to do with realtime constraints.

In D23348#513267, @bz wrote:

I have a follow-up question to all this stuff which I do not understand yet:

@glebius mentioned (I think in his original commit) that this was a "bandaid" that we want to get rid off before 13.
Now we (are having to) add it to a gazillions of drivers.

What's the long-term plan then? How would we get rid of this again? What prevents us from doing this now instead of this?

If you want to merge EPOCH enter/exit calls in time, the network driver should be the one responsible for that.

Network drivers frequently lock a driver mutex and it is very clever to enter/exit the EPOCH inside the locked section, so that we avoid congestion on the wait for the network EPOCH. In the generic interrupt handler code there is no knowledge about any locking done by underlying network drivers.

The band-aid should be a flag on the network interface structure to say if the network driver does not enter/exit EPOCH on its own, like @glebius has already done.

Add patches for Octeon based network devices. These drivers register the IRQ handler via an API which is not guaranteed to use INTR_TYPE_NET.

LINT builds tested using the current patch: aarch64 armv7 amd64

hselasky retitled this revision from Add missing EPOCH wrappers as a pre-step for D23347 to Widen EPOCH(9) usage in network drivers (as a pre-step for D23347).Jan 29 2020, 5:39 PM
hselasky edited the summary of this revision. (Show Details)
hselasky added a reviewer: network.

Can you please point me to broken USB WLANs?

Please read through this patch and you'll see.

I see that the patch covers almost every driver in the tree, and I know that every driver in the tree isn't broken. I see that patch adds epoch recursions to most drivers. So I repeat my question: can you please point me to broken USB WLANs? If you cooperated, I would already fixed them Monday or yesterday.

Can you please point me to broken USB WLANs?

Please read through this patch and you'll see.

Hi Gleb,

Your question can be answered by reading through this patch. Have you read through this patch?

I see that the patch covers almost every driver in the tree, and I know that every driver in the tree isn't broken.

Yes, in corner cases they are broken. For example one driver calls the so-called rxeof from if_watchdog/if_ioctl .
And several others have taskqueues/threads which are not covered.
And some use interrupt handlers where you cannot control the interrupt type.

Another example: mlx5en calls the so-called rxeof from the if_ioctl as a result from "ifconfig up".
This was not caught by any "review" if any. Just out of luck mlx5en works out of the box, if no data is coming into the card during "ifconfig up".

All in all I'm a bit pissed that you force problems down on other developers, like in the USB/ARM/embedded area, which you probably did not test at all.

I see that patch adds epoch recursions to most drivers.

The plan is to remove the EPOCH() entering inside the kernel's interrupt handler. Then there will be less EPOCH() recursion. Also EPOCH() handles recursion w/o any issues, only the refcount is updated.

So I repeat my question: can you please point me to broken USB WLANs?

All diffs in this patch for sys/dev/usb/wlan point to broken USB WLAN drivers.

If you cooperated, I would already fixed them Monday or yesterday.

See answer above, I think this is obvious if you read this patch.

Rebase patch after some chunks were submitted upstream.

Fix compile issue in if_dme.c.
Reduce some whitespace changes.