
De-prioritize network driver ithreads to mitigate livelock
Needs ReviewPublic

Authored by gallatin on Apr 9 2021, 1:25 AM.

Details

Summary

The current priorities came into FreeBSD with the original SMPng work ~20 years ago. It's not clear that there is a good reason to run network interrupts at a higher priority than anything else on the system. With the current priority scheme, network interrupt handlers on multi-queue NICs can consume all CPU in the system when under heavy receive load (such as during a denial of service attack), preventing things like callouts from running. One example of this behavior is the way lagg LACP links "flap" when a machine is under a DOS attack, due to the lacp_tick() callout not running.

Rather than playing whack-a-mole and reworking important callouts to use different mechanisms and priorities, I propose de-prioritizing network interrupt threads to be below SWIs. This has survived testing and production traffic at Netflix, and (with a second patch) allows LACP links to remain up for a DOS attack lasting over an hour, as it allows callouts to run.

I'm looking for comments and testing, and I'm particularly concerned about services like NFS and iSCSI, where disks and network intersect, since this change makes disks higher priority than the network.

Diff Detail

Repository
R10 FreeBSD src repository

Event Timeline

gallatin created this revision.

My first thought about the high network priority was about serial ports without any hardware buffering and flow control, which are extremely sensitive to interrupt latency, but uart seems to use TTY priority, at least now. I'm not sure what else could have been so time-critical in networking before, but surely there could be NICs with very few RX buffers. Typical disk hardware, though, should have much higher latency tolerance, since the amount of traffic is usually predetermined, being initiated from the system; that is why, I guess, its priority was set lower.

iSCSI should not care about these priority levels, since at least for software iSCSI all of its receive/transmit goes through separate threads with lower (PVM) kernel priority. I don't remember right now how the NFS client works, but I suspect something alike. The iSCSI target and NFS server both run at even lower priority, so they don't care about this at all.

Speaking about SWIs, IIRC there is "intr/swi1: netisr 0". I'm not sure what it does these days, but you may wish to also check its priority relative to that of "intr/swi4: clock".

I wonder, if your system is so busy that it can't even run callouts, what good use do you have for it? It obviously can't run user-space, or even the kernel, normally either. Or is the goal to make it just sit comfortably and wait indefinitely for the DoS to end?

sys/sys/priority.h
99

Numerically, the (PRI_MIN_REALTIME - 4) line should come before PRI_MIN_REALTIME.

In D29663#665198, @mav wrote:

My first thought about the high network priority was about serial ports without any hardware buffering and flow control, which are extremely sensitive to interrupt latency, but uart seems to use TTY priority, at least now. I'm not sure what else could have been so time-critical in networking before, but surely there could be NICs with very few RX buffers. Typical disk hardware, though, should have much higher latency tolerance, since the amount of traffic is usually predetermined, being initiated from the system; that is why, I guess, its priority was set lower.

It may have been a carrying-forward of an old hack from when bde made ppp/slip go at line rate. Having the net have a higher priority meant less latency when it was time to send a packet, which meant that we'd have fewer dead times on the line... though it was a long time ago and I've not done the deep dive...

As I noted to gallatin@ when he asked via email...
I do not think this will negatively affect NFS.

Increased delays handling received RPC messages could slow NFS RPC responses, but if the system is getting hammered by net traffic, NFS will probably not work well anyhow, I think.

What about the netisr threads ?

sys/sys/priority.h
99

Indeed, but then it would precede the definition. I thought about defining it in terms of PI_SWI(), but the SWIs are defined in another file (interrupt.h), and I didn't want to entangle things. What I did seems simplest, even if they are out of order numerically.

In D29663#665198, @mav wrote:

My first thought about the high network priority was about serial ports without any hardware buffering and flow control, which are extremely sensitive to interrupt latency, but uart seems to use TTY priority, at least now. I'm not sure what else could have been so time-critical in networking before, but surely there could be NICs with very few RX buffers. Typical disk hardware, though, should have much higher latency tolerance, since the amount of traffic is usually predetermined, being initiated from the system; that is why, I guess, its priority was set lower.

iSCSI should not care about these priority levels, since at least for software iSCSI all of its receive/transmit goes through separate threads with lower (PVM) kernel priority. I don't remember right now how the NFS client works, but I suspect something alike. The iSCSI target and NFS server both run at even lower priority, so they don't care about this at all.

Speaking about SWIs, IIRC there is "intr/swi1: netisr 0". I'm not sure what it does these days, but you may wish to also check its priority relative to that of "intr/swi4: clock".

Ideally it does nothing, as net.isr.dispatch=direct is the standard. When net.isr.dispatch is not direct, packets are queued from NIC drivers to those netisr threads. I'd argue that having the netisr threads run at a higher priority than the driver interrupts is good, as that gives them more of a chance to drain and reduces the chance of livelock.

I wonder, if your system is so busy that it can't even run callouts, what good use do you have for it? It obviously can't run user-space, or even the kernel, normally either. Or is the goal to make it just sit comfortably and wait indefinitely for the DoS to end?

There are always brief gaps when things run (just not clocked at exactly 1Hz as lacp_tick() expects), but if the link is flapping, most of those gaps are taken up with re-establishing the link. With the link not flapping, those gaps can be used to communicate with our control infrastructure to report the DOS and establish firewall rules to block it.

With advanced features such as NIC KTLS and RATELIMIT that are tied to individual lacg ports, a flap can be expensive, as it can require tearing down tens or hundreds of thousands of per-connection KTLS or pacing states on the underlying NICs. While CPU is this scarce, such state management is not the best use of it. Currently, clearing this state involves terminating KTLS connections that might otherwise not have made a request during a brief DOS (and hence would not have been impacted), but now have to be re-established.

So, all in all, I'd like very much to avoid flaps.

What about the netisr threads ?

Are you asking because you think they should be de-prioritized as well? Or because you're concerned that they're now higher priority than NIC ithreads? I'm not sure about the former; we run with netisr direct dispatch, so I don't have a good way to test their behavior in this situation. Under anything but the most moderate load, performance collapses when netisr is not doing direct dispatch (which makes me think the netisr threads should be removed). In terms of NIC ithreads being lower priority, I think that's good. If we're getting traffic faster than it can be processed, then the earlier we drop things, the better.

I included rwatson on this review so he could chime in about netisr, as that was one of the things I'm concerned about.

multi-queue NICs can consume all CPU

Why can't you reserve a CPU core for other stuff, and just let the N-1 CPUs handle whatever they can?

I mean, playing with IRQ priorities doesn't solve anything. It just moves the problem somewhere else.
Now disk drivers will be the next bottle neck?

--HPS

I think this change should be reviewed more widely than just to close a problem with LACP. Why do we prioritize servicing interrupts that are triggered by remote agents higher than our own callout routines? Note that NICs, unlike disks, are a source of "foreign" interrupts. Also, the nature of the network is lossy, while callouts are supposed to be executed on time.

multi-queue NICs can consume all CPU

Why can't you reserve a CPU core for other stuff, and just let the N-1 CPUs handle whatever they can?

What if I don't have enough cores for that?

I mean, playing with IRQ priorities doesn't solve anything. It just moves the problem somewhere else.
Now disk drivers will be the next bottle neck?

--HPS

In any normal workload, there is no problem. That's why the priorities have been like this for 20+ years. This stuff seems like it all came from SPLs on the PDP-11. I kind of wonder why we even need to have different priorities for different types of ithreads at all. It seems like if something has tight latency requirements, it should be handled in a primary, non-threaded, interrupt context and/or use a higher priority taskqueue or thread.

Note that disks are mostly self-limiting and are not exposed to potential attackers. E.g., my kernel has requested every interrupt it gets from a disk. So if I run my machine into the ground with disk interrupts, I deserve whatever trouble I've created. Contrast this with NICs, which may be on the public-facing internet and can receive unsolicited traffic from any host in the world.

I think hps@'s idea is a good one, if only done when there are enough CPUs. Maybe something like:

if ncpus > 4
   - only allow ncpus - 1 to be assigned to a certain thread type (net interrupt or ...)

--> To try and avoid starvation in general.

To cover the case of fewer cores, I think the current patch is reasonable, and the above is worth considering for a future patch.

As noted, a sustained flood of disk interrupts seems far less likely than a sustained flood of network interrupts during something like a DOS attack.

My first thought was much closer to Gleb's comment. That is, I think the mistake is probably to be lumping callouts in with other soft interrupts. I think probably what we want is something about like this:

- timer interrupt threads (callouts)
- device interrupt threads (and I'm not really sure if we need distinct priorities per type anymore)
- software interrupt threads (which mostly shouldn't exist... I've wanted to make busdma just create its own kthread and retire swi_vm(), for example, but have never gotten around to it)

In general the priorities were set based on the relative priorities of the SPL masks from pre-SMPng and then never changed. Even in pre-SMPng, splclock() was kind of magic in that it was "below" splnet() and splbio() in some ways, but it was as high as splhigh() in some other ways. I think those oddities were in fact a recognition of the importance of timer events. I don't know what the relative priorities of different classes of ithreads should really be at this point, or if the interrupt types even make sense anymore (other than the epoch abuse for INTR_TYPE_NET). For this particular change, I think what I would suggest is to move all the hardware ithread values down by 4 and define a new PI_TIMERS or some such that has the current value of PI_AV. I wonder how hard it would be to just kill PI_SWI and pass an explicit priority when creating SWIs to make this easier. I can take a look at that.

My only other comment is that SWIs in general are an archaic concept. In pre-SMPng they were triggered via splx() so we had to have bits for them in the intrmask_t, but now they are just threads. taskqueue_swi is quite pointless for example compared to just the plain thread taskqueue. Most of the SWIs should really just become kthreads. It might remove some overhead from callouts if they used dedicated kthreads that didn't loop around a single handler but just ran the callout handler directly as their main.

I think hps@'s idea is a good one, if only done when there are enough cpus. Maybe something like:

if ncpus > 4
   - only allow ncpus - 1 to be assigned to
I'd very much prefer not to do that. We try to run one NIC rx ithread per core because a lot of our actual work is done in the NIC rx context, and losing a core would not be optimal.