refactoring in support of *future* change to cope with slow configuration path on INTC and BRCM drivers
Abandoned · Public

Authored by kmacy on Aug 10 2017, 10:42 PM.

Details

Reviewers
sbruno
gallatin
jtl
scottl
Group Reviewers
Intel Networking
Summary

High level summary:
A) FreeBSD network interface configuration requires that device setup happen synchronously.
B) The INTC and BRCM drivers poll hardware for up to hundreds of ms, waiting for firmware to set a completion bit, whenever the driver interacts with hardware for anything other than transmit / receive. drewg encountered problems caused by this at NFLX when iflib changed the stats update interval from hz to hz/2.
C) A logical approach would be to yield the processor while waiting for completion. However, this is not possible while holding a default mutex, as is the current convention in essentially all network drivers.

In short, the current locking convention cannot satisfy both upper-stack requirements and driver requirements while maintaining reasonable latency. This change attempts to deal with the issue by changing the softc lock to an sx lock and by making operations that previously happened synchronously with a default mutex held asynchronous.
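A minimal sketch of the locking change being described (identifiers here are illustrative, not the actual iflib symbols):

```c
struct softc_sketch {
	struct mtx	sc_mtx;	/* old convention: cannot sleep while held */
	struct sx	sc_sx;	/* proposed: unbounded sleep is legal */
};

/* Old convention: busy-wait on firmware with the default mutex held. */
mtx_lock(&sc->sc_mtx);
while (!fw_done(sc))		/* fw_done() is hypothetical */
	DELAY(10);		/* may spin for hundreds of ms */
mtx_unlock(&sc->sc_mtx);

/* With an sx lock the thread may yield the CPU while waiting instead. */
sx_xlock(&sc->sc_sx);
while (!fw_done(sc))
	pause("fwwait", 1);	/* sleep a tick; illegal under a mutex */
sx_xunlock(&sc->sc_sx);
```

This only sketches the motivation in point C; the actual diff touches many more paths than the wait loop.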

I've fixed all the problems I've encountered, but make no claim to have tested comprehensively.

Diff Detail

Repository
rS FreeBSD src repository
kmacy created this revision.Aug 10 2017, 10:42 PM
kmacy edited the summary of this revision. (Show Details)Aug 10 2017, 10:46 PM
jtl edited edge metadata.Aug 11 2017, 4:03 AM

At a high level...

I'm not sure what the switch from mtx to sx is intended to accomplish. I didn't see any places in the code where you actually took advantage of the ability to sleep. Perhaps I missed it? If we're not going to sleep, I don't think I understand the advantage of the change.

I'm also unsure about the benefit of deferring some of the work to run asynchronously in a different thread. Again, I may be missing something, but it looks to me like the same code runs, but it is deferred. In some cases, this has user-visible side effects. For example, there could be a period of time after an ioctl call has "successfully" completed when the behavior will be different than expected. This might be OK if the behavior was well-documented and there was a good reason for it. But, I'm not sure the change description explains why this is necessary. (Again, I might be missing something.)

sys/dev/e1000/e1000_80003es2lan.c
411

I'm not sure what this check accomplishes.

There are numerous problems with these hand-rolled locks.

First, precisely because they are NOT enclosed in OS-level locks, you can end up with priority inversions that keep the thread that has the lock from making progress while a higher-priority thread is spinning to try to acquire the lock.

Second, I think it is possible for two different threads on two different CPUs to run this code simultaneously and both think they end up owning the lock.

(There may be more problems...)

I think that what we are acquiring here is a lock that is shared by both the operating system and the firmware on the NIC. If so, it seems to me that the answer is to wrap any of these shared HW/SW locks in OS-level locks. That makes sure that the OS side of things behaves roughly as expected (priority propagation, avoiding races, etc.).
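The suggestion above might look roughly like the following (hypothetical names; the real e1000 semaphore routines differ): serialize the firmware-shared semaphore under a proper OS lock so the kernel's priority propagation and race avoidance cover the OS side of the handshake.

```c
/* One OS-level lock guarding the HW/SW semaphore (sketch only). */
static struct sx swfw_os_lock;

static int
acquire_swfw_sketch(struct hw *hw, uint16_t mask)
{
	int error;

	/* OS threads serialize here, with normal WITNESS/priority rules. */
	sx_xlock(&swfw_os_lock);
	/* Only then touch the bit shared with firmware (hypothetical call). */
	error = hw_semaphore_acquire(hw, mask);
	if (error != 0)
		sx_xunlock(&swfw_os_lock);	/* release on failure */
	return (error);
}
```

The matching release path would drop the firmware bit first and then `sx_xunlock()`.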

sys/net/iflib.c
529

These should probably just be deleted.

gallatin edited edge metadata.Aug 11 2017, 1:18 PM

It has been years since I've had to think about this, but I remember that the problem all drivers fight is that your ioctl routine can be called with some lock held by the caller, so you cannot sleep. Is that still the issue? Can you remind me what lock is held?

sys/dev/e1000/e1000_80003es2lan.c
411

Are all these horrific swfw_sync routines now guaranteed to be called from the same taskq context?

sys/net/iflib.c
5237

How does the actual error status get returned via ifioctl? If I'm reading this right, all you can return via this mechanism is whether you were able to enqueue the task, not the status of the task's work.

erj added a subscriber: erj.Aug 11 2017, 3:58 PM

Can you remind me what lock is held?

I don't know about other ioctls, but I remember there being a lock held when IPv6 multicast MAC addresses are added.

Unrelated to that, for ixl(4), nvmupdate uses ioctls to do the update, but it needs the error code returned from the ioctl call (from the driver) to work. So ixl(4) wouldn't work with this change.

kmacy added inline comments.Aug 11 2017, 7:06 PM
sys/dev/e1000/e1000_80003es2lan.c
411

No. They still *also* need to be called from init, which *has* to be synchronous - hence the changes.

411

This checks that non-sleepable locks aren't held across this function. It's alerted me to a number of issues already. I agree with the need for a more complete fix. However, this is *incremental*. Any real fix will build on this.
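The kind of check being described could look like this (a sketch, not necessarily the code in the diff): assert on entry that no non-sleepable lock is held, since this path may sleep waiting on firmware.

```c
static int
swfw_sync_sketch(struct e1000_hw *hw, uint16_t mask)
{
	/*
	 * Under a WITNESS kernel this warns if the caller holds any
	 * non-sleepable lock across a path that may sleep.
	 */
	WITNESS_WARN(WARN_GIANTOK | WARN_SLEEPOK, NULL,
	    "swfw_sync_sketch: non-sleepable lock held");
	/* ... acquire the HW/SW semaphore ... */
	return (0);
}
```

This surfaces bad callers at runtime without yet changing how the semaphore itself works, which matches the incremental framing above.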

sys/net/iflib.c
529

yup.

5237

Currently these are only used for actions that can't fail or that the driver won't report failure on. The model is busted to begin with, in that these hardware transactions can take longer than non-sleepable locks should be held. Ideally multicast would be reworked, but it's not in anyone's critical path.

kmacy added a comment.Aug 11 2017, 7:15 PM

I talked to Kevin. I may need to clarify my thinking a bit more.

I want to decouple these softc locking changes from the "bogolock" rewrite. Moving to an sx lock for the softc runs counter to long-established convention, so the risk of fallout is non-zero. Additionally, the bogolock has been in there for so long that I'm wary of Chesterton's fence: I don't want to straight up rip out something that is undocumented and that I don't fully understand. I said the same thing about rewriting the bogolock itself in an earlier mail. It's not a question of what needs to be done, but of how to proceed from a software engineering perspective.

kmacy added a comment.Aug 11 2017, 7:21 PM

I'm also unsure about the benefit of deferring some of the work to run asynchronously in a different thread. Again, I may be missing something, but it looks to me like the same code runs, but it is deferred. In some cases, this has user-visible side effects. For example, there could be a period of time after an ioctl call has "successfully" completed when the behavior will be different than expected. This might be OK if the behavior was well-documented and there was a good reason for it. But, I'm not sure the change description explains why this is necessary. (Again, I might be missing something.)

Please review the FreeBSD locking hierarchy. SX locks can't be acquired after default mutexes. The async ioctl isn't an optimization. It's a coping mechanism.
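For illustration, the ordering rule being referenced (a hypothetical fragment, not code from the diff):

```c
/*
 * A sleepable sx lock may not be acquired while a default
 * (non-sleepable) mutex is held.  Under a WITNESS kernel this
 * fragment trips an assertion at runtime.
 */
mtx_lock(&sc->sc_mtx);		/* non-sleepable */
sx_xlock(&sc->sc_sx);		/* invalid: sleepable lock after a mutex */
```

Hence, when the upper stack calls into the driver with a mutex held, the sx-protected work cannot run inline and has to be deferred.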

kmacy added a comment.Aug 11 2017, 7:32 PM
In D11969#248529, @erj wrote:

Can you remind me what lock is held?

I don't know about other ioctls, but I remember there being a lock held when IPv6 multicast MAC addresses are added.

Unrelated to that, for ixl(4), nvmupdate uses ioctls to do the update, but it needs the error code returned from the ioctl call (from the driver) to work. So ixl(4) wouldn't work with this change.

The async ioctl is only needed when called from a context holding a default mutex. If you're doing the blocking wait in a non-sleepable context and you need the error code, then we'll need to re-work the caller.
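The deferral pattern under discussion might be sketched like this (hypothetical names; the real iflib code differs), which also illustrates gallatin's point that only the enqueue status, not the work's result, can be returned to the caller:

```c
/* Runs later in taskqueue context, where sleeping is legal. */
static void
if_multi_task_sketch(void *arg, int pending)
{
	struct softc_sketch *sc = arg;

	sx_xlock(&sc->sc_sx);		/* safe: taskqueue thread may sleep */
	program_multicast_filter(sc);	/* hypothetical slow HW operation */
	sx_xunlock(&sc->sc_sx);
}

/*
 * Caller, possibly holding a default mutex: cannot take the sx lock
 * here, so enqueue and return.  Any error from the HW operation is
 * lost to the ioctl caller.
 */
taskqueue_enqueue(sc->sc_tq, &sc->sc_multi_task);
```

Callers that need the real error code (as in the ixl(4) nvmupdate case above) would have to block on the task instead, which is only possible from a sleepable context.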

jtl added a comment.Aug 12 2017, 10:15 AM

Please review the FreeBSD locking hierarchy. SX locks can't be acquired after default mutexes.

I'm familiar with the FreeBSD locking hierarchy. What I don't see (and, since I may be missing it, I'm asking you to point it out) is where the code changes actually require the unbounded sleep that SX locks allow. Understanding this will help me to better understand the rationale for this change.

kmacy added a comment.Aug 12 2017, 7:24 PM
In D11969#248789, @jtl wrote:

Please review the FreeBSD locking hierarchy. SX locks can't be acquired after default mutexes.

I'm familiar with the FreeBSD locking hierarchy. What I don't see (and, since I may be missing it, I'm asking you to point it out) is where the code changes actually require the unbounded sleep that SX locks allow. Understanding this will help me to better understand the rationale for this change.

As mentioned previously I'd like to make the locking change separate from the swfw_sync rewrite.

kmacy retitled this revision from incremental fixes towards coping with slow configuration path on INTC and BRCM drivers to refactoring in support of *future* change to cope with slow configuration path on INTC and BRCM drivers.Aug 12 2017, 7:42 PM
kmacy abandoned this revision.Aug 24 2017, 10:07 PM

Folded into D12101