Page MenuHomeFreeBSD

mlx5en: do no call if_input() while holding the rq mutex
Needs RevisionPublic

Authored by kib on Sep 23 2024, 1:21 PM.
Tags
None
Referenced Files
Unknown Object (File)
Thu, Nov 21, 10:03 PM
Unknown Object (File)
Wed, Nov 20, 9:32 PM
Unknown Object (File)
Mon, Nov 18, 4:44 AM
Unknown Object (File)
Mon, Nov 18, 2:06 AM
Unknown Object (File)
Fri, Nov 15, 2:10 AM
Unknown Object (File)
Tue, Nov 12, 3:43 AM
Unknown Object (File)
Thu, Oct 31, 10:56 PM
Unknown Object (File)
Thu, Oct 31, 11:32 AM
Subscribers

Details

Reviewers
jhb
gallatin
Summary
Instead, collect a batch of the received mbufs and push them into upper
stack with temporary unlocked rq.

PR:     281469
Reviewed by:    Ariel Ehrenberg <aehrenberg@nvidia.com>

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Skipped
Unit
Tests Skipped

Event Timeline

kib requested review of this revision.Sep 23 2024, 1:21 PM
sys/dev/mlx5/mlx5_en/mlx5_en_rx.c
516

Note that you can pass a list of packets to ether_input where they are linked by m_nextpkt.

640–643

Could also use an mbufq which takes up less stack space (though this approach is probably fine).

This revision is now accepted and ready to land.Sep 25 2024, 2:13 PM
kib marked an inline comment as done.
kib edited the summary of this revision. (Show Details)

Use mbufq to collect the batch.

This revision now requires review to proceed.Sep 25 2024, 2:42 PM

I'm very afraid there will be performance implications due to new cache misses here from queing mbufs twice. On tens of thounsands of interfaces running over 8 years, we've never hit a deadlock from this lock, and I don't think fixing this is important enough to hurt performance for.

Instead of doing this, can you remove the lock entirely, change it to a sleepable lock, or reduce its scope instead? From some reading I did last week at the bug session, I gather that it exists only to guard against 2 rare conditions (doing work later because something failed to restock,the ring, and bringing the interface down).

This revision now requires changes to proceed.Sep 25 2024, 11:17 PM

No, the lock cannot be sleepable because the processing occurs in the context of the interrupt thread.

I would implemented something with blockcount_t or even epoch, but then I realized that it would not help. blockcount cannot be used because rx memory is not type-stable. Driver-private epoch might work, but then note that the second reported backtrace in the PR 281368 shows ip stack acquiring sleepable lock. So even if I try to fix driver, the stack still tries to sleep in ip_input().

I suspect you (Netflix) did not see the deadlock because you either do not use ipv6 or use it in situation with static network configuration. The problems are visible when multicast group membership is changed, at least this is what I see in the PR.

sys/dev/mlx5/mlx5_en/mlx5_en_rx.c
516

Can ether_input() be combined with software lro?

In D46761#1070797, @kib wrote:

No, the lock cannot be sleepable because the processing occurs in the context of the interrupt thread.

I would implemented something with blockcount_t or even epoch, but then I realized that it would not help. blockcount cannot be used because rx memory is not type-stable. Driver-private epoch might work, but then note that the second reported backtrace in the PR 281368 shows ip stack acquiring sleepable lock. So even if I try to fix driver, the stack still tries to sleep in ip_input().

I suspect you (Netflix) did not see the deadlock because you either do not use ipv6 or use it in situation with static network configuration. The problems are visible when multicast group membership is changed, at least this is what I see in the PR.

We do have a static network config in most places.

At any rate, let's see what we can do to remove the lock entirely. What is this mutex protecting?  In looking at the code, the only thing that can be called outside the interrupt context is mlx5e_post_rx_wqes() which is done in a callout if it fails in the interrupt context, so I guess its protecting the state for that?

Rather than than holding a mutex for this rare condition (I have not seen the rq callout called in hours of 100GbE load on a test box), why not just:

  • If we posted *any* wqes: 
    • continue as normal and do not schedule the callout, as we'll be called again
  • If we did not post any wqes: 
    • disable the irq and schedule the callout at EOI 
    • the callout function becomes a wrapper that calls mlx5e_post_rx_wqes() and re-enables irqs if it succeeds