mlx5en: do no call if_input() while holding the rq mutex
Needs RevisionPublic
Actions

Authored by kib on Sep 23 2024, 1:21 PM.

Details

Reviewers

jhb
gallatin

Summary

Instead, collect a batch of the received mbufs and push them into upper
stack with temporary unlocked rq.

PR:     281469
Reviewed by:    Ariel Ehrenberg <aehrenberg@nvidia.com>

Diff Detail

Repository

rG FreeBSD src repository

Lint

Lint Skipped

Unit

Tests Skipped

Event Timeline

kib created this revision.Sep 23 2024, 1:21 PM

Herald added subscribers: ae, imp. · View Herald TranscriptSep 23 2024, 1:21 PM

kib requested review of this revision.Sep 23 2024, 1:21 PM

kib added a reviewer: gallatin.Sep 25 2024, 8:44 AM

jhb added inline comments.Sep 25 2024, 2:13 PM

sys/dev/mlx5/mlx5_en/mlx5_en_rx.c
516	Note that you can pass a list of packets to ether_input where they are linked by m_nextpkt.
640–643	Could also use an mbufq which takes up less stack space (though this approach is probably fine).

jhb accepted this revision.Sep 25 2024, 2:13 PM

This revision is now accepted and ready to land.Sep 25 2024, 2:13 PM

Use mbufq to collect the batch.

This revision now requires review to proceed.Sep 25 2024, 2:42 PM

I'm very afraid there will be performance implications due to new cache misses here from queing mbufs twice. On tens of thounsands of interfaces running over 8 years, we've never hit a deadlock from this lock, and I don't think fixing this is important enough to hurt performance for.

Instead of doing this, can you remove the lock entirely, change it to a sleepable lock, or reduce its scope instead? From some reading I did last week at the bug session, I gather that it exists only to guard against 2 rare conditions (doing work later because something failed to restock,the ring, and bringing the interface down).

gallatin requested changes to this revision.Sep 25 2024, 11:17 PM

This revision now requires changes to proceed.Sep 25 2024, 11:17 PM

No, the lock cannot be sleepable because the processing occurs in the context of the interrupt thread.

I would implemented something with blockcount_t or even epoch, but then I realized that it would not help. blockcount cannot be used because rx memory is not type-stable. Driver-private epoch might work, but then note that the second reported backtrace in the PR 281368 shows ip stack acquiring sleepable lock. So even if I try to fix driver, the stack still tries to sleep in ip_input().

I suspect you (Netflix) did not see the deadlock because you either do not use ipv6 or use it in situation with static network configuration. The problems are visible when multicast group membership is changed, at least this is what I see in the PR.

sys/dev/mlx5/mlx5_en/mlx5_en_rx.c
516	Can ether_input() be combined with software lro?

In D46761#1070797, @kib wrote:

No, the lock cannot be sleepable because the processing occurs in the context of the interrupt thread.

I would implemented something with blockcount_t or even epoch, but then I realized that it would not help. blockcount cannot be used because rx memory is not type-stable. Driver-private epoch might work, but then note that the second reported backtrace in the PR 281368 shows ip stack acquiring sleepable lock. So even if I try to fix driver, the stack still tries to sleep in ip_input().

I suspect you (Netflix) did not see the deadlock because you either do not use ipv6 or use it in situation with static network configuration. The problems are visible when multicast group membership is changed, at least this is what I see in the PR.

We do have a static network config in most places.

At any rate, let's see what we can do to remove the lock entirely. What is this mutex protecting? In looking at the code, the only thing that can be called outside the interrupt context is mlx5e_post_rx_wqes() which is done in a callout if it fails in the interrupt context, so I guess its protecting the state for that?

Rather than than holding a mutex for this rare condition (I have not seen the rq callout called in hours of 100GbE load on a test box), why not just:

If we posted *any* wqes:
- continue as normal and do not schedule the callout, as we'll be called again
If we did not post any wqes:
- disable the irq and schedule the callout at EOI
- the callout function becomes a wrapper that calls mlx5e_post_rx_wqes() and re-enables irqs if it succeeds

Revision Contents
Changeset List

Path

Size

sys/

dev/

mlx5/

mlx5_en/

mlx5_en_rx.c

66 lines

Diff 143712

View Options

sys/dev/mlx5/mlx5_en/mlx5_en_rx.c

mlx5en: do no call if_input() while holding the rq mutexNeeds RevisionPublicActions