
mlx5: Pad sq locks to minimize false sharing
AbandonedPublic

Authored by markj on Feb 5 2021, 5:58 PM.
Details

Summary

The mlx5_en send queue contains two mutexes, one used by the xmit path and
one by the completion interrupt ithread. They are adjacent in the structure
and end up sharing a cache line. Use mtx_padalign instead.
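
For context, a minimal sketch of the kind of change being proposed; the
structures below are illustrative stand-ins, not the actual mlx5_en layout:

#include <sys/param.h>
#include <sys/lock.h>
#include <sys/mutex.h>

/*
 * Before: two plain mutexes laid out back to back.  struct mtx is
 * much smaller than a cache line, so both locks can share one line,
 * and the xmit and completion CPUs bounce it between their caches.
 */
struct sq_before {
	struct mtx		lock;		/* xmit path */
	struct mtx		comp_lock;	/* completion ithread */
};

/*
 * After: struct mtx_padalign pads and aligns each mutex to
 * CACHE_LINE_SIZE, so each lock gets a cache line to itself.
 */
struct sq_after {
	struct mtx_padalign	lock;		/* xmit path */
	struct mtx_padalign	comp_lock;	/* completion ithread */
};

A struct mtx_padalign can be passed to the usual mtx(9) lock and unlock
macros, so a change like this should be confined to the structure
definition.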

I considered moving the comp_lock to group it with other fields modified
by the tx completion path, but mlx5_en splits the structure into
"static" and non-static regions for initialization purposes, so this is a
bit hairy.
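
To illustrate the hazard, a hedged sketch of that split; the field names
are hypothetical, patterned on the mlx5e_sq_zero_start marker mentioned
later in this review:

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/stddef.h>
#include <sys/lock.h>
#include <sys/mutex.h>

/*
 * Hypothetical layout following the zero-start convention: fields
 * above the marker are "static" (set up once when the queue is
 * created); everything from the marker down is zeroed on each reset.
 */
struct sq_sketch {
	struct mtx	lock;		/* static: initialized once */
	struct mtx	comp_lock;	/* static: initialized once */

	uint32_t	zero_start;	/* reset region begins here */
	uint16_t	pc;		/* hypothetical producer counter */
	uint16_t	cc;		/* hypothetical consumer counter */
};

static void
sq_reset(struct sq_sketch *sq)
{
	/*
	 * Only the region from zero_start onward is wiped.  Moving
	 * comp_lock down next to the completion-path fields would put
	 * an initialized mutex inside this memset, which is why the
	 * rearrangement is hairy.
	 */
	memset(&sq->zero_start, 0,
	    sizeof(*sq) - offsetof(struct sq_sketch, zero_start));
}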

Diff Detail

Repository
rS FreeBSD src repository - subversion

Event Timeline

markj requested review of this revision. Feb 5 2021, 5:58 PM

kib added a comment:

Was it based on some measurements?

In D28508#637980, @kib wrote:

Was it based on some measurements?

I'm still trying to find a way to measure some difference. hwpmc is not functional on this particular platform, and my benchmark is limited by memory bandwidth. I just spotted this because there is moderate contention for sq locks in my configuration and I noticed that the locks are grouped together even though the rest of the structure is careful to separate fields updated by xmit vs. completion paths.

I don't mean to commit this unless I can measure some difference, but wanted to show the patch in case you prefer some other approach.

This revision is now accepted and ready to land. Feb 5 2021, 7:02 PM

Both locks are typically locked from the same CPU. Can you show some benefits of this, or is it just theoretical?

Both locks are typically locked from the same CPU. Can you show some benefits of this, or is it just theoretical?

It's theoretical. I can't find any meaningful difference even in a synthetic scenario where dozens of CPUs are using the same send queue. This is on a single-socket arm64 platform, by the way.

I noticed that mlx5 does not bind interrupts or ithreads by default. So doesn't it require some manual configuration to ensure that the producer and consumer threads are on the same CPU?

It binds interrupts only if RSS is enabled. Let's add Drew for comments.

The producer and consumer certainly may be on different CPUs.

I wonder: Would it be better to sort the mutexes into the existing producer / consumer separation, and insert a cacheline-sized alignment directive there?

The producer and consumer certainly may be on different CPUs.

I wonder: Would it be better to sort the mutexes into the existing producer / consumer separation, and insert a cacheline-sized alignment directive there?

I tend to think so, but that requires some larger changes to sq initialization (see the use of mlx5e_sq_zero_start). I can't really see why an optimization like that is needed, since sqs are allocated fairly rarely(?).
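
For reference, a hedged sketch of the alternative layout being discussed;
everything other than the two locks is a hypothetical stand-in:

#include <sys/param.h>	/* CACHE_LINE_SIZE, __aligned() */
#include <sys/lock.h>
#include <sys/mutex.h>

struct sq_grouped {
	/* Producer (xmit) fields, with the lock that guards them. */
	struct mtx	lock;
	uint16_t	pc;	/* hypothetical producer counter */

	/* Consumer (completion) fields start on a fresh cache line. */
	struct mtx	comp_lock __aligned(CACHE_LINE_SIZE);
	uint16_t	cc;	/* hypothetical consumer counter */
};

This keeps each lock on the same cache line as the fields its path already
touches, rather than giving each lock a whole padded line of its own, at
the cost of the initialization rework described above.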