
mlx5: Pad sq locks to minimize false sharing
AbandonedPublic

Authored by markj on Feb 5 2021, 5:58 PM.
Details

Summary

The mlx5_en send queue contains two mutexes, one used by the xmit path and
one by the completion interrupt ithread. They are adjacent in the structure
and end up sharing a cache line. Use mtx_padalign instead.
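
For context, a minimal sketch of the kind of change being proposed; the
structures below are illustrative stand-ins, not the actual mlx5_en layout:

#include <sys/param.h>
#include <sys/lock.h>
#include <sys/mutex.h>

/*
 * Before: two plain mutexes laid out back to back.  struct mtx is
 * much smaller than a cache line, so both locks can share one line,
 * and the xmit and completion CPUs bounce it between their caches.
 */
struct sq_before {
	struct mtx		lock;		/* xmit path */
	struct mtx		comp_lock;	/* completion ithread */
};

/*
 * After: struct mtx_padalign pads and aligns each mutex to
 * CACHE_LINE_SIZE, so each lock gets a cache line to itself.
 */
struct sq_after {
	struct mtx_padalign	lock;		/* xmit path */
	struct mtx_padalign	comp_lock;	/* completion ithread */
};

A struct mtx_padalign can be passed to the usual mtx(9) lock and unlock
macros, so a change like this should be confined to the structure
definition.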

I considered moving the comp_lock to group it with other fields modified
by the tx completion path, but mlx5_en splits the structure into
"static" and non-static regions for initialization purposes, so this is a
bit hairy.
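
To illustrate the hazard, a hedged sketch of that split; the field names
are hypothetical, patterned on the mlx5e_sq_zero_start marker mentioned
later in this review:

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/stddef.h>
#include <sys/lock.h>
#include <sys/mutex.h>

/*
 * Hypothetical layout following the zero-start convention: fields
 * above the marker are "static" (set up once when the queue is
 * created); everything from the marker down is zeroed on each reset.
 */
struct sq_sketch {
	struct mtx	lock;		/* static: initialized once */
	struct mtx	comp_lock;	/* static: initialized once */

	uint32_t	zero_start;	/* reset region begins here */
	uint16_t	pc;		/* hypothetical producer counter */
	uint16_t	cc;		/* hypothetical consumer counter */
};

static void
sq_reset(struct sq_sketch *sq)
{
	/*
	 * Only the region from zero_start onward is wiped.  Moving
	 * comp_lock down next to the completion-path fields would put
	 * an initialized mutex inside this memset, which is why the
	 * rearrangement is hairy.
	 */
	memset(&sq->zero_start, 0,
	    sizeof(*sq) - offsetof(struct sq_sketch, zero_start));
}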

Diff Detail

Repository
rS FreeBSD src repository - subversion

Event Timeline

markj requested review of this revision. Feb 5 2021, 5:58 PM

kib added a comment:

Was it based on some measurements?

In D28508#637980, @kib wrote:

Was it based on some measurements?

I'm still trying to find a way to measure some difference. hwpmc is not functional on this particular platform, and my benchmark is limited by memory bandwidth. I just spotted this because there is moderate contention for sq locks in my configuration and I noticed that the locks are grouped together even though the rest of the structure is careful to separate fields updated by xmit vs. completion paths.

I don't mean to commit this unless I can measure some difference, but wanted to show the patch in case you prefer some other approach.

This revision is now accepted and ready to land. Feb 5 2021, 7:02 PM

Both locks are typically locked from the same CPU. Can you show some benefits of this, or is it just theoretical?

Both locks are typically locked from the same CPU. Can you show some benefits of this, or is it just theoretical?

It's theoretical. I can't find any meaningful difference even in a synthetic scenario where dozens of CPUs are using the same send queue. This is on a single-socket arm64 platform, by the way.

I noticed that mlx5 does not bind interrupts or ithreads by default. So doesn't it require some manual configuration to ensure that the producer and consumer threads are on the same CPU?

It binds interrupts only if RSS is enabled. Let's add Drew for comments.

The producer and consumer certainly may be on different CPUs.

I wonder: Would it be better to sort the mutexes into the existing producer / consumer separation, and insert a cacheline-sized alignment directive there?

The producer and consumer certainly may be on different CPUs.

I wonder: Would it be better to sort the mutexes into the existing producer / consumer separation, and insert a cacheline-sized alignment directive there?

I tend to think so, but that requires some larger changes to sq initialization (see the use of mlx5e_sq_zero_start). I can't really see why an optimization like that is needed, since sqs are allocated fairly rarely(?).
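
For reference, a hedged sketch of the alternative layout being discussed;
everything other than the two locks is a hypothetical stand-in:

#include <sys/param.h>	/* CACHE_LINE_SIZE, __aligned() */
#include <sys/lock.h>
#include <sys/mutex.h>

struct sq_grouped {
	/* Producer (xmit) fields, with the lock that guards them. */
	struct mtx	lock;
	uint16_t	pc;	/* hypothetical producer counter */

	/* Consumer (completion) fields start on a fresh cache line. */
	struct mtx	comp_lock __aligned(CACHE_LINE_SIZE);
	uint16_t	cc;	/* hypothetical consumer counter */
};

This keeps each lock on the same cache line as the fields its path already
touches, rather than giving each lock a whole padded line of its own, at
the cost of the initialization rework described above.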