
mlx5en(4): Optimize ratelimit support to handle more frequent rate changes.
Needs Review · Public

Authored by hselasky on Apr 18 2023, 1:24 PM.
Details

Reviewers
gallatin
rrs
kib
Summary

This change is more or less a rewrite of ratelimit support in
mlx5en(4). The main goal is to remove sleepable mutexes, unnecessary
task switching and unnecessary firmware commands when there are
frequent rate changes, typically when used together with BBR.

Changes in mlx5core:

The rate limit tables have been made global to all mlx5en(4) devices.
There are two tunable sysctl(8) knobs which allow configuring the
available rates:

hw.mlx5.rates.cx4
hw.mlx5.rates.cx5

The cx4 entry is used for supporting older hardware and has only 13
entries. The cx5 entry is used for supporting newer hardware and
allows for more rates, currently up to 30 different rates. The default
values are spaced logarithmically from 1 Mbit/s to 25 Gbit/s, the
maximum supported by the packet pacing set socket option command, which
takes a 32-bit unsigned value in bytes per second.
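
As a rough check of the units involved, the sketch below converts the two endpoint rates to bytes per second and tests whether they fit in a 32-bit unsigned value. The helper names are hypothetical and not part of the driver, and decimal units (1 Mbit = 10^6 bits) are assumed.

```c
#include <assert.h>
#include <stdint.h>

/* Convert a line rate in bits per second to bytes per second. */
static uint64_t
bits_to_bytes_per_sec(uint64_t bits_per_sec)
{
	return (bits_per_sec / 8);
}

/* Return 1 if the rate can be carried in a 32-bit unsigned bytes/s value. */
static int
rate_fits_u32(uint64_t bytes_per_sec)
{
	return (bytes_per_sec <= UINT32_MAX);
}

/* Self-check of the endpoints quoted in the summary (decimal units assumed). */
static void
check_default_range(void)
{
	/* 1 Mbit/s is 125000 bytes/s. */
	assert(bits_to_bytes_per_sec(1000000ULL) == 125000ULL);
	/* 25 Gbit/s is 3125000000 bytes/s, which still fits in 32 bits. */
	assert(rate_fits_u32(bits_to_bytes_per_sec(25000000000ULL)));
}
```

Calling check_default_range() passes silently for both endpoints, while a rate such as 40 Gbit/s (5000000000 bytes/s) would not fit.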

Each entry in these knobs is a rate in bytes per second followed by a
burst size in bytes.

Because the rates are fixed, no serializing or refcount mechanism is
needed when looking up the schedule queue by rate.
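
To illustrate why a fixed table needs no locking, here is a minimal sketch of a read-only lookup. The struct, the function name and the "first entry at or above the requested rate" policy are assumptions for illustration only; the actual matching policy in the patch may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical table entry: rate in bytes/s, then burst size in bytes. */
struct rate_entry {
	uint64_t rate;
	uint32_t burst;
};

/*
 * Return the index of the first entry whose rate is >= the requested
 * rate, or the last index when the request exceeds every entry.
 * The table must be non-empty and sorted by ascending rate. It is
 * only read, never modified, so no lock or refcount is taken.
 */
static size_t
rate_lookup(const struct rate_entry *tbl, size_t n, uint64_t rate)
{
	size_t lo, hi;

	assert(tbl != NULL && n > 0);
	lo = 0;
	hi = n - 1;
	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (tbl[mid].rate < rate)
			lo = mid + 1;
		else
			hi = mid;
	}
	return (lo);
}
```

Because the table never changes after setup, any number of transmit paths can call such a function concurrently; the returned index can then serve directly as the schedule queue selector.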

Changes in mlx5en:

The support is split into two parts:

Part one is for devices not supporting the QOS remap WQE command,
refer to "hw.mlx5.rates.use_multi_sq=1". For each supported rate,
there are "dev.mce.<n>.conf.channels" buckets of SQ's which are either
in use or free. Actually there are three bucket heads, one for free
SQ's, one for in use SQ's and one for SQ's which are about to be
recycled. A fixed minimum of 1/16th of the in use SQ's are allocated
for the free list. When the software wants to change the rate, a new SQ
is allocated from the new rate, and when the current SQ becomes empty
it is released to the pool of recycled SQ's for the old rate. When two
SQ's are allocated for the same send tag, the second SQ halts
transmission until the first SQ is empty.
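
The hand-over between the old and the new SQ can be modelled roughly as follows. This is a simplified sketch, not the driver code: all type and function names are hypothetical, and it assumes at most one rate change is pending per send tag.

```c
#include <assert.h>
#include <stddef.h>

enum sq_state { SQ_FREE, SQ_IN_USE, SQ_RECYCLE };

/* Hypothetical send queue: rate bucket, list membership, queued work. */
struct sq {
	int rate_index;
	enum sq_state state;
	unsigned pending;	/* packets queued but not yet completed */
};

/* Hypothetical send tag: the active SQ plus an optional successor. */
struct send_tag {
	struct sq *cur;
	struct sq *next;
};

/* Retire the drained current SQ and promote the waiting one, if any. */
static void
tag_try_switch(struct send_tag *t)
{
	if (t->next == NULL || t->cur->pending != 0)
		return;
	t->cur->state = SQ_RECYCLE;	/* back to the old rate's pool */
	t->cur = t->next;
	t->next = NULL;
}

/* Attach a free SQ taken from the new rate; it waits for the old one. */
static void
tag_change_rate(struct send_tag *t, struct sq *new_sq)
{
	assert(new_sq->state == SQ_FREE && t->next == NULL);
	new_sq->state = SQ_IN_USE;
	t->next = new_sq;
	tag_try_switch(t);
}

/* Only the current SQ may transmit; a waiting SQ is halted. */
static int
tag_may_transmit(const struct send_tag *t, const struct sq *q)
{
	return (q == t->cur);
}
```

In this model tag_try_switch() would be called again from the completion path whenever the pending count of the current SQ drops, so the switch happens as soon as the old SQ is empty.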

Part two is for devices supporting the QOS remap WQE command, refer to
"hw.mlx5.rates.use_multi_sq=0". There is one global "rate" containing
"dev.mce.<n>.conf.channels" buckets of SQ's which are either in use or
free. When the rate or schedule queue index changes, an asynchronous
WQE command is queued. As soon as the WQE command completes, the rate
of the current SQ changes.

When changing any of the following parameters:

dev.mce.<N>.rate_limit.tx_completion_fact
dev.mce.<N>.rate_limit.tx_coalesce_mode
dev.mce.<N>.rate_limit.tx_coalesce_pkts
dev.mce.<N>.rate_limit.tx_coalesce_usecs
dev.mce.<N>.rate_limit.tx_queue_size

the new value of the parameters will take effect only after all the
free SQ's are recycled. This is to avoid too much load on the command
handler on a live running system. It is recommended to configure these
parameters, if any, prior to loading the driver or in
/boot/loader.conf.

The ratelimit support has a single worker thread which is responsible
for keeping a minimum of 1/16th, rounded up, of the SQ's available for
new connections at any time.
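
The "1/16th rounded up" target is a ceiling division. A minimal sketch, assuming the base of the fraction is the number of in use SQ's as described for the multi-SQ case (the function name is hypothetical):

```c
#include <assert.h>
#include <limits.h>

/* Minimum number of free SQ's to keep: 1/16th of in-use, rounded up. */
static unsigned
min_free_sqs(unsigned in_use)
{
	assert(in_use <= UINT_MAX - 15);	/* guard the addition below */
	return ((in_use + 15) / 16);
}
```

For example, 16 in use SQ's give a target of 1 free SQ, 17 give 2, and 100 give 7.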

MFC after: 1 week
Sponsored by: NVIDIA Networking

Test Plan

The plan is to get this in before FreeBSD 14 is released!

Applies on FreeBSD-14 main as of today!

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Skipped
Unit
Tests Skipped