Fix 32-bit queue bandwidth limitation in pf and in the ALTQ token bucket regulator and HFSC scheduler
ClosedPublic
Actions

Authored by pkelsey on Aug 18 2018, 9:59 AM.

Details

Reviewers

jmallett
kp
loos

Commits

rS338209: Extended pf(4) ioctl interface and pfctl(8) to allow bandwidths of

Summary

I'd like to get these fixes into 12.0. The current behavior is goofy and limited, and this is something that is being used in production.

It is currently not possible to use pfctl(8) to configure queue bandwidth parameters greater than 2^32 - 1 bps (~4.29 Gbps). While such bandwidths can be represented in pf.conf, they will be silently wrapped at the 32-bit boundary when passed to the kernel due to limitations in the pf(4) ioctl interface. Further, the altq(4) token bucket regulator and scheduler implementations are in general written under the assumption of maximum bandwidth parameters of 1 Gbps or less.

This patch:

Updates the the pf(4) ioctl interface to use 64-bit bandwidth parameters instead of 32-bit, and to increase the token bucket regulator bucket depth parameters from 16-bits to 32-bits. This is done in a backwards compatible way, as explained below.

Replaces the 16-bit token bucket regulator bucket depth limit in the pfctl parser with a 32-bit one.

Updates the token bucket regulator computations in the kernel to function properly at higher bandwidths. The fixed point scale has been adjusted. After the various tradeoffs were made, there's enough headroom to operate without integer overflow up to about 137 Gbps and still have at least three significant decimal digits in the scaled rate parameter down to about 1 kbps.

Fixes a bug in the value of tbr_filluptime in the extreme case of a scaled rate of zero.

Removes tbr_get() as it isn't used, and the math in there has always overflowed for interface rates greater than 0.5 bps. Not worth fixing.

Updates the HFSC scheduler to function properly at higher bandwidths. The fixed point scale of the scaled inverse-m parameters had to be adjusted. Now the computational machinery will function without overflow and also maintain at least 3 significant decimal digits with service curve bandwidth parameters up to 100 Gbps and service curve time intervals up to about 88 seconds. The silly/useless chart in the parameter scaling comment has been removed and the comment clarified.

Separates the data type used to pass queue descriptions through the ioctl interface from the data type used to maintain queue state in the kernel.

Backwards compatibility:

The pf(4) ioctl interface has been extended by versioning the data structures that needed to change, and for each ioctl that now uses a versioned data structure, creating a distinct ioctl command code for each version (differentiated by the encoded parameter length). Where an ioctl command has multiple versions of its data structure, the kernel can identify and import any version supplied by userland, and can likewise export to any version supplied by userland. In general, version 0 of a data structure is identified by testing the parameter length encoded in the ioctl cmd, and where there might be more than two revisions to a data structure over time, or there exists some other ambiguity in version detection by size, version 1 adds an explicit version field that gets set by the caller to whatever the latest version was when the calling binary was built.

The struct tags for versioned data structures are the original tag with a _vX (_v0, _v1, etc) appended to them. In almost all cases (currently one exception), the original struct tag name becomes the name of a macro. In the kernel, the macro always resolves to the latest corresponding versioned struct tag - this prevents tons of rename noise in diffs when a new data structure revision is added. In userland, by default, the macros always resolve to version 0. This allows out-of-tree sources that use the pf(4) ioctl interface (e.g., pfstat and pftop in ports) to continue to build and work as before without modification. However, if PFIOC_USE_LATEST is defined before any of the pf and altq headers are included, the macros will resolve to the latest versions. This is used for in-tree consumers, which should always be kept up to date with interface changes, and can be used by out-of-tree consumers when updating to the latest interface versions.

Likewise, in userland, for ioctl command codes that become versioned, the original ioctl command code name becomes a macro that by default resolves to version 0, and if PFIOC_USE_LATEST is defined ahead of the includes, resolves to the latest version.

Other:

I plan on updating the pf(4) man page separately.

Note that only HFSC proper has been updated to handle higher bandwidths. If it is used in conjunction with RED/RIO or CODEL, there still be dragons in those. Future work in other schedulers can use the same interface extension techniques employed so far.

If a pfctl binary built against the version 0 (current) interface is run on a kernel with this patch and is used to configure queues having bandwidth parameters, those values will still wrap as before.

If a pfctl binary built against the interface in this patch is run on a kernel also built with this patch, and is used to configure HFSC queues with bandwidth parameters greater than 2^32 - 1 bps, then an older binary is used to show the status of those queues, the older binaries will show those parameters as saturated at 2^32 - 1 instead of as wrapped values.

If a pfctl binary built against the interface in this patch is run on a kernel also built with this patch, and is used to configure non-HFSC queues with bandwidth parameters greater than 2^32 - 1 bps, then pfctl will saturate those parameters at 2^32 - 1 bps prior to passing them to the kernel. That is, you can't use a new pfctl to feed greater-than-32-bit bandwidth parameters to schedulers that have not yet been updated to handle them. This isn't exactly a full protection against things going awry, as I think some of the other schedulers will have overflow issues before 2^32 - 1 bps is reached.

The bsnmpd altq MIB only has a 32-bit bandwidth parameter in its definition. When the patched one is run against a kernel with this patch, or an older one is run against a kernel with this patch that has been configured by a patched pfctl, it will report a saturated value for bandwidths greater than 32 bits.

Test Plan

Verified some HFSC configs, both over and under 2^32 - 1 bps, by observing rate control traffic as expected (so far up to about 9 Gbps, max run 30 hours).

Verified backwards-compatibility works as expected by running an older pfctl against the patched kernel, as well as a patched pfctl.

Verified clean world and kernel builds with and without ALTQ enabled in the kernel configs.

Tested ports that use pf(4) ioctl interface: pftop builds and runs (showing 32-bit saturated queue banwidths). pfstat builds - did not test its functionality.

Need to test more pf configurations.

Passed a universe build

Diff Detail

Repository

rS FreeBSD src repository - subversion

Lint

Lint Not Applicable

Unit

Tests Not Applicable

Event Timeline

pkelsey created this revision.Aug 18 2018, 9:59 AM

Herald added subscribers: farrokhi, ae. · View Herald TranscriptAug 18 2018, 9:59 AM

pkelsey edited the summary of this revision. (Show Details)Aug 18 2018, 10:18 AM

pkelsey edited the test plan for this revision. (Show Details)

Two minor remarks, at first glance. I've not dug into it at great depth, but your approach looks good and I've not seen any big problems.

This looks like great work, and I agree that we should try to get this in before the 12-freeze.

sbin/pfctl/parse.y
1629 ↗	(On Diff #46876)	Nit: yyerror("tbrsize too big: max %ud", UINT_MAX)?
sys/net/pfvar.h
1441 ↗	(On Diff #46876)	Shouldn't we keep the old one with the old name? What if userspace code tries to use 'DIOCADDALTQ'?

It'd be really awesome if you could also see about writing some basic tests (both as pfctl parser tests, to check we can now set higher bitrates for shaping, and as a basic test that shaping actually happens), but that's certainly not blocking.

jmallett accepted this revision.Aug 18 2018, 3:59 PM

jmallett added inline comments.

sys/net/pfvar.h
1441 ↗	(On Diff #46876)	There are defines of all versioned ioctls (e.g. DIOCADDALTQ) to the appropriate version depending on whether the user-space program has defined PFIOC_USE_LATEST or not.

This revision is now accepted and ready to land.Aug 18 2018, 3:59 PM

pkelsey added inline comments.Aug 18 2018, 4:06 PM

sbin/pfctl/parse.y
1629 ↗	(On Diff #46876)	That's sensible. I agree my change here was too literal.

kp accepted this revision.Aug 18 2018, 4:07 PM

kp added inline comments.

sys/net/pfvar.h
1441 ↗	(On Diff #46876)	Ah yes, I missed them.

In D16782#357098, @kristof wrote:

It'd be really awesome if you could also see about writing some basic tests (both as pfctl parser tests, to check we can now set higher bitrates for shaping, and as a basic test that shaping actually happens), but that's certainly not blocking.

So far, I have tested on a machine with a 10G interface, using a single-queue HFSC config that I've configured to various rate limits up to about 9 Gbps, and observed the expected rate limiting on the receiver. I've also verified that an older pfctl binary sees 32-bit saturated bandwidths when they are at or above 2^32 (pfctl -s queue).

Is this what you mean, or do you mean publishing test configs of some sort, or adding some sort of automatic regression tests?

One thing that appears to be true is the heuristic in pfctl that sets the tbrsize does not work well at the higher interface rates, undersizing the bucket to the point where it interferes with throughput. Based on the way the tbr works, I think a default tbr size of something on the order of (expected_avg_pkt_size * nic_tx_ring_depth / 4) makes sense, and so far in testing, afict that resolves tbr undersizing issues without triggering oversizing issues.

pkelsey added inline comments.Aug 18 2018, 4:23 PM