I'd like to get these fixes into 12.0. The current behavior is goofy and limited, and this is something that is being used in production.
It is currently not possible to use pfctl(8) to configure queue bandwidth parameters greater than 2^32 - 1 bps (~4.29 Gbps). While such bandwidths can be represented in pf.conf, they will be silently wrapped at the 32-bit boundary when passed to the kernel due to limitations in the pf(4) ioctl interface. Further, the altq(4) token bucket regulator and scheduler implementations are in general written under the assumption of maximum bandwidth parameters of 1 Gbps or less.
This patch:
- Updates the the pf(4) ioctl interface to use 64-bit bandwidth parameters instead of 32-bit, and to increase the token bucket regulator bucket depth parameters from 16-bits to 32-bits. This is done in a backwards compatible way, as explained below.
- Replaces the 16-bit token bucket regulator bucket depth limit in the pfctl parser with a 32-bit one.
- Updates the token bucket regulator computations in the kernel to function properly at higher bandwidths. The fixed point scale has been adjusted. After the various tradeoffs were made, there's enough headroom to operate without integer overflow up to about 137 Gbps and still have at least three significant decimal digits in the scaled rate parameter down to about 1 kbps.
- Fixes a bug in the value of tbr_filluptime in the extreme case of a scaled rate of zero.
- Removes tbr_get() as it isn't used, and the math in there has always overflowed for interface rates greater than 0.5 bps. Not worth fixing.
- Updates the HFSC scheduler to function properly at higher bandwidths. The fixed point scale of the scaled inverse-m parameters had to be adjusted. Now the computational machinery will function without overflow and also maintain at least 3 significant decimal digits with service curve bandwidth parameters up to 100 Gbps and service curve time intervals up to about 88 seconds. The silly/useless chart in the parameter scaling comment has been removed and the comment clarified.
- Separates the data type used to pass queue descriptions through the ioctl interface from the data type used to maintain queue state in the kernel.
Backwards compatibility:
The pf(4) ioctl interface has been extended by versioning the data structures that needed to change, and for each ioctl that now uses a versioned data structure, creating a distinct ioctl command code for each version (differentiated by the encoded parameter length). Where an ioctl command has multiple versions of its data structure, the kernel can identify and import any version supplied by userland, and can likewise export to any version supplied by userland. In general, version 0 of a data structure is identified by testing the parameter length encoded in the ioctl cmd, and where there might be more than two revisions to a data structure over time, or there exists some other ambiguity in version detection by size, version 1 adds an explicit version field that gets set by the caller to whatever the latest version was when the calling binary was built.
The struct tags for versioned data structures are the original tag with a _vX (_v0, _v1, etc) appended to them. In almost all cases (currently one exception), the original struct tag name becomes the name of a macro. In the kernel, the macro always resolves to the latest corresponding versioned struct tag - this prevents tons of rename noise in diffs when a new data structure revision is added. In userland, by default, the macros always resolve to version 0. This allows out-of-tree sources that use the pf(4) ioctl interface (e.g., pfstat and pftop in ports) to continue to build and work as before without modification. However, if PFIOC_USE_LATEST is defined before any of the pf and altq headers are included, the macros will resolve to the latest versions. This is used for in-tree consumers, which should always be kept up to date with interface changes, and can be used by out-of-tree consumers when updating to the latest interface versions.
Likewise, in userland, for ioctl command codes that become versioned, the original ioctl command code name becomes a macro that by default resolves to version 0, and if PFIOC_USE_LATEST is defined ahead of the includes, resolves to the latest version.
Other:
- I plan on updating the pf(4) man page separately.
- Note that only HFSC proper has been updated to handle higher bandwidths. If it is used in conjunction with RED/RIO or CODEL, there still be dragons in those. Future work in other schedulers can use the same interface extension techniques employed so far.
- If a pfctl binary built against the version 0 (current) interface is run on a kernel with this patch and it used to configure queues having bandwidth parameters, those values will still wrap as before.
- If a pfctl binary built against the interface in this patch is run on a kernel also built with this patch, and it used to configure HFSC queues with bandwidth parameters greater than 2^32 - 1 bps, then an older binary is used to show the status of those queues, the older binaries will show those parameters as saturated at 2^32 - 1 instead of as wrapped values.
- If a pfctl binary built against the interface in this patch is run on a kernel also built with this patch, and it used to configure non-HFSC queues with bandwidth parameters greater than 2^32 - 1 bps, then pfctl will saturate those parameters at 2^32 - 1 bps prior to passing them to the kernel. That is, you can't use a new pfctl to feed greater-than-32-bit bandwidth parameters to schedulers that have not yet been updated to handle them. This isn't exactly a full protection against things going awry, as I think some of the other schedulers will have overflow issues before 2^32 - 1 bps is reached.
- The bsnmpd altq MIB only has a 32-bit bandwidth parameter in its definition. When the patched one is run against a kernel with this patch, or an older one is run against a kernel with this patch that has been configured by a patched pfctl, it will report a saturated value for bandwidths greater than 32 bits.