
Update pfctl(8) tbrsize heuristic for high bandwidth interfaces
ClosedPublic

Authored by pkelsey on Aug 22 2018, 8:59 PM.

Details

Summary

The tbrsize heuristic in pfctl(8), used to set tbrsize when no size is given in the config file, produces poor rate regulation at high interface speeds (empirically, somewhere north of a few Gbps on current equipment). This becomes a potential issue for systems post-r338209, which lifted the 32-bit bandwidth limit.

This change adds a new, larger value that is applied for interface speeds above 2.5 Gbps. 2.5 Gbps is the highest standard rate that could be used prior to r338209, so the default behavior for all existing systems should remain the same.

The value of 128 was chosen as a balance between two concerns, across a reasonable range of average packet sizes: giving the interface driver enough work to do per dequeue loop, while keeping small the likelihood that a greedy driver will dequeue more than it can hand to its hardware (and thus drop the remainder internally). This assumes a typical minimum hardware ring size of 1024 for 10 Gbps+ interfaces.

Test Plan

Tested up to 25 Gbps.

Diff Detail

Repository
rS FreeBSD src repository
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

pkelsey created this revision. Aug 22 2018, 8:59 PM
jmallett accepted this revision. Aug 22 2018, 9:01 PM

This seems straightforward, reasonable, and sufficient. Longer-term, I'd wonder about some sort of arithmetic approach here rather than this kind of hand-scaling?

This revision is now accepted and ready to land. Aug 22 2018, 9:01 PM
pkelsey added a comment (edited). Aug 23 2018, 3:33 AM

> This seems straightforward, reasonable, and sufficient. Longer-term, I'd wonder about some sort of arithmetic approach here rather than this kind of hand-scaling?

Well, this is a sort of arithmetic approach, just incorporating some uncertainties that I'm not sure we can improve on very easily.

I *think* the issue with poor regulation at higher speeds is that there is some dead time created by the interaction between the driver and the TBR when the bucket drains too quickly (possibly too many token credits being computed over small intervals, with attendant quantization effects), which effectively lowers the transmit rate on an uneven basis. From this perspective, there is pressure to not let the bucket be drained 'too fast'. Without a lot of lab work, I don't think we can come up with a good threshold for 'too fast'. The current notion of 'this is not too fast' is based on empirical results from rate limiting a collection of TCP streams at various bandwidths up to about 9 Gbps on a 10 Gbps NIC, and at various bandwidths up to about 25 Gbps on a 40 Gbps NIC.

On the other end, if we make the bucket too big, then when starting with a full bucket and burst-backlogged queues after an idle period, the driver will be able to dequeue more packets in a burst than it can fit in its hardware ring, dropping the overage internally (and unnecessarily). To avoid that, the bucket needs to be sized such that the bucket depth divided by the average packet size through such a burst dequeue is less than the number of slots in the hardware ring. Even though with some work (less now that iflib is around) we could learn the hardware ring size (and then assume it is stationary), I don't think 'average packet size over a burst dequeue' can in general be considered a stationary parameter, so I don't think we can come up with a formula that dials this in much better. Assuming a minimum hardware ring depth of 1024 for the fast interfaces, setting the bucket depth at 128*mtu means average packet size can drop to about 1/8 mtu before we run the risk of dropping packets due to burst-after-idle, which seems to be (hand-wave) a pretty good default range w.r.t. this issue.

> To avoid that, the bucket needs to be sized such that the bucket depth divided by the average packet size through such a burst dequeue is less than the number of slots in the hardware ring. Even though we could with some work (less now that iflib is around) know the hardware ring size (and then assume it is stationary), I don't think in general 'average packet size over a burst dequeue' can be considered to be a stationary parameter, so I don't think we can come up with a formula that dials this in much better.

Thanks, that's a useful conceptualization. I guess with the old if_start based approach we knew the queue length that the driver supported at least, although even there altq had ownership of that queue, right? It seems like we could have something in ifnet which conveys how many packets a driver can consume theoretically and could derive from that with some amount of over or undercommit depending on whether one wants smoothness or optimal utilization (the purpose of a TBR being more the latter, I guess.) The math you've laid out certainly makes sense given the lack of available metrics from which to derive a number in a less chunky way.

kristof accepted this revision. Aug 23 2018, 9:16 AM
This revision was automatically updated to reflect the committed changes.