Differential D19131

Decrease the time the kernel takes to install a new PF config with a large number of queues
ClosedPublic
Actions

Authored by pkelsey on Feb 8 2019, 8:02 PM.

Details

Reviewers

Commits

rS344028: MFC r343995:
rS343995: Reduce the time it takes the kernel to install a new PF config containing a…

Summary

The general flavor of this patch is reducing O(n^2) behavior when loading a PF config with a large number of queues to O(n), where n is the number of queues. There are additional costs not fitting that model that have been reduced as well.

In a test VM running on a Xeon E5-1650 v3 @ 3.5GHz, the amount of time the kernel holds onto the traffic-blocking PF ruleset write lock during new configuration commit dropped about 10x, from ~65ms to ~7ms with a configuration containing about 4k queues (and the new hash sizes set to 8192).

Specifically:

Split the active and inactive queue lists into interface and non-interface queue lists. The number of queues representing interfaces is generally small, O(10), and the total number of queues can be large O(10^4). This change prevents all operations that need to be applied only to interface queues from hunting for them in a haystack of other queues.
With the split of the queue lists into interface/non-interface lists, the queue delete loops have been rewritten. In the old formulation, when queues representing interfaces were removed, their qids would be unref'd. However, such queues don't have qids assigned, so the unref operation, while harmless, would result in an exhaustive walk of the qid list each time. In the new formulation, this case is properly a no-op.
altq_add() and the per-discipline add() routines now have an ifp parameter, as the caller in every case has already looked it up and passing it in avoids each of those routines having to look it up again.
HFSC queue destroy used to involve a linear scan of its class table (which has to be big enough for one entry for each queue, so potentially O(10^4) or more), on each queue (class) destroy in order to reclaim the assigned slot. This scan is now avoided by recording the class table slot id in the class itself.
Tag management for queue and rule tags has been converted from list based to hash based. All struct tagnames in a tag set are now hashed by both tag name and by tag value (that is, two indexes are maintained).
- - Finding an existing tag value for a given tag name no longer results in a scan of the list of all members of the tag set - now it's a hash bucket walk.
  - Finding an existing struct tagname for a given tag value no longer results in a scan of the list of all members of the tag set - now it's a hash bucket walk.
  - Assigning a new tag value for a given tag name is now done using a bitset of available tag values instead of walking the list of all members of the tag set looking for a gap. This involves many fewer operations and is much more cache friendly for large tag sets.
- struct tagname is now allocated from a uma zone instead of via malloc.
- There are two new tunables to control the hash size used for each tag set (default for each is 128):
  - net.pf.queue_tag_hashsize
  - net.pf.rule_tag_hashsize