pf: support paging in the DIOCGETSTATESNV call
AbandonedPublic
Actions

Authored by kp on Jun 29 2021, 7:38 PM.

Details

Reviewers

Group Reviewers

network
pfsense

Summary

Retrieving all states can require an inordinate amount of memory. Each
nvlist represented state is slightly less than 2KB of data, and it's not
unusual for machines to have tens or hundreds of thousands of states.

Optionally allow userspace to specify what rows it wants to retrieve
(by start / end row) to limit the memory requirements, at the expense of
having to make multiple calls.

MFC after: 1 week
Sponsored by: Rubicon Communications, LLC ("Netgate")

Diff Detail

Repository

rS FreeBSD src repository - subversion

Lint

Lint Passed

Unit

No Test Coverage

Build Status

Buildable 40306
Build 37195: arc lint + arc unit

Event Timeline

kp created this revision.Jun 29 2021, 7:38 PM

Herald added subscribers: melifaro, farrokhi, imp. · View Herald TranscriptJun 29 2021, 7:38 PM

kp requested review of this revision.Jun 29 2021, 7:38 PM

Harbormaster completed remote builds in B40174: Diff 91524.Jun 29 2021, 7:38 PM

kp added a parent revision: D30946: pf: getstates: avoid taking the hashrow lock if the row is empty.Jun 29 2021, 7:38 PM

kp added a child revision: D30948: libpfctl: read states in multiple calls.

I'm kind of responding to the entire patchset here.

While I'm not fond of nvlist, if it is to be used, imho it can be improved.

Hardcoding the size (even if as a macro) goes against using an extendable format.

I propose the following (names are picked without much thought): add DIOCGETSTATESNVPREP which returns struct pf_getstates { int count; size_t state_size; }. state_size is the size per-element as exported through nvlist

This in particular allows to stop hardcoding the 2KB size.

Since userspace has to alloc all the memory anyway, you can extend the count by say 20% (or even double it if not going past a threshold). Then you can call the kernel and have it fill out as much as it sees fit and return, giving you a magic token which you can use to continue the iteration (which would normally be a row number). Then it is up to the kernel to make the most of what userspace is offering all while limiting its own memory use as it sees fit. This should probably be a new ioctl.

Finally, iteration can check for pending signals on relocking, to make it killable regardless of how long it takes. I don't remember the routine to do it and it escapes my quick grep, can find it later.

In D30947#696642, @mjg wrote:

Since userspace has to alloc all the memory anyway, you can extend the count by say 20% (or even double it if not going past a threshold). Then you can call the kernel and have it fill out as much as it sees fit and return, giving you a magic token which you can use to continue the iteration (which would normally be a row number). Then it is up to the kernel to make the most of what userspace is offering all while limiting its own memory use as it sees fit. This should probably be a new ioctl.

That's a very good idea.

In D30947#696642, @mjg wrote:

While I'm not fond of nvlist, if it is to be used, imho it can be improved.

I'd prefer to stick with it here. Ideally I'd like all of pf's calls to use nvlists so we can ease coping with ABI extensions.
This is going to be the most difficult case, because of the amount of data that can be returned. Other calls are easier, because they're (much) smaller.
If that turns out not to be possible, we'll of course have to find an alternative solution.

The immediate reason for moving the an nvlist-version of the get states call is that I added 'orig_ifname' as part of upstreaming a pfsense patch. Doing that with the old-style ioctl (i.e. copy a struct) can't really be done without ABI breakage. We'd have to introduce a _V2 call (and then a _V3 for the next extension and ...)

Hardcoding the size (even if as a macro) goes against using an extendable format.

I propose the following (names are picked without much thought): add DIOCGETSTATESNVPREP which returns struct pf_getstates { int count; size_t state_size; }. state_size is the size per-element as exported through nvlist

This in particular allows to stop hardcoding the 2KB size.

It's a hint rather than an actual size. Its primary purpose is to enable userspace to make better guesses about how much memory it'll have to allocate to get the requested data. That helps reduce the time required because we don't end up making a call, being told there's not enough space and then making the same call again.
Having the kernel provide that guess dynamically would be better, in that we wouldn't have to rebuild the userspace component if that size changes, but that seems like a trivial distinction.

There's a case to be made for DIOCGETSTATESNVPREP (or maybe DIOCGETSTATESSTATS or something), because knowing the number of states will also help to improve the estimate of how much memory we want to allocate. It could also contain the number of hash rows, because that'll let us remove a sysctl call.
Note that the size is only an estimate, because nvlist representations of a state are not of constant size.

Since userspace has to alloc all the memory anyway, you can extend the count by say 20% (or even double it if not going past a threshold). Then you can call the kernel and have it fill out as much as it sees fit and return, giving you a magic token which you can use to continue the iteration (which would normally be a row number). Then it is up to the kernel to make the most of what userspace is offering all while limiting its own memory use as it sees fit. This should probably be a new ioctl.

I'm not sure I see how that's different from what we do here. We guesstimate the required space and the kernel gives us as many rows as it can. (We return full hash rows at a time, because there's no straightforward way of ensuring that we won't return the same state multiple times if we return part of a hashrow. The row could gain or lose states between calls after all.)

Finally, iteration can check for pending signals on relocking, to make it killable regardless of how long it takes. I don't remember the routine to do it and it escapes my quick grep, can find it later.

That seems like a good idea, both for DIOCGETSTATESNV and for DIOCGETSTATES. SIGPENDING(curthread) should work.

In D30947#696946, @kp wrote:

I'm not sure I see how that's different from what we do here. We guesstimate the required space and the kernel gives us as many rows as it can. (We return full hash rows at a time, because there's no straightforward way of ensuring that we won't return the same state multiple times if we return part of a hashrow. The row could gain or lose states between calls after all.)

Ah, perhaps I do understand now. We can do DIOCGETSTATESNVPREP to obtain row count, state size estimate and state count, then allocate an arbitrary amount of memory (say count * size, topped out at 100MB or 10% of system memory, whichever is lower), ask the kernel for all states. The kernel can then give us as many states (or at least rows) as will fit in the allocated size and tell us how far we got, so we can start from there on the next call.

It's similar to what we do now, but with the advantage of getting everything done in one call if there are very few states (whereas with this patch series we always do at least 10 calls).

kp edited parent revisions, added: D31042: libpfctl: Implement pfctl_get_states_stats(); removed: D30946: pf: getstates: avoid taking the hashrow lock if the row is empty.Jul 5 2021, 10:10 AM

Address review remarks.

Harbormaster completed remote builds in B40297: Diff 91780.Jul 5 2021, 11:50 AM

This hard codes the 2048 estimate added in another review. Should be a macro instead.

Constant for estimated nvlist state size.

Harbormaster completed remote builds in B40306: Diff 91794.Jul 5 2021, 1:28 PM

mjg accepted this revision.Jul 5 2021, 2:10 PM

This revision is now accepted and ready to land.Jul 5 2021, 2:10 PM

While this improves the present situation it's still unacceptably slow, so I'm going to try a version without nvlist.