
Fix O(n^2) behavior in sysctl
ClosedPublic

Authored by asomers on Sep 8 2022, 11:23 PM.

Details

Summary

Sysctl OIDs were internally stored in linked lists, triggering O(n^2)
behavior when userland iterates over many of them. The slowdown is
noticeable for MIBs that have > 100 children (for example, vm.uma), and
it becomes unignorable for kstat.zfs when a pool has > 1000 datasets.

Convert the linked lists into splay trees. This produces a ~25x speedup
for listing kstat.zfs with 4100 datasets, and no measurable penalty for
small dataset counts.

Sponsored by: Axcient
MFC after: 2 weeks

Test Plan

Manually tested with ztop, "sysctl kstat.zfs.testpool", and the ATF regression tests.

Diff Detail

Repository
rG FreeBSD src repository

Event Timeline

hselasky added inline comments.
sys/sys/sysctl.h
207

Please move the opening curly bracket to the next line.

This way of comparing integers leads to signed integer overflow.
Comparing integers for sorting must be done exactly like this:

if (a->oid_number > b->oid_number)
return (1);
else if (a->oid_number < b->oid_number)
return (-1);
else
return (0);
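To illustrate the overflow the reviewer is pointing at, here is a standalone sketch (the function names are illustrative, not from the patch): the common `a - b` comparator idiom invokes undefined behavior when the operands straddle the int range, while the branching form is always safe.

```c
#include <limits.h>

/*
 * Illustrative only: the subtraction idiom overflows for operands that
 * straddle the int range, e.g. INT_MAX - INT_MIN, which is undefined
 * behavior in C.
 */
static int
cmp_unsafe(int a, int b)
{
	return (a - b);		/* UB when a - b overflows */
}

/* The branching form never overflows, regardless of operand values. */
static int
cmp_safe(int a, int b)
{
	if (a > b)
		return (1);
	else if (a < b)
		return (-1);
	else
		return (0);
}
```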

Hi, thank you for the review. I like the lock update.

Could it impact <sysutils/sysctlinfo-kmod>? After this kernel update, is it still possible to compile and to use the sysctlinfo-kmod port and its "clients"? (deskutils/sysctlview, sysutils/nsysctl, audio/mixertui, sysutils/sysctlbyname-improved-kmod,...)

(I ask because unfortunately now I'm in a busy period and cannot test and give this review the attention it deserves.)

What's up with the lock conversion to sx and exclusive-locking everywhere? If you really need to hold the lock the entire time, you can use rms locks instead. If read-locking for normal use can't be achieved, then I'm afraid this patch is a non-starter -- there are *tons* of parallel sysctl calls in fork+exec heavy workloads (most notably for malloc), so this would degrade performance. (EDIT: now that I wrote it, splay rebalances itself also on lookup, so that's a non-starter here)

That said I remember I proposed RB trees originally and I don't mind other trees if that's the general route, but looking at the problem some more I think this is a pessimal approach.

There are 2 kinds of lookups to take care of: name2oid which finds the right numerical set and regular lookup when you are provided all numbers.

For the latter one can employ arrays instead. You would have 2 arrays at each level, one for static and one for dynamic entries. Statically defined oids would index directly into one array. Dynamic oids would start with -1, -2, -3 etc. and would be indexed into their own array after you abs() - 1 their id, so -1 would index at 0, -2 would index at 1 etc. Consequently this is O(1) lookup at each level as opposed to O(log n), with no degenerate cases to worry about.
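As a rough standalone sketch of the two-array scheme described above (all names are hypothetical; this is not code from the review):

```c
#include <stddef.h>

/* Minimal stand-in for a sysctl node, for illustration only. */
struct oid_node {
	int	oid_number;
};

/*
 * One level of the hierarchy: static oids index the first array
 * directly; dynamic oids (-1, -2, -3, ...) index the second array at
 * abs(id) - 1, so -1 maps to slot 0, -2 to slot 1, and so on.
 */
struct oid_level {
	struct oid_node	**static_oids;
	size_t		  nstatic;
	struct oid_node	**dynamic_oids;
	size_t		  ndynamic;
};

/* O(1) lookup at this level: no rebalancing, no degenerate cases. */
static struct oid_node *
oid_level_find(const struct oid_level *lvl, int id)
{
	size_t idx;

	if (id >= 0) {
		if ((size_t)id >= lvl->nstatic)
			return (NULL);
		return (lvl->static_oids[id]);
	}
	idx = (size_t)(-(id + 1));	/* -1 -> 0, -2 -> 1, ... */
	if (idx >= lvl->ndynamic)
		return (NULL);
	return (lvl->dynamic_oids[idx]);
}
```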

For name2oid one can consider still having a tree *or* a hash table. Partial problem with hashes here is that the last part of the "address" can be fully dynamic, so would need to be excluded, but there is no way to know that upfront except for custom heuristics. Splay would be fine here for the time being.

sys/kern/kern_sysctl.c
885

if this is the only spot where you need sleeping behavior then perhaps it can be moved up prior to locking

Welp.

I looked around, OpenBSD has everything hardcoded, like so:

switch (name[0]) {
case CTL_KERN:
        fn = kern_sysctl;
        break;
case CTL_HW:
        fn = hw_sysctl;
        break;

then in kern_sysctl:

switch (name[0]) {
case KERN_OSTYPE:
        return (sysctl_rdstring(oldp, oldlenp, newp, ostype));
case KERN_OSRELEASE:
        return (sysctl_rdstring(oldp, oldlenp, newp, osrelease));

and so on. So they don't do pointer chasing over linked lists, but they still have a branchfest. Apparently they don't have name2oid-equivalent.

That said, call it however you want, but I don't think something legitimate implemented slower than in OpenBSD is an option -- O(1) with arrays as described above would be faster than what they are doing and would probably not need changing for years to come.

In D36500#829077, @mjg wrote:

What's up with the lock conversion to sx and exclusive-locking everywhere? If you really need to hold the lock the entire time, you can use rms locks instead. If read-locking for normal use can't be achieved, then I'm afraid this patch is a non-starter -- there are *tons* of parallel sysctl calls in fork+exec heavy workloads (most notably for malloc), so this would degrade performance. (EDIT: now that I wrote it, splay rebalances itself also on lookup, so that's a non-starter here)

That's right, splay trees rebalance on lookup. And that's pretty handy for sysctl, because typical use will look up the exact same oid several times in a row. For example, sysctl security.jail.jailed searches for the OID 4 times, and each of those searches performs a SPLAY_FIND at three different levels of the hierarchy. All of that repetition is why I chose splay trees over RB trees.
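The access pattern above is exactly what the splay property exploits: every lookup rotates the found node to the root, so an immediately repeated lookup terminates at the root. A generic top-down splay step can be sketched as follows (self-contained and illustrative; this is not the sys/tree.h implementation the patch uses):

```c
#include <stddef.h>

struct node {
	int		 key;
	struct node	*left, *right;
};

/*
 * Classic top-down splay: search for 'key' and rotate the closest node
 * found to the root.  A repeated lookup of the same key then succeeds
 * at the root without descending the tree.
 */
static struct node *
splay(struct node *t, int key)
{
	struct node n = { 0, NULL, NULL }, *l, *r, *y;

	if (t == NULL)
		return (NULL);
	l = r = &n;
	for (;;) {
		if (key < t->key) {
			if (t->left == NULL)
				break;
			if (key < t->left->key) {
				/* Rotate right. */
				y = t->left;
				t->left = y->right;
				y->right = t;
				t = y;
				if (t->left == NULL)
					break;
			}
			/* Link right. */
			r->left = t;
			r = t;
			t = t->left;
		} else if (key > t->key) {
			if (t->right == NULL)
				break;
			if (key > t->right->key) {
				/* Rotate left. */
				y = t->right;
				t->right = y->left;
				y->left = t;
				t = y;
				if (t->right == NULL)
					break;
			}
			/* Link left. */
			l->right = t;
			l = t;
			t = t->right;
		} else
			break;
	}
	/* Reassemble the left and right subtrees around the new root. */
	l->right = t->left;
	r->left = t->right;
	t->left = n.right;
	t->right = n.left;
	return (t);
}
```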

That said I remember I proposed RB trees originally and I don't mind other trees if that's the general route, but looking at the problem some more I think this is a pessimal approach.

There are 2 kinds of lookups to take care of: name2oid which finds the right numerical set and regular lookup when you are provided all numbers.

For the latter one can employ arrays instead. You would have 2 arrays at each level, one for static and one for dynamic entries. Statically defined oids would index directly into one array. Dynamic oids would start with -1, -2, -3 etc. and would be indexed into their own array after you abs() - 1 their id, so -1 would index at 0, -2 would index at 1 etc. Consequently this is O(1) lookup at each level as opposed to O(log n), with no degenerate cases to worry about.

I don't think arrays are a practical solution, because the key range is too large. oid_number can be as high as INT_MAX, so the arrays would have to be dynamically allocated, and they may have to resize on insertion.

For name2oid one can consider still having a tree *or* a hash table. Partial problem with hashes here is that the last part of the "address" can be fully dynamic, so would need to be excluded, but there is no way to know that upfront except for custom heuristics. Splay would be fine here for the time being.

sys/kern/kern_sysctl.c
885

It isn't. For example, there's a malloc in sysctl_ctx_entry_add, which is entered with the lock already held. I think that custom sysctl handlers can also sleep, too.

In D36500#829077, @mjg wrote:

What's up with the lock conversion to sx and exclusive-locking everywhere? If you really need to hold the lock the entire time, you can use rms locks instead. If read-locking for normal use can't be achieved, then I'm afraid this patch is a non-starter -- there are *tons* of parallel sysctl calls in fork+exec heavy workloads (most notably for malloc), so this would degrade performance. (EDIT: now that I wrote it, splay rebalances itself also on lookup, so that's a non-starter here)

That's right, splay trees rebalance on lookup. And that's pretty handy for sysctl, because typical use will look up the exact same oid several times in a row. For example, sysctl security.jail.jailed searches for the OID 4 times, and each of those searches performs a SPLAY_FIND at three different levels of the hierarchy. All of that repetition is why I chose splay trees over RB trees.

That said I remember I proposed RB trees originally and I don't mind other trees if that's the general route, but looking at the problem some more I think this is a pessimal approach.

There are 2 kinds of lookups to take care of: name2oid which finds the right numerical set and regular lookup when you are provided all numbers.

For the latter one can employ arrays instead. You would have 2 arrays at each level, one for static and one for dynamic entries. Statically defined oids would index directly into one array. Dynamic oids would start with -1, -2, -3 etc. and would be indexed into their own array after you abs() - 1 their id, so -1 would index at 0, -2 would index at 1 etc. Consequently this is O(1) lookup at each level as opposed to O(log n), with no degenerate cases to worry about.

I don't think arrays are a practical solution, because the key range is too large. oid_number can be as high as INT_MAX, so the arrays would have to be dynamically allocated, and they may have to resize on insertion.

It can't, I noted this is already clamped:

/*
 * The starting number for dynamically-assigned entries.  WARNING!
 * ALL static sysctl entries should have numbers LESS than this!
 */
#define CTL_AUTO_START  0x100

So the worst case for static entries is not particularly bad. Resizing on insertion may still be sensible to do, but it won't affect lookups.

Also note with arrays providing O(1) there is no problem with what is used how frequently and there is no need to change anything on lookup.

For name2oid one can consider still having a tree *or* a hash table. Partial problem with hashes here is that the last part of the "address" can be fully dynamic, so would need to be excluded, but there is no way to know that upfront except for custom heuristics. Splay would be fine here for the time being.

I can't stress enough that there are tons of sysctl calls happening at the same time, making exclusive-locking for regular use a non-starter.

  • tweak cmp_splay_oid as suggested by @hselasky

Hi, thank you for the review. I like the lock update.

Could it impact <sysutils/sysctlinfo-kmod>? After this kernel update, is it still possible to compile and to use the sysctlinfo-kmod port and its "clients"? (deskutils/sysctlview, sysutils/nsysctl, audio/mixertui, sysutils/sysctlbyname-improved-kmod,...)

(I ask because unfortunately now I'm in a busy period and cannot test and give this review the attention it deserves.)

Yes, it certainly will impact that port. I can bump __FreeBSD_version for this.

I don't think arrays are a practical solution, because the key range is too large. oid_number can be as high as INT_MAX, so the arrays would have to be dynamically allocated, and they may have to resize on insertion.

It can't, I noted this is already clamped:

/*
 * The starting number for dynamically-assigned entries.  WARNING!
 * ALL static sysctl entries should have numbers LESS than this!
 */
#define CTL_AUTO_START  0x100

So the worst case for static entries is not particularly bad. Resizing on insertion may still be sensible to do, but it won't affect lookups.

That's only for static entries, which are uncommon. Dynamic entries are much more common, and can range anywhere from 0x100 to 0x7fffffff.

Also note with arrays providing O(1) there is no problem with what is used how frequently and there is no need to change anything on lookup.

For name2oid one can consider still having a tree *or* a hash table. Partial problem with hashes here is that the last part of the "address" can be fully dynamic, so would need to be excluded, but there is no way to know that upfront except for custom heuristics. Splay would be fine here for the time being.

I can't stress enough that there are tons of sysctl calls happening at the same time, making exclusive-locking for regular use a non-starter.

Well yes that's true. But there are also tons of repeated lookups of the same entry over and over, which is what splay trees excel at. So using splay trees means that some threads will get blocked sometimes, but total runtime will be less.

I don't think arrays are a practical solution, because the key range is too large. oid_number can be as high as INT_MAX, so the arrays would have to be dynamically allocated, and they may have to resize on insertion.

It can't, I noted this is already clamped:

/*
 * The starting number for dynamically-assigned entries.  WARNING!
 * ALL static sysctl entries should have numbers LESS than this!
 */
#define CTL_AUTO_START  0x100

So the worst case for static entries is not particularly bad. Resizing on insertion may still be sensible to do, but it won't affect lookups.

That's only for static entries, which are uncommon. Dynamic entries are much more common, and can range anywhere from 0x100 to 0x7fffffff.

This is where they start, but with the scheme I proposed they will not use much memory; in fact I expect it to use less than trees. One will have to check what the typical sizes for subtrees are; the array can start at say 16 and keep doubling until it reaches -- say -- 512, past which it increases by 128 or whatever other value as entries keep being added. This only runs into a problem where tons of entries got added, plus one more, and then everything preceding got removed -- then the array remains big with nothing to occupy the space. I don't think that's a serious problem, and if need be it can be mitigated by a series of arrays, each of which only maps to a range of ids.
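The growth policy sketched in that paragraph (double from 16 up to a cap, then grow by a fixed step) could look roughly like this; the function name and the exact constants are illustrative:

```c
#include <stddef.h>

/*
 * Hypothetical resize policy for the dynamic-oid array: geometric
 * growth while small, linear growth past a cap, so a level with tons
 * of entries does not double a huge allocation for one more slot.
 */
static size_t
next_capacity(size_t cur)
{
	if (cur == 0)
		return (16);
	if (cur < 512)
		return (cur * 2);
	return (cur + 128);
}
```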

Also note with arrays providing O(1) there is no problem with what is used how frequently and there is no need to change anything on lookup.

For name2oid one can consider still having a tree *or* a hash table. Partial problem with hashes here is that the last part of the "address" can be fully dynamic, so would need to be excluded, but there is no way to know that upfront except for custom heuristics. Splay would be fine here for the time being.

I can't stress enough that there are tons of sysctl calls happening at the same time, making exclusive-locking for regular use a non-starter.

Well yes that's true. But there are also tons of repeated lookups of the same entry over and over, which is what splay trees excel at. So using splay trees means that some threads will get blocked sometimes, but total runtime will be less.

On the contrary: the locking plus dirtying of entries will force cross-CPU communication, which will be especially bad on multisocket systems.

Splay appeared to be quite inconvenient generally. For instance, it was removed from the buffer objects; it is still in vm_map.c and this causes some issues.

Why can't you use e.g. a red-black tree instead? Because the RB tree is used by DMAR, our implementation got some further optimizations recently.

I looked at the sysctl use at image startup, and there are only two uses left, I believe. One is the vm.overcommit check from jemalloc, and another is kern.usrstack from libthr startup for threaded images. Both are quite easy to eliminate, if needed. This is, of course, not an argument for a specific data structure choice for the sysctl oid container implementation.

I don't know about vm.overcommit -- there were talks about replacing malloc and new one perhaps will not even look at this.

  • Switch to rb trees instead of splay trees
  • Bump __FreeBSD_version due to the KPI change
sys/kern/kern_sysctl.c
1330

name2oid is called *a lot* as well. *another* tree which sorts by name would be in order here.

sys/kern/kern_sysctl.c
1330

Somewhat surprisingly, dtrace doesn't show any hits for name2oid or sysctl_sysctl_name during sysctl kstat.zfs. So I don't think it's worth optimizing it, at least not at this point.

sys/kern/kern_sysctl.c
1330

everyone doing sysctbyname runs into it.

So I just ran pkg build. pkg-static spams sysctlbyname security.jail.jailed for example, there is also hw.availpages from getconf _SC_PHYS_PAGES (used by sort), although the last one should probably get converted into a static entry.

sys/kern/kern_sysctl.c
1330

Maybe it got inlined then. Still, a regression analysis of the time to run sysctl kstat.zfs.testpool shows perfectly linear behavior up to 8000 file systems. Since name2oid is obviously quadratic, it can't be getting called very often. Maybe it only gets called for the initial kstat.zfs.testpool lookup but not during the subsequent iteration over all file systems.

sys/kern/kern_sysctl.c
1330

it is called by everything doing sysctlbyname and I gave you examples above of actual users

sys/kern/kern_sysctl.c
1330

Oh sure. I know that lots of programs call sysctlbyname. I'm just saying that none of them, as far as I know, call it over and over for the same parent mib. I verified that both ztop and sysctl kstat.zfs only do it once each. That's why it doesn't lead to any O(n^2) problems. And as a result, I consider it premature optimization to worry overmuch about name2oid.

@mjg are you ok with the current version of this review, using RB trees? I don't want to make the other changes because:

  • Using sorted arrays with dynamic reallocation would be a lot more work than the RB trees.
  • Adding a second tree to speed up name2oid would be possible, but it really doesn't depend on anything else in this change. It could be done independently.
  • AFAIK name2oid's current performance is good enough for current users. It doesn't blow up quadratically like sysctl_sysctl_next_action does.

sorry, forgot to respond

name2oid is a problem because of how frequently it is used, but the patch does not make it slower and improves the other case, so I think it is overall an improvement

This revision is now accepted and ready to land. Sep 26 2022, 9:31 PM

You might ask @dougm to take a look at this.

This revision was automatically updated to reflect the committed changes.
sys/kern/kern_sysctl.c
496

In the old code, the incremented 'oid_number' gets tested in another iteration of the SLIST_FOREACH loop. Here, the incremented 'oid_number' is assumed to be valid.

I think you need to 'goto retry' here too. Or, better, make a loop.

801–802

RB_FOREACH_SAFE(p, sysctl_oid_list, &oidp->oid_children, tmp)
applies here.