
sysctl memlock: add a knob to disable and make interruptible
Abandoned (Public)

Authored by rlibby on Apr 1 2021, 10:28 PM.

Details

Reviewers
jhb
mjg
Summary

Some systems, in particular those without swap, do not benefit from
limiting wiring of user pages for sysctl requests, and instead only
experience contention. Add a knob so that those systems may disable the
limit.

Also, make the lock request interruptible.

Test Plan

While toggling the new sysctl kern.sysctl_memlock a few times:

jot 0 | xargs -P $(sysctl -n hw.ncpu) -I %% sysctl -ax > /dev/null

and

jot 0 | xargs -P $(sysctl -n hw.ncpu) -I %% sysctl -x kern.malloc_stats > /dev/null

and confirmed with lockstat whether the lock was being taken.

Diff Detail

Repository
rS FreeBSD src repository - subversion

Event Timeline

I disagree with the patch.

The current mechanism is crap and should probably be retired, anyone interested in ramping up kernel memory use has plenty of other ways to do it.

The limit is also weirdly small -- 4 times the page size is not particularly high.

That said, if the limit is to be kept, I think the following steps should be performed:

  1. Bump the threshold higher than that; it may happen to clear the contention for whatever sysctl you found problematic.
  2. Replace the lock with an atomic counter: introduce a multi-megabyte upper limit and atomic_fetchadd into a global. It can use sleepq to block and wait for wakeups when over the limit.
  3. Should the kernel get a facility for scalable inexact counting, the above will be ready to be converted to use it.
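A userland sketch of step 2 may help make the idea concrete. The cap value and all names here are illustrative, and C11 atomics plus a pthread mutex/condvar stand in for the kernel's atomic_fetchadd and sleepq; this is not proposed code, just the shape of the scheme:

```c
#include <pthread.h>
#include <stdatomic.h>

#define MEMLOCK_LIMIT (8L * 1024 * 1024)   /* hypothetical multi-MB cap */

static _Atomic long memlock_bytes;          /* global wired-byte count */
static pthread_mutex_t memlock_mtx = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  memlock_cv  = PTHREAD_COND_INITIALIZER;

/*
 * Reserve "req" bytes, sleeping while the cap is exceeded.  The fast
 * path is a single atomic add with no lock taken; the slow path waits
 * on the condvar (sleepq in the kernel version).  Concurrent fast-path
 * callers can transiently overshoot the cap; the limit is inexact by
 * design, as in the proposal above.
 */
static void
memlock_acquire(long req)
{
	long old;

	old = atomic_fetch_add(&memlock_bytes, req);
	if (old + req <= MEMLOCK_LIMIT)
		return;				/* fast path */
	/* Over the cap: undo the add and wait for room. */
	atomic_fetch_sub(&memlock_bytes, req);
	pthread_mutex_lock(&memlock_mtx);
	while (atomic_load(&memlock_bytes) + req > MEMLOCK_LIMIT)
		pthread_cond_wait(&memlock_cv, &memlock_mtx);
	atomic_fetch_add(&memlock_bytes, req);
	pthread_mutex_unlock(&memlock_mtx);
}

/* Release "req" bytes and wake any waiters. */
static void
memlock_release(long req)
{
	atomic_fetch_sub(&memlock_bytes, req);
	pthread_mutex_lock(&memlock_mtx);
	pthread_cond_broadcast(&memlock_cv);
	pthread_mutex_unlock(&memlock_mtx);
}
```

Uncontended requests under the cap never touch the mutex, which is the point of the proposal: the common case becomes a single atomic op instead of a shared lock.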

Huh. The sysctl at hand produces almost 15MB of data on my test box, that's unsustainable as it is.

For this sucker, I think the code should be reworked to stop pre-wiring all that memory and instead output this in chunks.
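The chunking idea can be sketched in isolation. This is a userland illustration, not kernel code: emit() stands in for a SYSCTL_OUT-style copy-out, the chunk size is arbitrary, and all names are made up for the example:

```c
#include <stddef.h>

#define CHUNK 4096		/* bound on memory staged at once */

typedef int (*emit_fn)(const char *buf, size_t len, void *arg);

/*
 * Push "len" bytes of "src" through emit() a bounded chunk at a time,
 * instead of staging (and wiring) the whole report in one large buffer.
 * Stops on the first error from the sink.
 */
static int
emit_chunked(const char *src, size_t len, emit_fn emit, void *arg)
{
	size_t n;
	int error;

	while (len > 0) {
		n = len < CHUNK ? len : CHUNK;
		error = emit(src, n, arg);
		if (error != 0)
			return (error);
		src += n;
		len -= n;
	}
	return (0);
}

/* Example sink: just counts chunks and total bytes delivered. */
static size_t total_bytes, total_chunks;

static int
count_emit(const char *buf, size_t len, void *arg)
{
	(void)buf;
	(void)arg;
	total_bytes += len;
	total_chunks++;
	return (0);
}
```

With this shape, peak memory held per request is bounded by CHUNK regardless of how large the sysctl's total output grows, which is what makes a 15MB report sustainable.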

In D29542#662307, @mjg wrote:

I disagree with the patch.

The current mechanism is crap and should probably be retired, anyone interested in ramping up kernel memory use has plenty of other ways to do it.

I don't disagree that the current mechanism is outdated.

The limit is also weirdly small -- 4 times the page size is not particularly high.

Well, I read the ~4 pages (formerly ~1 page) as an amount of wiring we assume we can afford across all threads simultaneously. So, always freely granting a small number of pages makes sense to me, even with a better wiring limit mechanism.

That said, if the limit is to be kept, I think the following steps should be performed:

  1. Bump the threshold higher than that; it may happen to clear the contention for whatever sysctl you found problematic.

We ran into this with net.inet.tcp.pcblist, but we have other sysctls (out of tree) with large output; kern.malloc_stats is an in-tree example. For systems without swap, there's no threshold that really makes sense -- they simply don't care about the limit.

  2. Replace the lock with an atomic counter: introduce a multi-megabyte upper limit and atomic_fetchadd into a global. It can use sleepq to block and wait for wakeups when over the limit.
  3. Should the kernel get a facility for scalable inexact counting, the above will be ready to be converted to use it.

I have a plan for this, but it's not ready yet. In short, it is to extract the semaphore that uma uses (zone_alloc_limit) and make it generally available.

However, even then, we won't care for the limit on a system without swap.

We can certainly keep our change private if you're against this.

Will work on providing a better semaphore first.