Page MenuHomeFreeBSD

Move to a more conservative alloation scheme for devctl messages
ClosedPublic

Authored by imp on Sep 15 2020, 11:12 PM.
Tags
None
Referenced Files
Unknown Object (File)
Fri, Jan 10, 10:00 PM
Unknown Object (File)
Tue, Dec 24, 6:47 PM
Unknown Object (File)
Sun, Dec 15, 1:11 AM
Unknown Object (File)
Dec 8 2024, 6:37 AM
Unknown Object (File)
Dec 5 2024, 11:37 PM
Unknown Object (File)
Dec 3 2024, 9:00 PM
Unknown Object (File)
Dec 2 2024, 11:37 AM
Unknown Object (File)
Nov 24 2024, 5:00 PM
Subscribers
None

Details

Summary

Change the zone setup:

  • Allow slabs to be returned to the OS
  • Set the number of slots to the max queued
  • Reserve 2% of the max
  • Disable per-cpu caching

Change the alloation strategiy a bit:

  • If a normal allocation fails, try to get the reserve
  • If a reserve allocation fails, re-use the oldest-queued entry for storage
  • If there's a weird race/failure and nothing on the queue to steal, return NULL

This addresses two main issues in the old code:

  • If devd had died, and we're generating a lot of messages, we have an unbounded leak. This new scheme avoids the issue that lead to this.
  • The MPASS that was 'sure' the allocation couldn't have failed turned out to be wrong in some rare cases. The new code doesn't make this assumption.

Since we reserve only 2% of the space, we go from about 1MB of
allocation all the time to more like 50kB for the reserve.

Test Plan

Boot a system and create/destroy a lot of MD devices and check to see if there's a link
Once we're ready, have pho run stress2

Diff Detail

Lint
Lint Passed
Unit
No Test Coverage
Build Status
Buildable 33583
Build 30834: arc lint + arc unit

Event Timeline

imp requested review of this revision.Sep 15 2020, 11:12 PM
imp created this revision.
imp added reviewers: markj, mjg, kib.
sys/kern/subr_bus.c
440

I need to fix this comment, just noticed...

Fix comment and silly lack of assignment. Tiny style tweaks too.

sys/kern/subr_bus.c
612

Question for markj: I concluded I needed to do this two-step when reading the manual. Do I need to?

sys/kern/subr_bus.c
453

I think you also want to call uma_zone_set_maxcache(z, 0). This ensures that items do not get cached in UMA. Various factors, mainly per-CPU caching and NUMA, can result in a situation where there are free items in a zone that are transiently inaccessible. In this patch we are doing all allocations under a global mutex, so disabling caches should not harm scalability.

612

Yes, I think so. The steps look like this:

  1. Try to allocate an item from the zone's caches.
  2. If M_USE_RESERVE is specified, try allocating an item from the keg's reserve.
  3. Try to allocate a page from which new items can be allocated.
  4. Give up.

So with your change we try to allocate a new page before dipping into the reserves. Basically, the intent is that M_USE_RESERVE allows you to make N non-blocking allocations before you have to start freeing items to replenish the reserve.

615

How do you know that this list is non-empty? In pathological cases multiple threads might have allocated a number of event structures and been preempted before adding them to the queue.

sys/kern/subr_bus.c
453

Great, will do.

612

Yea, that's the intent: if we can get memory, we should use it. If we're tight, we want some small amount of fallback to act as a shock absorber so at least some of the messages get through. The reserve is my best guess at the number we'd have to queue before devd could get scheduled to run to process them, times some big factor. devd might get a bit clogged if it gets a huge burst, so I may need to do some work there to keep it from blocking until the action is done, but that's a separate issue.

615

Good point. We should likely just return NULL in that case like we do when things are disabled. It's the least bad thing we can do w/o blocking.

In practice, though, we'd have to have extreme memory pressure, with > 20 events happening all at the same time on other cores that take a while to fill in. So it would be extremely unlikely to ever happen. Failing safe with returning NULL would be ideal, since it would also cause the generators of the near future to do almost no work (today they do a lot of work and throw it away)

sys/kern/subr_bus.c
615

Yes, I think it should be quite difficult to trigger. However, for testing purposes it's useful to be able to inject failures into malloc()/uma_zalloc(), so we should avoid introducing new cases where such failures can trigger a panic.

Fold in feedback from MarkJ

imp marked 3 inline comments as done.Sep 16 2020, 7:27 PM

Another round of comment cleanup to make things more descriptive and less verbose.

Given this already has support for reusing items from a global list, why uma allocation in the first place? Most notably why per-CPU caches. The current code does not scale regardless and is expensive to use, but with expected low msg rates it should be fine for the time being.

In D26448#588627, @mjg wrote:

Given this already has support for reusing items from a global list, why uma allocation in the first place? Most notably why per-CPU caches. The current code does not scale regardless and is expensive to use, but with expected low msg rates it should be fine for the time being.

I use uma because it deals with the 'feast or famine' nature of this zone. On startup, I need a surge of a few thousand. Most of the rest of the time, the messages come in on time scales measured in hours or days. There are some cases when complex devices arrive or depart where another surge of messages would be generated. Coping with all this would mean duplicating a lot of the uma machinery.

True, I'm not using per-cpu caches, nor trying to optimize the cache behavior. I'm also doing copyout to userland for processing of these events, which is also kinda slow. These refinements make things more robust (my primary goal) and also faster (nice when that happens).

Should message rates increase several orders of magnitude, we can revisit and optimize then, but we'd likely need to look at more than just the allocation practices.

This revision is now accepted and ready to land.Sep 17 2020, 5:03 PM