
Move to a more conservative allocation scheme for devctl messages
ClosedPublic

Authored by imp on Tue, Sep 15, 11:12 PM.

Details

Summary

Change the zone setup:

  • Allow slabs to be returned to the OS
  • Set the number of slots to the max queued
  • Reserve 2% of the max
  • Disable per-cpu caching
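In kernel terms, the setup above might look roughly like this (a hedged sketch built from the public UMA API, not the actual diff; the zone name, the `devctl_queue_length` variable, and the `/ 50` divisor are taken from the summary, not from the patch itself):

```c
z = uma_zcreate("DEVCTL", sizeof(struct dev_event_info),
    NULL, NULL, NULL, NULL, UMA_ALIGN_PTR, 0);
/* No UMA_ZONE_NOFREE, so idle slabs can be returned to the OS. */
uma_zone_set_max(z, devctl_queue_length);	/* slots == max queued */
uma_zone_reserve(z, devctl_queue_length / 50);	/* hold back 2% */
uma_zone_set_maxcache(z, 0);			/* no per-CPU caching */
```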

Change the allocation strategy a bit:

  • If a normal allocation fails, try to get the reserve
  • If a reserve allocation fails, re-use the oldest-queued entry for storage
  • If there's a weird race/failure and nothing on the queue to steal, return NULL
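The fallback chain above can be sketched as a userland model (illustrative only: the fixed-size pools and names here stand in for the UMA zone, which the kernel drives via uma_zalloc() with and without M_USE_RESERVE under a global mutex):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative sizes; the kernel sizes these from the queue limit. */
#define	HEAP	8
#define	RESERVE	2

struct event { struct event *next; };

static struct event heap_pool[HEAP], reserve_pool[RESERVE];
static int heap_used, reserve_used;
static bool normal_alloc_ok = true;	/* flip to model memory pressure */
static struct event *queue_head, *queue_tail;

static void
enqueue(struct event *ev)
{
	ev->next = NULL;
	if (queue_tail != NULL)
		queue_tail->next = ev;
	else
		queue_head = ev;
	queue_tail = ev;
}

static struct event *
alloc_event(void)
{
	struct event *ev;

	/* 1. Normal allocation. */
	if (normal_alloc_ok && heap_used < HEAP)
		return (&heap_pool[heap_used++]);
	/* 2. Normal allocation failed: try the reserve. */
	if (reserve_used < RESERVE)
		return (&reserve_pool[reserve_used++]);
	/* 3. Reserve exhausted: re-use the oldest-queued entry. */
	if (queue_head != NULL) {
		ev = queue_head;
		queue_head = ev->next;
		if (queue_head == NULL)
			queue_tail = NULL;
		return (ev);
	}
	/* 4. Weird race/failure, nothing to steal: fail safe. */
	return (NULL);
}
```

The last step is the pathological case discussed in the inline comments: under extreme pressure with an empty queue, the allocator returns NULL instead of panicking.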

This addresses two main issues in the old code:

  • If devd has died and we're generating a lot of messages, we have an unbounded leak. The new scheme avoids the issue that led to this.
  • The MPASS that was 'sure' the allocation couldn't have failed turned out to be wrong in some rare cases. The new code doesn't make this assumption.

Since we reserve only 2% of the space, we go from about 1MB of allocation all the time to more like 50kB for the reserve.

Test Plan

Boot a system, create/destroy a lot of MD devices, and check for leaks.
Once we're ready, have pho run stress2.

Diff Detail

Repository
rS FreeBSD src repository

Event Timeline

imp requested review of this revision. Tue, Sep 15, 11:12 PM
imp created this revision.
imp edited the test plan for this revision. (Show Details) Tue, Sep 15, 11:13 PM
imp added reviewers: markj, mjg, kib.
imp added inline comments. Tue, Sep 15, 11:16 PM
sys/kern/subr_bus.c
440 ↗(On Diff #77084)

I need to fix this comment, just noticed...

imp updated this revision to Diff 77086. Tue, Sep 15, 11:21 PM

Fix comment and silly lack of assignment. Tiny style tweaks too.

imp added inline comments. Wed, Sep 16, 1:45 AM
sys/kern/subr_bus.c
620 ↗(On Diff #77086)

Question for markj: reading the manual led me to conclude I needed this two-step. Do I actually need it?

markj added inline comments. Wed, Sep 16, 5:47 PM
sys/kern/subr_bus.c
453 ↗(On Diff #77086)

I think you also want to call uma_zone_set_maxcache(z, 0). This ensures that items do not get cached in UMA. Various factors, mainly per-CPU caching and NUMA, can result in a situation where there are free items in a zone that are transiently inaccessible. In this patch we are doing all allocations under a global mutex, so disabling caches should not harm scalability.

620 ↗(On Diff #77086)

Yes, I think so. The steps look like this:

  1. Try to allocate an item from the zone's caches.
  2. If M_USE_RESERVE is specified, try allocating an item from the keg's reserve.
  3. Try to allocate a page from which new items can be allocated.
  4. Give up.

So with your change we try to allocate a new page before dipping into the reserves. Basically, the intent is that M_USE_RESERVE allows you to make N non-blocking allocations before you have to start freeing items to replenish the reserve.
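A toy userland model of the order markj describes, plus the two-step call the patch makes (the counters and item-per-page count are invented for illustration; the real steps happen inside uma_zalloc()):

```c
#include <assert.h>
#include <stdbool.h>

/* Invented counters standing in for zone caches, keg reserve, VM pages. */
static int cache_items, reserve_items, free_pages;
#define	ITEMS_PER_PAGE	4

enum src { FROM_CACHE, FROM_RESERVE, FROM_PAGE, FAIL };

/* markj's four steps for a single non-blocking uma_zalloc() call. */
static enum src
zalloc(bool use_reserve)
{
	/* 1. Try to allocate an item from the zone's caches. */
	if (cache_items > 0) {
		cache_items--;
		return (FROM_CACHE);
	}
	/* 2. With M_USE_RESERVE, try the keg's reserve. */
	if (use_reserve && reserve_items > 0) {
		reserve_items--;
		return (FROM_RESERVE);
	}
	/* 3. Try to allocate a page of fresh items. */
	if (free_pages > 0) {
		free_pages--;
		cache_items += ITEMS_PER_PAGE - 1;	/* leftovers cached */
		return (FROM_PAGE);
	}
	/* 4. Give up. */
	return (FAIL);
}

/* The two-step: a fresh page is tried before dipping into the reserve. */
static enum src
devctl_alloc(void)
{
	enum src s;

	s = zalloc(false);
	if (s == FAIL)
		s = zalloc(true);
	return (s);
}
```

This shows the intent stated above: the reserve is only consumed once both the caches and fresh page allocations are exhausted, so M_USE_RESERVE buys N guaranteed non-blocking allocations before items must be freed back.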

623 ↗(On Diff #77086)

How do you know that this list is non-empty? In pathological cases multiple threads might have allocated a number of event structures and been preempted before adding them to the queue.

imp added inline comments. Wed, Sep 16, 6:04 PM
sys/kern/subr_bus.c
453 ↗(On Diff #77086)

Great, will do.

620 ↗(On Diff #77086)

Yeah, that's the intent: if we can get memory, we should use it. If we're tight, we want some small amount of fallback to act as a shock absorber so at least some of the messages get through. The reserve is my best guess at the number we'd have to queue before devd could get scheduled to run to process them, times some big factor. devd might get a bit clogged if it gets a huge burst, so I may need to do some work there to keep it from blocking until the action is done, but that's a separate issue.

623 ↗(On Diff #77086)

Good point. We should likely just return NULL in that case, like we do when things are disabled. It's the least bad thing we can do without blocking.

In practice, though, we'd have to have extreme memory pressure, with > 20 events happening all at the same time on other cores that take a while to fill in, so it would be extremely unlikely to ever happen. Failing safe by returning NULL would be ideal, since it would also cause the generators of the near future to do almost no work (today they do a lot of work and throw it away).

markj added inline comments. Wed, Sep 16, 7:00 PM
sys/kern/subr_bus.c
623 ↗(On Diff #77086)

Yes, I think it should be quite difficult to trigger. However, for testing purposes it's useful to be able to inject failures into malloc()/uma_zalloc(), so we should avoid introducing new cases where such failures can trigger a panic.

imp updated this revision to Diff 77113. Wed, Sep 16, 7:26 PM

Fold in feedback from MarkJ

imp marked 3 inline comments as done. Wed, Sep 16, 7:27 PM
imp updated this revision to Diff 77115. Wed, Sep 16, 8:25 PM

Update some comments.

imp updated this revision to Diff 77116. Wed, Sep 16, 8:29 PM

Another round of comment cleanup to make things more descriptive and less verbose.

imp edited the summary of this revision. (Show Details) Wed, Sep 16, 8:32 PM
mjg added a comment. Wed, Sep 16, 8:35 PM

Given this already has support for reusing items from a global list, why uma allocation in the first place? Most notably why per-CPU caches. The current code does not scale regardless and is expensive to use, but with expected low msg rates it should be fine for the time being.

imp added a comment. Wed, Sep 16, 8:50 PM
In D26448#588627, @mjg wrote:

Given this already has support for reusing items from a global list, why uma allocation in the first place? Most notably why per-CPU caches. The current code does not scale regardless and is expensive to use, but with expected low msg rates it should be fine for the time being.

I use UMA because it deals with the 'feast or famine' nature of this zone. On startup, I need a surge of a few thousand. Most of the rest of the time, the messages come in on time scales measured in hours or days. There are some cases when complex devices arrive or depart where another surge of messages would be generated. Coping with all this would mean duplicating a lot of the UMA machinery.

True, I'm not using per-cpu caches, nor trying to optimize the cache behavior. I'm also doing copyout to userland for processing of these events, which is also kinda slow. These refinements make things more robust (my primary goal) and also faster (nice when that happens).

Should message rates increase several orders of magnitude, we can revisit and optimize then, but we'd likely need to look at more than just the allocation practices.

markj accepted this revision. Thu, Sep 17, 5:03 PM
This revision is now accepted and ready to land. Thu, Sep 17, 5:03 PM