
Move to a more conservative allocation scheme for devctl messages
ClosedPublic

Authored by imp on Tue, Sep 15, 11:12 PM.

Details

Summary

Change the zone setup:

  • Allow slabs to be returned to the OS
  • Set the number of slots to the max queued
  • Reserve 2% of the max
  • Disable per-cpu caching
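In kernel terms, the setup above might look roughly like this (a hedged sketch built from the public UMA API, not the actual diff; the zone name, the `devctl_queue_length` variable, and the `/ 50` divisor are taken from the summary, not from the patch itself):

```c
z = uma_zcreate("DEVCTL", sizeof(struct dev_event_info),
    NULL, NULL, NULL, NULL, UMA_ALIGN_PTR, 0);
/* No UMA_ZONE_NOFREE, so idle slabs can be returned to the OS. */
uma_zone_set_max(z, devctl_queue_length);	/* slots == max queued */
uma_zone_reserve(z, devctl_queue_length / 50);	/* hold back 2% */
uma_zone_set_maxcache(z, 0);			/* no per-CPU caching */
```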

Change the allocation strategy a bit:

  • If a normal allocation fails, try to get the reserve
  • If a reserve allocation fails, re-use the oldest-queued entry for storage
  • If there's a weird race/failure and nothing on the queue to steal, return NULL
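The fallback chain above can be sketched as a userland model (illustrative only: the fixed-size pools and names here stand in for the UMA zone, which the kernel drives via uma_zalloc() with and without M_USE_RESERVE under a global mutex):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative sizes; the kernel sizes these from the queue limit. */
#define	HEAP	8
#define	RESERVE	2

struct event { struct event *next; };

static struct event heap_pool[HEAP], reserve_pool[RESERVE];
static int heap_used, reserve_used;
static bool normal_alloc_ok = true;	/* flip to model memory pressure */
static struct event *queue_head, *queue_tail;

static void
enqueue(struct event *ev)
{
	ev->next = NULL;
	if (queue_tail != NULL)
		queue_tail->next = ev;
	else
		queue_head = ev;
	queue_tail = ev;
}

static struct event *
alloc_event(void)
{
	struct event *ev;

	/* 1. Normal allocation. */
	if (normal_alloc_ok && heap_used < HEAP)
		return (&heap_pool[heap_used++]);
	/* 2. Normal allocation failed: try the reserve. */
	if (reserve_used < RESERVE)
		return (&reserve_pool[reserve_used++]);
	/* 3. Reserve exhausted: re-use the oldest-queued entry. */
	if (queue_head != NULL) {
		ev = queue_head;
		queue_head = ev->next;
		if (queue_head == NULL)
			queue_tail = NULL;
		return (ev);
	}
	/* 4. Weird race/failure, nothing to steal: fail safe. */
	return (NULL);
}
```

The last step is the pathological case discussed in the inline comments: under extreme pressure with an empty queue, the allocator returns NULL instead of panicking.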

This addresses two main issues in the old code:

  • If devd has died and we're generating a lot of messages, we have an unbounded leak. The new scheme avoids the issue that led to this.
  • The MPASS that was 'sure' the allocation couldn't have failed turned out to be wrong in some rare cases. The new code doesn't make this assumption.

Since we reserve only 2% of the space, we go from about 1MB of allocation all the time to more like 50kB for the reserve.

Test Plan

Boot a system, create/destroy a lot of MD devices, and check for leaks.
Once we're ready, have pho run stress2.

Diff Detail

Repository
rS FreeBSD src repository

Event Timeline

imp requested review of this revision. Tue, Sep 15, 11:12 PM
imp created this revision.
imp edited the test plan for this revision. (Show Details) Tue, Sep 15, 11:13 PM
imp added reviewers: markj, mjg, kib.
imp added inline comments. Tue, Sep 15, 11:16 PM
sys/kern/subr_bus.c
440 ↗(On Diff #77084)

I need to fix this comment, just noticed...

imp updated this revision to Diff 77086. Tue, Sep 15, 11:21 PM

Fix comment and silly lack of assignment. Tiny style tweaks too.

imp added inline comments. Wed, Sep 16, 1:45 AM
sys/kern/subr_bus.c
620 ↗(On Diff #77086)

Question for markj: reading the manual led me to conclude I needed this two-step. Do I actually need it?

markj added inline comments. Wed, Sep 16, 5:47 PM
sys/kern/subr_bus.c
453 ↗(On Diff #77086)

I think you also want to call uma_zone_set_maxcache(z, 0). This ensures that items do not get cached in UMA. Various factors, mainly per-CPU caching and NUMA, can result in a situation where there are free items in a zone that are transiently inaccessible. In this patch we are doing all allocations under a global mutex, so disabling caches should not harm scalability.

620 ↗(On Diff #77086)

Yes, I think so. The steps look like this:

  1. Try to allocate an item from the zone's caches.
  2. If M_USE_RESERVE is specified, try allocating an item from the keg's reserve.
  3. Try to allocate a page from which new items can be allocated.
  4. Give up.

So with your change we try to allocate a new page before dipping into the reserves. Basically, the intent is that M_USE_RESERVE allows you to make N non-blocking allocations before you have to start freeing items to replenish the reserve.
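A toy userland model of the order markj describes, plus the two-step call the patch makes (the counters and item-per-page count are invented for illustration; the real steps happen inside uma_zalloc()):

```c
#include <assert.h>
#include <stdbool.h>

/* Invented counters standing in for zone caches, keg reserve, VM pages. */
static int cache_items, reserve_items, free_pages;
#define	ITEMS_PER_PAGE	4

enum src { FROM_CACHE, FROM_RESERVE, FROM_PAGE, FAIL };

/* markj's four steps for a single non-blocking uma_zalloc() call. */
static enum src
zalloc(bool use_reserve)
{
	/* 1. Try to allocate an item from the zone's caches. */
	if (cache_items > 0) {
		cache_items--;
		return (FROM_CACHE);
	}
	/* 2. With M_USE_RESERVE, try the keg's reserve. */
	if (use_reserve && reserve_items > 0) {
		reserve_items--;
		return (FROM_RESERVE);
	}
	/* 3. Try to allocate a page of fresh items. */
	if (free_pages > 0) {
		free_pages--;
		cache_items += ITEMS_PER_PAGE - 1;	/* leftovers cached */
		return (FROM_PAGE);
	}
	/* 4. Give up. */
	return (FAIL);
}

/* The two-step: a fresh page is tried before dipping into the reserve. */
static enum src
devctl_alloc(void)
{
	enum src s;

	s = zalloc(false);
	if (s == FAIL)
		s = zalloc(true);
	return (s);
}
```

This shows the intent stated above: the reserve is only consumed once both the caches and fresh page allocations are exhausted, so M_USE_RESERVE buys N guaranteed non-blocking allocations before items must be freed back.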

623 ↗(On Diff #77086)

How do you know that this list is non-empty? In pathological cases multiple threads might have allocated a number of event structures and been preempted before adding them to the queue.

imp added inline comments. Wed, Sep 16, 6:04 PM
sys/kern/subr_bus.c
453 ↗(On Diff #77086)

Great, will do.

620 ↗(On Diff #77086)

Yeah, that's the intent: if we can get memory, we should use it. If we're tight, we want some small amount of fallback to act as a shock absorber so at least some of the messages get through. The reserve is my best guess at the number we'd have to queue before devd could get scheduled to run to process them, times some big factor. devd might get a bit clogged if it gets a huge burst, so I may need to do some work there to keep it from blocking until the action is done, but that's a separate issue.

623 ↗(On Diff #77086)

Good point. We should likely just return NULL in that case, like we do when things are disabled. It's the least bad thing we can do without blocking.

In practice, though, we'd have to have extreme memory pressure, with > 20 events happening all at the same time on other cores that take a while to fill in, so it would be extremely unlikely to ever happen. Failing safe by returning NULL would be ideal, since it would also cause the generators of the near future to do almost no work (today they do a lot of work and throw it away).

markj added inline comments. Wed, Sep 16, 7:00 PM
sys/kern/subr_bus.c
623 ↗(On Diff #77086)

Yes, I think it should be quite difficult to trigger. However, for testing purposes it's useful to be able to inject failures into malloc()/uma_zalloc(), so we should avoid introducing new cases where such failures can trigger a panic.

imp updated this revision to Diff 77113. Wed, Sep 16, 7:26 PM

Fold in feedback from MarkJ

imp marked 3 inline comments as done. Wed, Sep 16, 7:27 PM
imp updated this revision to Diff 77115. Wed, Sep 16, 8:25 PM

Update some comments.

imp updated this revision to Diff 77116. Wed, Sep 16, 8:29 PM

Another round of comment cleanup to make things more descriptive and less verbose.

imp edited the summary of this revision. (Show Details) Wed, Sep 16, 8:32 PM
mjg added a comment. Wed, Sep 16, 8:35 PM

Given this already has support for reusing items from a global list, why uma allocation in the first place? Most notably why per-CPU caches. The current code does not scale regardless and is expensive to use, but with expected low msg rates it should be fine for the time being.

imp added a comment. Wed, Sep 16, 8:50 PM
In D26448#588627, @mjg wrote:

Given this already has support for reusing items from a global list, why uma allocation in the first place? Most notably why per-CPU caches. The current code does not scale regardless and is expensive to use, but with expected low msg rates it should be fine for the time being.

I use UMA because it deals with the 'feast or famine' nature of this zone. On startup, I need a surge of a few thousand. Most of the rest of the time, the messages come in on time scales measured in hours or days. There are some cases when complex devices arrive or depart where another surge of messages would be generated. Coping with all this would mean duplicating a lot of the UMA machinery.

True, I'm not using per-cpu caches, nor trying to optimize the cache behavior. I'm also doing copyout to userland for processing of these events, which is also kinda slow. These refinements make things more robust (my primary goal) and also faster (nice when that happens).

Should message rates increase several orders of magnitude, we can revisit and optimize then, but we'd likely need to look at more than just the allocation practices.

markj accepted this revision. Thu, Sep 17, 5:03 PM
This revision is now accepted and ready to land. Thu, Sep 17, 5:03 PM