Page MenuHomeFreeBSD

Use malloc_domainset(9) instead of malloc_domain(M_WAITOK).
ClosedPublic

Authored by markj on Oct 4 2018, 8:00 PM.

Details

Summary

This fixes some hangs reported by Peter, caused by
malloc_domain(M_WAITOK) blocking indefinitely. In
particular, with this change we will fall back to other domains
if the requested domain is depleted. Since the allocations
modified in this review occur during device/subsystem
initialization, I don't believe this will have much effect in
general.

See D17417 and D17418 for some background.

Diff Detail

Lint
Lint OK
Unit
No Unit Test Coverage
Build Status
Buildable 20500
Build 19930: arc lint + arc unit

Event Timeline

markj created this revision.Oct 4 2018, 8:00 PM
markj edited the summary of this revision. (Show Details)
markj added a reviewer: mmacy.
emaste added a subscriber: emaste.Oct 4 2018, 8:18 PM
kib accepted this revision.Oct 5 2018, 7:56 PM
This revision is now accepted and ready to land.Oct 5 2018, 7:56 PM
mmacy accepted this revision.Oct 5 2018, 7:58 PM

As I pointed out on IRC, this problem is not specific to hwpmc. Nonetheless it does fix the issue here.

jeff added a comment.Oct 10 2018, 11:41 PM

I would prefer to make this change by changing the semantics of malloc_domain() to mean prefer.

If people really want a single domain we can create another policy array that means 'only this domain'. I doubt it will see much if any use however.

markj added a comment.Oct 11 2018, 1:39 AM

As I pointed out on IRC, this problem is not specific to hwpmc. Nonetheless it does fix the issue here.

Sure, busdma is modified in this review too. This kind of pattern is present in pcpu_page_alloc(), but that function doesn't get called with M_WAITOK so it doesn't pose the same problem as these instances. I'm not aware of any other potential issues.

In D17419#373675, @jeff wrote:

I would prefer to make this change by changing the semantics of malloc_domain() to mean prefer.
If people really want a single domain we can create another policy array that means 'only this domain'. I doubt it will see much if any use however.

Just to be clear: based on this comment and the one in D17418 you're suggesting keeping malloc_domainset() but having malloc_domain() be a wrapper that just selects DSET_PREF(domain)? I'm not opposed to that, but note that I did not convert malloc_domain(M_NOWAIT) callers (all of which are in busdma). I think it's probably fine to have them fall back to other domains though. That said, with your suggested change malloc_domain() would behave differently from kmem_malloc_domain(), uma_zalloc_domain() and vm_page_alloc_domain(), which is not ideal.

jeff added a comment.Oct 11 2018, 2:22 AM

As I pointed out on IRC, this problem is not specific to hwpmc. Nonetheless it does fix the issue here.

Sure, busdma is modified in this review too. This kind of pattern is present in pcpu_page_alloc(), but that function doesn't get called with M_WAITOK so it doesn't pose the same problem as these instances. I'm not aware of any other potential issues.

In D17419#373675, @jeff wrote:

I would prefer to make this change by changing the semantics of malloc_domain() to mean prefer.
If people really want a single domain we can create another policy array that means 'only this domain'. I doubt it will see much if any use however.

Just to be clear: based on this comment and the one in D17418 you're suggesting keeping malloc_domainset() but having malloc_domain() be a wrapper that just selects DSET_PREF(domain)? I'm not opposed to that, but note that I did not convert malloc_domain(M_NOWAIT) callers (all of which are in busdma). I think it's probably fine to have them fall back to other domains though. That said, with your suggested change malloc_domain() would behave differently from kmem_malloc_domain(), uma_zalloc_domain() and vm_page_alloc_domain(), which is not ideal.

Yes, or we could just change malloc_domain() to take a set argument instead of a domain int. I think it was probably a mistake to make APIs that strictly require a specific domain. All of these things could take a policy instead. With your policy iterator changes this starts to look much nicer anyway.

markj added a comment.Oct 11 2018, 2:33 AM
In D17419#373675, @jeff wrote:

I would prefer to make this change by changing the semantics of malloc_domain() to mean prefer.
If people really want a single domain we can create another policy array that means 'only this domain'. I doubt it will see much if any use however.

Just to be clear: based on this comment and the one in D17418 you're suggesting keeping malloc_domainset() but having malloc_domain() be a wrapper that just selects DSET_PREF(domain)? I'm not opposed to that, but note that I did not convert malloc_domain(M_NOWAIT) callers (all of which are in busdma). I think it's probably fine to have them fall back to other domains though. That said, with your suggested change malloc_domain() would behave differently from kmem_malloc_domain(), uma_zalloc_domain() and vm_page_alloc_domain(), which is not ideal.

Yes, or we could just change malloc_domain() to take a set argument instead of a domain int. I think it was probably a mistake to make APIs that strictly require a specific domain. All of these things could take a policy instead. With your policy iterator changes this starts to look much nicer anyway.

Hrm, ok. That will be a lot of churn, hopefully not too much for 12.0. In all of these cases we still want an internal API which really does require a specific domain.

Stepping back for a second, are we going to get to a point where we can make malloc() and uma_zalloc() use the current thread's domainset policy, rather than one derived from the kernel object?

I was doing some domain affinity "hygiene" in our tree last week. Just for the 2 drivers I cared about (mlx5, nvme) the amount of churn was fairly massive to grab the domain using bus_get_domain(), change all metadata mallocs to malloc_domain() and frees to free_domain(). I was thinking that the "cleanest" way to do this might be to alter device_attach() to simply save the current thread's domain affinity, set the domain affinity to the one derived from bus_get_domain(), call the attach routine, and then restore the domainset. Then, if malloc/uma_zalloc can use the current thread's domainset, all of the driver's allocations will automatically be correct by default with no churn. This would also memory allocated in a process context to be "correct", if the process has expressed a preference.

markj added a comment.Oct 11 2018, 3:38 PM

Stepping back for a second, are we going to get to a point where we can make malloc() and uma_zalloc() use the current thread's domainset policy, rather than one derived from the kernel object?

Just for clarity: malloc() and uma_zalloc() don't use the kernel object's policy: UMA uses a round-robin policy in keg_fetch_slab(), and kmem_malloc() actually also uses a round-robin policy (see vm_domainset_iter_malloc_init(), which converts INTERLEAVE to ROUNDROBIN). In HEAD UMA implements the policy by hand; with D17420 it uses vm_domainset. In particular, the UMA/kmem_* domain selection policy is currently hard-coded.

I was doing some domain affinity "hygiene" in our tree last week. Just for the 2 drivers I cared about (mlx5, nvme) the amount of churn was fairly massive to grab the domain using bus_get_domain(), change all metadata mallocs to malloc_domain() and frees to free_domain(). I was thinking that the "cleanest" way to do this might be to alter device_attach() to simply save the current thread's domain affinity, set the domain affinity to the one derived from bus_get_domain(), call the attach routine, and then restore the domainset. Then, if malloc/uma_zalloc can use the current thread's domainset, all of the driver's allocations will automatically be correct by default with no churn. This would also memory allocated in a process context to be "correct", if the process has expressed a preference.

While writing this set of patches I tried several times to add a KPI which would allow a thread to override its domain selection policy and later restore it. I ended up not needing it, and of course there are some cases where that's not sufficient (because the VM object also specifies a policy, and object policy comes first). It sounds like we might want to go into that direction though.

Before thinking about that we'd need UMA and malloc() to support domain selection without hurting the fast path too much. I think Jeff had some ideas around this for UMA, but I don't know any details. One possible approach to your problem would be to have uma_zalloc() and malloc() operate in don't-care mode by default, i.e., the fast path we have today, and for cases where you want to be explicit, you set a domain policy in curthread (distinct from td_domain). uma_zalloc() and malloc() could check for such a policy and call uma_zalloc_domain() and malloc_domain() if one is specified. During device attach you would set this policy to DOMAINSET_PREF(bus_get_domain()), and unset it upon completion.

I'm not sure how bad the churn is, but I could see an approach like the above being especially helpful when you also have to deal with portability layers like the LinuxKPI.

markj abandoned this revision.Oct 16 2018, 4:10 PM

This is no longer needed given the changes in D17418.

markj updated this revision to Diff 49760.Oct 29 2018, 4:55 PM

Rework yet again after some more discussion with Jeff.

This diff returns to the original approach of adding a malloc_domainset(9).
However, this time, instead of having both malloc_domain() and
malloc_domainset(), only the latter is externally visible. I added a
new set of global policies, DOMAINSET_FIXED(), which allow one to specify
that the allocation must come from the specified domain.

The change also goes a bit further than the original in that it adds
contigmalloc_domainset(), kmem_malloc_domainset(), kmem_alloc_attr_domainset()
and kmem_alloc_contig_domainset(). As with malloc_domainset(), the _domain()
variants of all of these functions are now static.

UMA and the page allocator still expose _domain() KPIs. These are harder to
hide, so I won't touch them in this pass. I updated the man pages for
malloc/contigmalloc and domainset.

jeff accepted this revision.Oct 30 2018, 11:40 PM

Thank you for taking the time to iterate and get this right.

This revision is now accepted and ready to land.Oct 30 2018, 11:40 PM
markj closed this revision.Nov 1 2018, 2:06 PM