Page MenuHomeFreeBSD

thread: numa-aware zombie reaping
ClosedPublic

Authored by mjg on Nov 12 2020, 12:36 AM.

Details

Summary

The current global list is a significant problem, in particular induces a lot of cross-domain thread frees. When running poudriere on a 2 domain box about half of all frees were of that nature.

Patch below introduces per-domain thread data containing zombie lists and domain-aware reaping. By default it only reaps from the current domain, only reaping from others if there is free TID shortage or nothing there reaped lingering threads.

Tested on pig1 (4 domains) with tinderbox and flix1 (2 domains) with poudriere, cross frees are basically eliminated.

Basic numa-awareness will enable other optimizations.

Test Plan

survived several hours of poudriere -j 104. almost no cross frees for threads.

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

mjg requested review of this revision.Nov 12 2020, 12:36 AM
sys/vm/uma.h
304 ↗(On Diff #79448)

This will be separate.

Trying to re-create the needed functionality locally resulted in a lot of headers which needed to be pulled in.

sys/kern/kern_thread.c
137 ↗(On Diff #79448)

static

189 ↗(On Diff #79448)

Can we avoid duplicating these five lines of code?

606 ↗(On Diff #79448)

1000 should instead be an expression that scales with hz.

If the intent is to only reap once a second, I suggest rewriting this as (u_int)(ticks - lticks) >= hz.

624 ↗(On Diff #79448)

Ditto.

sys/vm/uma_core.c
3268 ↗(On Diff #79448)

The reason I suggested to open-code this is because this is not enforceable.

sys/kern/kern_thread.c
189 ↗(On Diff #79448)

i don't see a handy way to do it and I don't think matters

606 ↗(On Diff #79448)

it was smaller, but will dedup to the above later.

sys/vm/uma_core.c
3268 ↗(On Diff #79448)

In that case this can be a general vm helper (vtodomain?). the main point is to not have to pull in half of the vm headers to get the result.

sys/kern/kern_thread.c
189 ↗(On Diff #79448)

Define _thread_count_inc() or so which does it and returns the result, in thread_count_inc(), use the result to decide whether to reap and try again.

Now that I wrote this I think it would be better to create per-domain thread reapers. The code would retain reaping in all the places it is doing now, remote reaping would still be a thing in case of resource shortage, but the ticks business would disappear -- all threads would be guaranteed to get gc'ed at some point.

sys/kern/kern_thread.c
189 ↗(On Diff #79448)

I was thinking thread_count_inc_try

  • add per-domain workers
  • dedup nthread update

I'm looking for suggestion for helper name and placement for vaddr -> domain translation.

I can't help but notice kthread_add is highly deficient in that it does not allow to specify a cpuset. This will have to be fixed, but is not a blocker.

Also this spawns threads even for empty domains, I don't think it's a problem.

Note that useless thread consumes 12K of stack alone. That said, I also dislike over-zealous alignment specifications that eat memory on small systems.

sys/kern/kern_thread.c
141 ↗(On Diff #79470)

I do not like this name. It implies too much for casual reader, while it only means that struct td storage is from specific domain. Perhaps 'struct_thread_domain' would be better.

Or remove the function at all, it seems to be single use.

498 ↗(On Diff #79470)

If you pass &thread_domain_data[i] instead of i, this gets rid of cast gymnastic and one more line in thread_reap_worker.

sys/kern/kern_thread.c
625 ↗(On Diff #79470)

I think this can cause shutdown to hang by up to 5s per domain.

Instead of dedicated threads, you could use a self-arming callout to poll the per-domain queues and schedule a taskqueue thread to drain them.

  • replace threads with callout + taskqueue
mjg marked 8 inline comments as done.Nov 12 2020, 7:42 PM
mjg added inline comments.
sys/kern/kern_thread.c
141 ↗(On Diff #79470)

renamed; there will be more uses

mjg marked an inline comment as done.Nov 12 2020, 7:43 PM

I can allocate the array dynamically based on vm_ndomains, but making the struct smaller is rather problematic but can be done with the machine is single core.

What is needed here is a per-domain allocator similar in use to DPCPU, but as I undertand nothing of the sort exists right now.

  • eliminate struct_thread_domain as markj agreed to just go with uma_item_domain
In D27185#607150, @mjg wrote:

I can allocate the array dynamically based on vm_ndomains, but making the struct smaller is rather problematic but can be done with the machine is single core.

What is needed here is a per-domain allocator similar in use to DPCPU, but as I undertand nothing of the sort exists right now.

I mean, unconditional use of __aligned(CACHE_LINE_SIZE) is detrimental for small/resource constrained arches and configurations. It should be some arch-dependent define, consuming cache line only on big arches amd64/arm64, and empty on other.

Ed once suggested to have a kernel build option like LARGE vs. SMALL which would allow to override arch defaults.

sys/kern/kern_thread.c
624 ↗(On Diff #79481)

We have taskqueue_enqueue_timeout(9) that hides this machinery.

mjg marked an inline comment as done.Nov 13 2020, 10:27 AM

Linux has "aligned_in_smp" or similarly named macro. The above could be __aligned_if_smp(CACHE_LINE_SIZE) with exact value passed in case someone wants bigger alignment. So this would still align on "small" machines as long a they smp but I think that's fine.

sys/kern/kern_thread.c
624 ↗(On Diff #79481)

I have seen it. The taskqueue callback is only there just in case and it looks like it is more expensive to execute than mere callout, but I'm not going to insist one way or the other.

  • add __aligned_if_smp

This will be committed separately.

sys/kern/kern_thread.c
495 ↗(On Diff #79505)

I wrote D27207 to try and address the need for uma_item_domain(). With that you'd write

tdd = &thread_domain_data[vm_phys_domain(vtophys(td))];

and only need to include <vm/pmap.h> and <vm/vm_phys.h>, which I think is reasonable.

sys/kern/kern_thread.c
495 ↗(On Diff #79505)

I really think this should be combined to vtodomain() or similar. For example the kernel can start handing out VAs which encode the target domain or there may be some other optimization which elides the need to grab the physical address.

sys/kern/kern_thread.c
495 ↗(On Diff #79505)

vtodomain() would just expand to what I wrote above, and I'm not sure yet where it should go since it depends on both the pmap and vm_phys modules.

Sure, additional optimizations are possible in some cases, but first I want the existing KPIs to be a bit cleaner.

sys/kern/kern_thread.c
495 ↗(On Diff #79505)

That's fine, the point is that should anything change here down the road consumers will only need to be recompiled to take advantage of it.

sys/kern/kern_thread.c
495 ↗(On Diff #79505)

Assuming that generic optimizations are sufficient, yes. If you start doing things like encoding a domain ID in the VA, then you'd want more specialized interfaces anyway. I'm not sure how that should look, so I prefer to punt on it for now.

sys/kern/kern_thread.c
495 ↗(On Diff #79505)

again, it was just an example, general point being to hide the detail of translations from the consumer. I don't understand where the resistance to a vtodomain (or whatever other name) is coming from, but I'm not going to insist. Just provide something to call and I'll use it.

sys/sys/cdefs.h
248 ↗(On Diff #79505)

I do not think it is right to predicate this on smp.
It should be an option, and I am fine with the option be on by default.

Anyway, it is not appropriate to do that in the scope of this review.

  • rebase
  • resolve domain using vm_phys_domain(vtophys(td))

I don't think optional size reduction is a blocker here.

This patch blocks other work.

markj added inline comments.
sys/kern/kern_thread.c
55 ↗(On Diff #79738)

taskqueue.h sorts before turnstile.h

This revision is now accepted and ready to land.Nov 19 2020, 4:48 AM

Can this per-domain reap mechanism be made more generic ? It is a lot of plumbing to handle one special case of system life. Why not e.g. process free or vmspace free benefit from NUMA affinity ?
You have a memory and destructor.

That's a rather bigger subject and out of scope here. So happens I already made argument that the current per-cpu caching of several data structures should instead be made per-domain. This would include thread, proc, mount points, thread stacks and a lot more. The review at hand mostly moves threads out of the way. I'll be posting more patches to the extent of more numa-friendly operation.

This revision was automatically updated to reflect the committed changes.