Details

Reviewers

kib
markj

Commits

rS367842: thread: numa-aware zombie reaping

Summary

The current global list is a significant problem, in particular induces a lot of cross-domain thread frees. When running poudriere on a 2 domain box about half of all frees were of that nature.

Patch below introduces per-domain thread data containing zombie lists and domain-aware reaping. By default it only reaps from the current domain, only reaping from others if there is free TID shortage or nothing there reaped lingering threads.

Tested on pig1 (4 domains) with tinderbox and flix1 (2 domains) with poudriere, cross frees are basically eliminated.

Basic numa-awareness will enable other optimizations.

Test Plan

survived several hours of poudriere -j 104. almost no cross frees for threads.

Diff Detail

Repository

rS FreeBSD src repository - subversion

Lint

Lint Not Applicable

Unit

Tests Not Applicable

Event Timeline

mjg created this revision.Nov 12 2020, 12:36 AM

Herald added a subscriber: imp. · View Herald TranscriptNov 12 2020, 12:36 AM

mjg requested review of this revision.Nov 12 2020, 12:36 AM

Harbormaster completed remote builds in B34751: Diff 79448.Nov 12 2020, 12:36 AM

mjg added inline comments.Nov 12 2020, 12:52 AM

sys/vm/uma.h
304 ↗	(On Diff #79448)	This will be separate. Trying to re-create the needed functionality locally resulted in a lot of headers which needed to be pulled in.

markj added inline comments.Nov 12 2020, 2:34 PM

sys/kern/kern_thread.c
137 ↗	(On Diff #79448)	`static`
189 ↗	(On Diff #79448)	Can we avoid duplicating these five lines of code?
606 ↗	(On Diff #79448)	`1000` should instead be an expression that scales with `hz`. If the intent is to only reap once a second, I suggest rewriting this as `(u_int)(ticks - lticks) >= hz`.
624 ↗	(On Diff #79448)	Ditto.
sys/vm/uma_core.c
3268 ↗	(On Diff #79448)	The reason I suggested to open-code this is because this is not enforceable.

mjg added inline comments.Nov 12 2020, 2:55 PM

sys/kern/kern_thread.c
189 ↗	(On Diff #79448)	i don't see a handy way to do it and I don't think matters
606 ↗	(On Diff #79448)	it was smaller, but will dedup to the above later.
sys/vm/uma_core.c
3268 ↗	(On Diff #79448)	In that case this can be a general vm helper (vtodomain?). the main point is to not have to pull in half of the vm headers to get the result.

markj added inline comments.Nov 12 2020, 3:00 PM

sys/kern/kern_thread.c
189 ↗	(On Diff #79448)	Define `_thread_count_inc()` or so which does it and returns the result, in `thread_count_inc()`, use the result to decide whether to reap and try again.

Now that I wrote this I think it would be better to create per-domain thread reapers. The code would retain reaping in all the places it is doing now, remote reaping would still be a thing in case of resource shortage, but the ticks business would disappear -- all threads would be guaranteed to get gc'ed at some point.

sys/kern/kern_thread.c
189 ↗	(On Diff #79448)	I was thinking thread_count_inc_try

add per-domain workers
dedup nthread update

I'm looking for suggestion for helper name and placement for vaddr -> domain translation.

I can't help but notice kthread_add is highly deficient in that it does not allow to specify a cpuset. This will have to be fixed, but is not a blocker.

Also this spawns threads even for empty domains, I don't think it's a problem.

Note that useless thread consumes 12K of stack alone. That said, I also dislike over-zealous alignment specifications that eat memory on small systems.

sys/kern/kern_thread.c
141 ↗	(On Diff #79470)	I do not like this name. It implies too much for casual reader, while it only means that struct td storage is from specific domain. Perhaps 'struct_thread_domain' would be better. Or remove the function at all, it seems to be single use.
498 ↗	(On Diff #79470)	If you pass &thread_domain_data[i] instead of i, this gets rid of cast gymnastic and one more line in thread_reap_worker.

markj added inline comments.Nov 12 2020, 6:10 PM

sys/kern/kern_thread.c
625 ↗	(On Diff #79470)	I think this can cause shutdown to hang by up to 5s per domain. Instead of dedicated threads, you could use a self-arming callout to poll the per-domain queues and schedule a taskqueue thread to drain them.

replace threads with callout + taskqueue

mjg marked 8 inline comments as done.Nov 12 2020, 7:42 PM

mjg added inline comments.

sys/kern/kern_thread.c
141 ↗	(On Diff #79470)	renamed; there will be more uses

I can allocate the array dynamically based on vm_ndomains, but making the struct smaller is rather problematic but can be done with the machine is single core.

What is needed here is a per-domain allocator similar in use to DPCPU, but as I undertand nothing of the sort exists right now.

eliminate struct_thread_domain as markj agreed to just go with uma_item_domain

In D27185#607150, @mjg wrote:

I can allocate the array dynamically based on vm_ndomains, but making the struct smaller is rather problematic but can be done with the machine is single core.

What is needed here is a per-domain allocator similar in use to DPCPU, but as I undertand nothing of the sort exists right now.

I mean, unconditional use of __aligned(CACHE_LINE_SIZE) is detrimental for small/resource constrained arches and configurations. It should be some arch-dependent define, consuming cache line only on big arches amd64/arm64, and empty on other.

Ed once suggested to have a kernel build option like LARGE vs. SMALL which would allow to override arch defaults.

sys/kern/kern_thread.c
624 ↗	(On Diff #79481)	We have taskqueue_enqueue_timeout(9) that hides this machinery.

Linux has "aligned_in_smp" or similarly named macro. The above could be __aligned_if_smp(CACHE_LINE_SIZE) with exact value passed in case someone wants bigger alignment. So this would still align on "small" machines as long a they smp but I think that's fine.

mjg added inline comments.Nov 13 2020, 10:30 AM

sys/kern/kern_thread.c
624 ↗	(On Diff #79481)	I have seen it. The taskqueue callback is only there just in case and it looks like it is more expensive to execute than mere callout, but I'm not going to insist one way or the other.

add __aligned_if_smp

This will be committed separately.

markj added inline comments.Nov 13 2020, 4:30 PM

sys/kern/kern_thread.c
495 ↗	(On Diff #79505)	I wrote D27207 to try and address the need for uma_item_domain(). With that you'd write tdd = &thread_domain_data[vm_phys_domain(vtophys(td))]; and only need to include <vm/pmap.h> and <vm/vm_phys.h>, which I think is reasonable.

mjg added inline comments.Nov 13 2020, 4:45 PM

sys/kern/kern_thread.c
495 ↗	(On Diff #79505)	I really think this should be combined to vtodomain() or similar. For example the kernel can start handing out VAs which encode the target domain or there may be some other optimization which elides the need to grab the physical address.

markj added inline comments.Nov 13 2020, 4:53 PM

sys/kern/kern_thread.c
495 ↗	(On Diff #79505)	vtodomain() would just expand to what I wrote above, and I'm not sure yet where it should go since it depends on both the pmap and vm_phys modules. Sure, additional optimizations are possible in some cases, but first I want the existing KPIs to be a bit cleaner.

mjg added inline comments.Nov 13 2020, 4:56 PM

sys/kern/kern_thread.c
495 ↗	(On Diff #79505)	That's fine, the point is that should anything change here down the road consumers will only need to be recompiled to take advantage of it.

markj added inline comments.Nov 13 2020, 6:06 PM

sys/kern/kern_thread.c
495 ↗	(On Diff #79505)	Assuming that generic optimizations are sufficient, yes. If you start doing things like encoding a domain ID in the VA, then you'd want more specialized interfaces anyway. I'm not sure how that should look, so I prefer to punt on it for now.

mjg added inline comments.Nov 13 2020, 6:12 PM

sys/kern/kern_thread.c
495 ↗	(On Diff #79505)	again, it was just an example, general point being to hide the detail of translations from the consumer. I don't understand where the resistance to a vtodomain (or whatever other name) is coming from, but I'm not going to insist. Just provide something to call and I'll use it.

kib added inline comments.Nov 13 2020, 9:59 PM

sys/sys/cdefs.h
248 ↗	(On Diff #79505)	I do not think it is right to predicate this on smp. It should be an option, and I am fine with the option be on by default. Anyway, it is not appropriate to do that in the scope of this review.

mjg mentioned this in D27217: Add options COMPACT to produce smaller kernel.Nov 14 2020, 6:51 PM

rebase
resolve domain using vm_phys_domain(vtophys(td))

I don't think optional size reduction is a blocker here.

This patch blocks other work.

markj accepted this revision.Nov 19 2020, 4:48 AM

markj added inline comments.

sys/kern/kern_thread.c
55 ↗	(On Diff #79738)	taskqueue.h sorts before turnstile.h

This revision is now accepted and ready to land.Nov 19 2020, 4:48 AM

Can this per-domain reap mechanism be made more generic ? It is a lot of plumbing to handle one special case of system life. Why not e.g. process free or vmspace free benefit from NUMA affinity ?
You have a memory and destructor.

That's a rather bigger subject and out of scope here. So happens I already made argument that the current per-cpu caching of several data structures should instead be made per-domain. This would include thread, proc, mount points, thread stacks and a lot more. The review at hand mostly moves threads out of the way. I'll be posting more patches to the extent of more numa-friendly operation.

Closed by commit rS367842: thread: numa-aware zombie reaping (authored by mjg). · Explain WhyNov 19 2020, 10:01 AM

This revision was automatically updated to reflect the committed changes.

mjg added a commit: rS367842: thread: numa-aware zombie reaping.

thread: numa-aware zombie reaping
ClosedPublic
Actions

Details

Diff Detail

Event Timeline

Revision Contents
Changeset List

Diff 79745

head/sys/kern/kern_thread.c

thread: numa-aware zombie reapingClosedPublicActions

Details

Diff Detail

Event Timeline

Revision ContentsChangeset List

Diff 79745

head/sys/kern/kern_thread.c

thread: numa-aware zombie reaping
ClosedPublic
Actions

Revision Contents
Changeset List