Aug 13 2019
Aug 6 2019
Jul 29 2019
Jul 21 2019
Jul 13 2019
In D20931#454068, @kib wrote:

> In D20931#453942, @jeff wrote:
>> In D20931#453715, @kib wrote:
>>> I think that the patch is functionally fine. It looks strange that:
>>> - For a newly created stack, you set the backing object domain policy to prefer the domain of the current CPU.
>>> - You do not encode any preference when allocating from cache.
>> When allocating from cache, the preference was recorded when the stack was inserted into the cache in kstack_import(). The object is initialized and pointed to by the pages.

So it is UMA_ZONE_NUMA which gives the locality when satisfying the allocation from cache?

> In D20931#453942, @jeff wrote:
>> I think the one thing that is questionable here is PREFER vs. FIXED. I should probably fix them both as PREFER. Do you agree?

I think yes.
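The PREFER/FIXED distinction settled above can be sketched in plain C. This is an illustrative model only: NDOMAIN, alloc_domain_prefer() and alloc_domain_fixed() are hypothetical names standing in for the kernel's DOMAINSET_PREF/DOMAINSET_FIXED machinery, not its actual implementation.

```c
#include <assert.h>

#define NDOMAIN 4

/* PREFER: try the preferred domain first, fall back to any other. */
static int
alloc_domain_prefer(int pref, const int avail[NDOMAIN])
{
	if (avail[pref])
		return (pref);
	for (int d = 0; d < NDOMAIN; d++)
		if (avail[d])
			return (d);
	return (-1);		/* nothing available anywhere */
}

/* FIXED: only the named domain will do; otherwise fail (or sleep). */
static int
alloc_domain_fixed(int fixed, const int avail[NDOMAIN])
{
	return (avail[fixed] ? fixed : -1);
}
```

The practical difference is what happens when the preferred domain is depleted: PREFER degrades locality, while FIXED blocks or fails, which is why both call sites above end up as PREFER.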
Address review feedback
Jul 12 2019
The kernel memory allocators do not check the thread's NUMA policy. To do so on every allocation would be prohibitively expensive. Instead, the zones have simpler policies that are always enforced the same way.
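The fixed per-zone policy described above can be modeled in a few lines of plain C. struct fake_zone and zone_pick_domain() are made-up names for illustration, not UMA's actual structures.

```c
#include <assert.h>

/*
 * Illustrative model (not the real UMA code) of a per-zone NUMA
 * policy fixed at zone creation, so each allocation does a cheap
 * switch on the zone instead of walking the calling thread's policy.
 */
enum zone_policy { ZPOL_ROUNDROBIN, ZPOL_FIRSTTOUCH };

struct fake_zone {
	enum zone_policy z_policy;	/* chosen once, at zone creation */
	int z_rr_cursor;		/* round-robin state */
};

static int
zone_pick_domain(struct fake_zone *z, int curcpu_domain, int ndomains)
{
	if (z->z_policy == ZPOL_FIRSTTOUCH)
		return (curcpu_domain);		/* locality: current CPU's domain */
	return (z->z_rr_cursor++ % ndomains);	/* spread across domains */
}
```

The point of the design is visible in the hot path: the decision is a branch on zone-local state, with no per-thread policy lookup.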
In D20928#453818, @jhb wrote:

> Hmm, do you require this going forward? I think it's probably fine, but changing the binding may break some assumption in the calling code. The purpose of the panic was to force the calling code to be aware of that and handle it (e.g. by unbinding before calling bus_bind_intr() or the like), so that the calling code knows about the change and can handle the migration if needed.
In D20931#453715, @kib wrote:

> I think that the patch is functionally fine. It looks strange that:
> - For a newly created stack, you set the backing object domain policy to prefer the domain of the current CPU.
> - You do not encode any preference when allocating from cache.

This is a small diff to fix a bug I ran into with another patch. There are very few cases where it is necessary. The other option would be to have a stack of some kind in sched_bind(), but that does not seem attractive.
Jul 10 2019
Jun 28 2019
My recollection is that when I implemented the cache there simply wasn't a lot of traffic for the other free pools and I didn't want to increase fragmentation. I have no objection to doing so now. I do believe we should have some separation for UMA NOFREE zones; whether that is a different pool or some other layer, I cannot say.
Feb 25 2019
We probably shouldn't enable NUMA on 32-bit platforms. It doesn't make a ton of sense there.
Feb 24 2019
Dec 18 2018
Nov 21 2018
I have a more complete version of a similar concept sitting in a local repo. Give me a few weeks to get it together and we can discuss.
Oct 30 2018
Thank you for taking the time to iterate and get this right.
Oct 17 2018
I like this. I still feel it would be nice to make kmem_ and malloc_ take a policy, possibly with an inline that converts a domain number into a preferred policy. There is likely no reason that kmem_ allocations should fail if the requested domain is unavailable.
Oct 11 2018
In D17420#373690, @markj wrote:

> In D17420#373676, @jeff wrote:
>> This is surprisingly simple. I am a little uneasy with the prospect of not doing round-robin on the keg's iterator. Can we not pass in an iterator somehow?
I tried to find a way to do that; the issue is that a) we want to update the iterator with the keg locked, and b) we want to be able to call vm_wait_doms() in vm_domainset_iter_policy(), but we have to drop the keg lock to do that.
We could pass the lock object in as a separate parameter, but I also want to minimize the number of parameters we pass to vm_domainset_*() since those functions are called frequently. Do you see a different approach?
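The trade-off discussed above, a shared iterator that must drop the caller's lock around a blocking wait, can be sketched as follows. All names here are illustrative, not the actual vm_domainset API; the lock is passed in as the extra-parameter option mentions.

```c
#include <assert.h>
#include <stddef.h>

#define NDOMAIN 4

struct dom_iter {
	int di_cursor;			/* shared round-robin position */
};

/* Pick the next domain; may drop and reacquire the caller's lock. */
static int
dom_iter_next(struct dom_iter *di, void (*unlock)(void *),
    void (*lock)(void *), void *lockarg, int need_wait)
{
	int d = di->di_cursor;

	di->di_cursor = (di->di_cursor + 1) % NDOMAIN;	/* advance under lock */
	if (need_wait && unlock != NULL) {
		unlock(lockarg);	/* drop the keg lock to sleep */
		/* ... a blocking wait for free pages would go here ... */
		lock(lockarg);		/* reacquire before returning */
	}
	return (d);
}

/* Walk the iterator n times without waiting; return the last pick. */
static int
dom_iter_demo(int n)
{
	struct dom_iter di = { 0 };
	int d = -1;

	for (int i = 0; i < n; i++)
		d = dom_iter_next(&di, NULL, NULL, NULL, 0);
	return (d);
}
```

The cost markj notes is visible here: every caller on the hot path carries the lock callbacks even though only the wait path uses them.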
In D17419#373699, @markj wrote:

> In D17419#371977, @mmacy wrote:
>> As I pointed out on IRC, this problem is not specific to hwpmc. Nonetheless it does fix the issue here.
Sure, busdma is modified in this review too. This kind of pattern is present in pcpu_page_alloc(), but that function doesn't get called with M_WAITOK so it doesn't pose the same problem as these instances. I'm not aware of any other potential issues.
In D17419#373675, @jeff wrote:

> I would prefer to make this change by changing the semantics of malloc_domain() to mean prefer.
> If people really want a single domain we can create another policy array that means 'only this domain'. I doubt it will see much, if any, use however.
Just to be clear: based on this comment and the one in D17418 you're suggesting keeping malloc_domainset() but having malloc_domain() be a wrapper that just selects DSET_PREF(domain)? I'm not opposed to that, but note that I did not convert malloc_domain(M_NOWAIT) callers (all of which are in busdma). I think it's probably fine to have them fall back to other domains though. That said, with your suggested change malloc_domain() would behave differently from kmem_malloc_domain(), uma_zalloc_domain() and vm_page_alloc_domain(), which is not ideal.
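The wrapper arrangement being discussed could look roughly like this. malloc_domainset(), DS_POLICY_PREFER and the struct layout are stand-ins for the real kernel interfaces; the body only illustrates the policy-selection idea, not the actual allocator.

```c
#include <assert.h>
#include <stdlib.h>

/* Stand-in types: the real struct domainset and policies differ. */
struct domainset {
	int ds_policy;
	int ds_domain;
};
#define DS_POLICY_PREFER	1
#define MAXMEMDOM		4

/* Stand-in for the real allocator, which dispatches on ds_policy. */
static void *
malloc_domainset(size_t size, struct domainset *ds, int flags)
{
	(void)ds;
	(void)flags;
	return (malloc(size));
}

/* Per-domain PREFER policy singletons, as a DSET_PREF(d) would provide. */
static struct domainset dset_pref[MAXMEMDOM];

/* malloc_domain() becomes sugar for "prefer this domain". */
static void *
malloc_domain(size_t size, int domain, int flags)
{
	dset_pref[domain].ds_policy = DS_POLICY_PREFER;
	dset_pref[domain].ds_domain = domain;
	return (malloc_domainset(size, &dset_pref[domain], flags));
}
```

With this shape, malloc_domainset() stays the primitive and the old entry point merely names a policy, which is the consistency concern raised about kmem_malloc_domain() and friends.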
Oct 10 2018
This is surprisingly simple. I am a little uneasy with the prospect of not doing round-robin on the keg's iterator. Can we not pass in an iterator somehow?
I would prefer to make this change by changing the semantics of malloc_domain() to mean prefer.
I'm not a fan of DSET_ but otherwise this LGTM.
Thank you for handling this.
Sep 24 2018
In D17305#368918, @markj wrote:

> In D17305#368915, @jeff wrote:
>> Can you performance test this? I was a little concerned with constantly touching the domainsets vs. touching something hopefully local and in cache.
This is a slow path and only executed when vmd_free_count crosses one of the thresholds.
Can you performance test this? I was a little concerned with constantly touching the domainsets vs touching something hopefully local and in cache.
Aug 27 2018
Fix the lock leak. I would like to get this committed to 12. I have not yet had feedback on the low domain avoidance code in the iterator. How do people feel about this?
Aug 22 2018
Aug 20 2018
This uses a more correct filter for vm_wait_severe in vm_glue.c.
Aug 19 2018
In D16799#357567, @alc wrote:

> In D16799#357566, @jeff wrote:
>> In D16799#357563, @alc wrote:
>>> In D16799#357540, @jeff wrote:
>>>> I did not do this before because I felt the ROI was low compared to the churn in ports. Have we contacted the ports maintainers? The nvidia and virtualbox ports are both going to need #ifdefs.
>>> I will follow up with the ports maintainers. I had two reasons for this. Primarily, I want to implement stronger segregation for physical memory allocations that are permanent, e.g., physical pages backing UMA_ZONE_NOFREE. Specifically, I don't want to have to modify the kmem callers to specify a different arena if that arena is simply a placeholder. Secondarily, we use the arenas somewhat differently in older branches, so mechanical MFCs may not do the correct thing anyway.
>> Do you intend to do kmem_malloc as well?

Yes, and kmem_free().

> In D16799#357566, @jeff wrote:
>> At this point in the release we should possibly also consult re@.

Sure, and I'll make it clear that this is not a functional change.
In D16799#357563, @alc wrote:

> In D16799#357540, @jeff wrote:
>> I did not do this before because I felt the ROI was low compared to the churn in ports. Have we contacted the ports maintainers? The nvidia and virtualbox ports are both going to need #ifdefs.

I will follow up with the ports maintainers. I had two reasons for this. Primarily, I want to implement stronger segregation for physical memory allocations that are permanent, e.g., physical pages backing UMA_ZONE_NOFREE. Specifically, I don't want to have to modify the kmem callers to specify a different arena if that arena is simply a placeholder. Secondarily, we use the arenas somewhat differently in older branches, so mechanical MFCs may not do the correct thing anyway.

I did not do this before because I felt the ROI was low compared to the churn in ports. Have we contacted the ports maintainers? The nvidia and virtualbox ports are both going to need #ifdefs.
Jul 10 2018
In D16191#343791, @alc wrote:

> In D16191#343789, @jeff wrote:
>> In D16191#343788, @kib wrote:
>>> In D16191#343766, @markj wrote:
>>>> In D16191#343719, @kib wrote:
>>>>> I do not think that we have a mechanism that would allow us to migrate the pages to other domains in this situation.
>>>> I think vm_page_reclaim_contig_domain() provides most of the machinery needed to implement such a mechanism, FWIW. If there is a NUMA allocation policy which only permits allocations from a specific domain set, we would also need a mechanism to indicate that a given page is "pinned" to that set and cannot be relocated.
>>> Do we need to stop forking if there is a severe domain? IMO if there is one non-severe domain then we can allow the fork to proceed. The process with a non-fitting policy would be stopped waiting for a free page in the severely depleted domain anyway. I think this check is more about preventing the kernel allocators from blocking on fork.
>> I still think the better question is: why are we allowing a domain preference to push into severe when another domain is completely unused? That's why I think the more general solution is on the allocator side.
>> For a single page it makes sense to look at the specific domains we may allocate from. But when we fork we have no idea what objects and policies may be involved, so I'm more reluctant to change that.

This actually strikes me as a scheduling problem. The forking thread should be temporarily migrated to an underutilized domain. That said, the right time to do that migration may be execve(), not fork().
In D16191#343788, @kib wrote:

> In D16191#343766, @markj wrote:
>> In D16191#343719, @kib wrote:
>>> I do not think that we have a mechanism that would allow us to migrate the pages to other domains in this situation.
>> I think vm_page_reclaim_contig_domain() provides most of the machinery needed to implement such a mechanism, FWIW. If there is a NUMA allocation policy which only permits allocations from a specific domain set, we would also need a mechanism to indicate that a given page is "pinned" to that set and cannot be relocated.

Do we need to stop forking if there is a severe domain? IMO if there is one non-severe domain then we can allow the fork to proceed. The process with a non-fitting policy would be stopped waiting for a free page in the severely depleted domain anyway. I think this check is more about preventing the kernel allocators from blocking on fork.
Jul 9 2018
The basic issue is that there are a handful of places where we test for 'any domain' in min or severe and it may need to be 'every domain we try to allocate from'.
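The "any domain" vs. "every domain we try to allocate from" distinction can be made concrete with two small predicates. The names and the bool-array representation are illustrative only, not the kernel's vm_page_count_severe()-style checks.

```c
#include <assert.h>
#include <stdbool.h>

#define NDOMAIN 4

/* "Some domain is severe": what the existing checks effectively test. */
static bool
any_domain_severe(const bool severe[NDOMAIN])
{
	for (int d = 0; d < NDOMAIN; d++)
		if (severe[d])
			return (true);
	return (false);
}

/* "Every domain we may allocate from is severe": the stricter test. */
static bool
all_allowed_severe(const bool severe[NDOMAIN], const bool allowed[NDOMAIN])
{
	for (int d = 0; d < NDOMAIN; d++)
		if (allowed[d] && !severe[d])
			return (false);
	return (true);
}
```

With one depleted domain out of four, the first predicate throttles everyone while the second only throttles allocations whose policy is confined to the depleted domain.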
In D16191#343542, @mjg wrote:

> yes, GB. Is there a problem reproducing the bug?
In D16191#343412, @mjg wrote:

> This does not fix the problem for me -- now things start wedging on 'vmwait'.

The affected machine has 512 GB of RAM and 4 nodes. Using the prog below with 256 MB passed (./a.out 262144) reproduces it. mmacy can be prodded to test on the box.
Jul 7 2018
Jul 6 2018
I would like to keep all cores set in cpuset_domain[0] in the !NUMA case so there are no surprises. Other than that this looks good to go.
Jul 3 2018
Jul 1 2018
In D16078#340860, @kib wrote:

> Type info can be recovered from the .o compiled with -g, using DWARF dump utilities, which we do not have in base.
When we discussed inlining critical_enter(9) with mjg, my opinion was that struct thread_lite only adds complications. It is good enough to have only the member offsets auto-generated and to manually calculate the addresses of td_critnest and td_owepreempt. It seems that I am the only one who thinks so; everybody else prefers thread_lite. genoffset.h is a good illustration of what I mean. BTW, what are the restrictions on the structure definitions which are processed by the script?
In D16078#340968, @imp wrote:

> So you have both thread_lite and the offset generator... Why?
Some cosmetic stuff but I'm happy with this patch. If you fix those issues I approve.
Jun 30 2018
Jun 26 2018
In D15985#339132, @avg wrote:

> IMO, it would be better to do changes this way. Thank you!
In D15985#339080, @avg wrote:

> In D15985#338992, @jeff wrote:
>> I had just forgotten about IPI_AST, but I like the way I have implemented it here better. It gives the remote scheduler a chance to look at what's going on and make a new decision.

I just hoped that maybe a smaller change could fix the problem.

Honestly, to me this change looks overly specialized towards the problem it solves (rather than a general improvement of the scheduling logic).

Also, I don't quite like that sched_shouldpreempt(), which was a pure function, now becomes a function with non-obvious side effects.
genassym already generates these and many other offsets; it would be better to use that than manual constants. There would be a little bit of extra build work to make this happen, but it should be trivial. We would also need to assert somewhere that sizes and types are correct. The generator could, however, generate both offset and type information for the required fields.
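The generated-offsets-plus-assertions idea can be sketched as below. struct thread here is a trivial stand-in (the kernel's is far larger), and TD_CRITNEST is what a generated genoffset.h entry might plausibly look like; none of this is the real generated output.

```c
#include <assert.h>
#include <stddef.h>

/* Trivial stand-in; the kernel's struct thread is far larger. */
struct thread {
	long td_dummy;
	int  td_critnest;
	char td_owepreempt;
};

/* What a generated header entry for this member might look like. */
#define TD_CRITNEST	offsetof(struct thread, td_critnest)

/* Assert the size assumption any raw-offset consumer relies on. */
_Static_assert(sizeof(((struct thread *)0)->td_critnest) == sizeof(int),
    "td_critnest is expected to be an int");

/* Manual address calculation, as in an inlined critical_enter(). */
static int *
td_critnest_ptr(struct thread *td)
{
	return ((int *)((char *)td + TD_CRITNEST));
}
```

The _Static_assert is the "assert that sizes and types are correct" piece: a struct change that breaks the raw-offset users fails at compile time instead of at runtime.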
Jun 25 2018
I think what I would like to do is commit this with the timeshare preempt delta disabled until I get more experience with it and see if I can reason out a better algorithm.
In D15985#338891, @avg wrote:

> Hmm, now I see a difference between 4BSD and ULE.
> If kick_other_cpu does not preempt the remote CPU, then it does this:
>
>     pcpu->pc_curthread->td_flags |= TDF_NEEDRESCHED;
>     ipi_cpu(cpuid, IPI_AST);
>
> On the other hand, tdq_notify either preempts or does nothing at all.
In D15985#338884, @avg wrote:

> @jeff I do not completely understand why in this scenario cksum runs behind the loop. Does cksum get a priority worse than the loop?
> Otherwise, I would expect that even without preemption TDF_NEEDRESCHED would produce a similar effect.
Jun 24 2018
Jun 23 2018
This has been resolved in current with a more complete fix.
In D15977#338278, @kib wrote:

> In D15977#338276, @jeff wrote:
>> The other thing to consider is how accurate it needs to be and already is. Which thread was scheduled in the last tick is just as arbitrary as this LRU. You just want to replace something that hasn't been used in a long time. I would guess we're more often looking at things that haven't been touched in seconds, or at least hundreds of milliseconds, than within a few ticks. I can measure that, but it will of course be grossly dependent on the workload and the amount of memory.

If we do not need an LRU, maybe we do not need the partpopq list at all? E.g. keep bins of generations per tick, up to some limited number of bins.
In D15977#338275, @kib wrote:

> How random does the order of the partpopq become? Is there any way to evaluate it?

I mean, a tick is a lot, so instead of only doing it at each tick, delegate the limited (?) sorting of the rvq_partpop queues by lasttick to a daemon.
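The bins-of-generations idea could be approximated as follows. NBINS, TICK_SHIFT and the function names are made up for illustration; this is not the vm_reserv code, just a sketch of replacing a strict LRU with coarse tick-keyed generations.

```c
#include <assert.h>

#define NBINS		8
#define TICK_SHIFT	5	/* ~32 ticks per generation bin */

/* Bin for a reservation whose last use was at 'lasttick'. */
static int
partpop_bin(int lasttick)
{
	return ((lasttick >> TICK_SHIFT) & (NBINS - 1));
}

/* Oldest bin to scan first when reclaiming at 'curtick'. */
static int
partpop_oldest_bin(int curtick)
{
	return ((partpop_bin(curtick) + 1) & (NBINS - 1));
}
```

Within a bin the order is arbitrary, which matches the observation above that replacement only needs "hasn't been used in a long time," not an exact LRU ordering.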
Some great stuff in here. Let's peel off parts while we perfect the rest.
Jun 15 2018
kqueue and select both use fget_unlocked. If you want to propose files without references for single-threaded programs, you are free to do so; you should raise it on arch@ as there is no real owner in this area. This patch further reduces the differences between select and poll, and reduces the number of atomics used in select, which I would argue is the more frequently used of the pair.