Sep 25 2023
From Ryzen 3000 onward, AMD's MMU automatically coalesces four contiguous 4 KB virtual pages mapping four contiguous 4 KB physical pages, where both runs are 16 KB-aligned, into a single 16 KB TLB entry. With this change, the physical contiguity and alignment are guaranteed. Could we dynamically adjust the guard size at runtime to achieve the required virtual alignment?
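To make the alignment requirement concrete, here is a minimal C sketch, with illustrative names and constants that are not taken from the patch under review, of the eligibility test and of padding a stack guard so that the pages above it begin on a coalescing boundary:

#include <stdbool.h>
#include <stdint.h>

#define PAGE_SIZE	4096u
#define COALESCE_SIZE	(4 * PAGE_SIZE)	/* four 4 KB pages = 16 KB */

/*
 * A four-page run at (va, pa) is eligible for TLB coalescing only when
 * both its virtual and its physical start address are 16 KB-aligned.
 */
static bool
coalescible(uint64_t va, uint64_t pa)
{
	return (((va | pa) & (COALESCE_SIZE - 1)) == 0);
}

/*
 * Grow a guard size just enough that the first page above the guard
 * lands on a 16 KB virtual boundary (hypothetical helper).
 */
static uint64_t
pad_guard(uint64_t guard_start, uint64_t guard_size)
{
	uint64_t end = guard_start + guard_size;

	return (guard_size +
	    ((COALESCE_SIZE - (end & (COALESCE_SIZE - 1))) &
	    (COALESCE_SIZE - 1)));
}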
In D38852#956753, @kib wrote:
So what is the main purpose of this change? To get linear, without gaps, pindexes for the kernel stack obj?
Sep 24 2023
Retire PMAP_INLINE. It's unused.
Sep 23 2023
pmap: optimize MADV_WILLNEED on existing superpages
Sep 22 2023
In D41635#955565, @bojan.novkovic_fer.hr wrote:
Address @alc's comments.
Sep 18 2023
@markj I think that you should apply 34eeabff5a8636155bb02985c5928c1844fd3178 to i386 and riscv. Otherwise, I suspect that there will be assertion failures in places like pmap_remove_pde():
KASSERT(vm_page_all_valid(mpte), ("pmap_remove_pde: pte page not promoted"));
Sep 11 2023
In D41344#946751, @kib wrote:
In D41344#946693, @dougm wrote:
I would appreciate feedback here. I'm afraid that the only (unwritten, verbal) feedback I have so far is that this change is only acceptable if I strip all the code out of subr_pctrie.c and inline it in pctrie.h, so that all that code can be generated over again for every user of pctrie.h. Is that a consensus position (of everyone but me)?
Could you elaborate on the motivation for that answer, please? To me it sounds fine to share pctrie code as functions.
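For readers outside the thread, the disagreement is whether the trie logic should exist once as shared functions (as subr_pctrie.c provides today) or be re-expanded for every consumer from the header. A toy C illustration of the two shapes, using hypothetical names that are not the real pctrie interface:

#include <stdint.h>
#include <stdio.h>

/* Shape A: one shared function; every consumer calls the same copy. */
static uint64_t
trie_keydiff(uint64_t a, uint64_t b)
{
	return (a ^ b);		/* stand-in for the real trie logic */
}

/* Shape B: a header-style macro that stamps out a private copy of the
 * same body for each consumer, which is what the quoted feedback
 * asked for. */
#define TRIE_KEYDIFF_DEFINE(name)				\
static uint64_t							\
name##_keydiff(uint64_t a, uint64_t b)				\
{								\
	return (a ^ b);						\
}

TRIE_KEYDIFF_DEFINE(vmpage)	/* expands to vmpage_keydiff() */

int
main(void)
{
	printf("%ju %ju\n", (uintmax_t)trie_keydiff(5, 9),
	    (uintmax_t)vmpage_keydiff(5, 9));
	return (0);
}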
Sep 9 2023
Let me say up front that the worst of the address space creep has been addressed by the changes in place. However, I still see some behavior that I think should be changed. I'll elaborate on that later, but first I have a question: @kib, it has never been clear to me why we update anon_loc. Is it meant solely as an optimization to speed up the search for free address space, or is it done for some other reason?
This looks correct now. Thanks!
Sep 3 2023
In D35709#950481, @mhorne wrote:
In D35709#950480, @alc wrote:
One of my graduate students found that this change had a seriously bad side effect. Specifically, on a Ryzen processor, instead of being able to collect data from 6 counters simultaneously, he could only configure 3 counters. So, we backed out this change locally.
Thanks Alan, I have found the same, and I have a fix for it. The problem is that we now allocate the requested event twice on CPU 0, thus reducing the total number of available counters by two.
I will put the fix up for review within the next week, and make sure it is present in 14.0.
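A toy model of the bug just described, with entirely hypothetical names: allocating the event once explicitly on CPU 0 and then again in the loop over all CPUs binds two hardware counters on CPU 0 per requested event instead of one:

#include <stdio.h>

#define NCPU		4
#define NCOUNTERS	6	/* hardware counters per core */

static int used[NCPU];		/* counters consumed so far, per CPU */

/* Each allocation of an event binds one hardware counter. */
static void
alloc_event(int cpu)
{
	used[cpu]++;
}

int
main(void)
{
	alloc_event(0);			/* explicit allocation on CPU 0 */
	for (int cpu = 0; cpu < NCPU; cpu++)
		alloc_event(cpu);	/* ...and CPU 0 is allocated again */
	printf("CPU 0: %d of %d counters free\n",
	    NCOUNTERS - used[0], NCOUNTERS);	/* prints 4, not 5 */
	return (0);
}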
Aug 12 2023
In D41344#943626, @dougm wrote:
I can foresee using the vm_page's order field to encode an order, and thereby eliminate a level in the radix trie. Specifically, I believe that we could encode the order for allocated vm_pages as VM_NFREEORDER + the order
Then you would want to change the implementation of vm_radix_lookup to call pctrie_lookup_le, and use the order field and pindex of the found rnode to determine whether your lookup succeeded or not. You might also want to change vm_radix_insert, so that when you inserted that 16th page, vm_radix_insert could remove all the little pages that would be covered by the new big page, before inserting the new big page. And you might want to change vm_radix_remove too, but all those changes would involve calling pctrie_insert and pctrie_remove as necessary. And maybe the pctrie people could implement some kind of 'reclaim everything in this range' call, so that vm_radix could do it faster, somehow. I'm sure the fine people who work on the pctrie project would be happy to help, once you define what you want from them.
Suppose that we have 16 physically contiguous pages on amd64/arm64. I can foresee using the vm_page's order field to encode an order, and thereby eliminate a level in the radix trie. Specifically, I believe that we could encode the order for allocated vm_pages as VM_NFREEORDER + the order, and tweak the assertions in vm_phys.c to test for >= VM_NFREEORDER, rather than equality.
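A minimal sketch of that encoding, where the VM_NFREEORDER value and the helper names are illustrative stand-ins for the real definitions in vm_phys.h: free pages keep their buddy order, allocated pages store VM_NFREEORDER plus the allocation order, and the equality assertions become >= tests:

#include <assert.h>
#include <stdbool.h>

#define VM_NFREEORDER	13	/* illustrative value only */

/* Encode the order of an allocated page into the order field. */
static int
encode_alloc_order(int order)
{
	return (VM_NFREEORDER + order);
}

/* True if the field denotes an allocated (not free) page; the old
 * assertions compared with == VM_NFREEORDER instead. */
static bool
order_is_allocated(int field)
{
	return (field >= VM_NFREEORDER);
}

/* Recover the allocation order, e.g. 4 for a 16-page run. */
static int
decode_alloc_order(int field)
{
	assert(order_is_allocated(field));
	return (field - VM_NFREEORDER);
}

int
main(void)
{
	assert(decode_alloc_order(encode_alloc_order(4)) == 4);
	return (0);
}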
Jul 27 2023
As expected, this reduces the number of instructions in _vm_page_lookup by two. Before:
1f0: 48 89 f2                	movq	%rsi, %rdx
1f3: 48 d3 ea                	shrq	%cl, %rdx
1f6: 83 e2 0f                	andl	$0xf, %edx
1f9: 48 8b 44 d0 10          	movq	0x10(%rax,%rdx,8), %rax
1fe: 48 85 c0                	testq	%rax, %rax
201: 74 34                   	je	0x237 <vm_radix_lookup+0x57>
203: a8 01                   	testb	$0x1, %al
205: 75 26                   	jne	0x22d <vm_radix_lookup+0x4d>
207: 0f b6 50 0a             	movzbl	0xa(%rax), %edx
20b: 48 8d 0c 95 00 00 00 00 	leaq	(,%rdx,4), %rcx
213: 48 83 fa 0e             	cmpq	$0xe, %rdx
217: 77 d7                   	ja	0x1f0 <vm_radix_lookup+0x10>
219: 48 c7 c2 f0 ff ff ff    	movq	$-0x10, %rdx
220: 48 d3 e2                	shlq	%cl, %rdx
223: 48 21 f2                	andq	%rsi, %rdx
226: 48 3b 10                	cmpq	(%rax), %rdx
229: 74 c5                   	je	0x1f0 <vm_radix_lookup+0x10>
After:
2e0: 48 89 f2                	movq	%rsi, %rdx
2e3: 48 d3 ea                	shrq	%cl, %rdx
2e6: 83 e2 0f                	andl	$0xf, %edx
2e9: 48 8b 44 d0 10          	movq	0x10(%rax,%rdx,8), %rax
2ee: a8 01                   	testb	$0x1, %al
2f0: 75 26                   	jne	0x318 <vm_radix_lookup+0x48>
2f2: 0f b6 50 0a             	movzbl	0xa(%rax), %edx
2f6: 48 8d 0c 95 00 00 00 00 	leaq	(,%rdx,4), %rcx
2fe: 48 83 fa 0e             	cmpq	$0xe, %rdx
302: 77 dc                   	ja	0x2e0 <vm_radix_lookup+0x10>
304: 48 c7 c2 f0 ff ff ff    	movq	$-0x10, %rdx
30b: 48 d3 e2                	shlq	%cl, %rdx
30e: 48 21 f2                	andq	%rsi, %rdx
311: 48 3b 10                	cmpq	(%rax), %rdx
314: 74 ca                   	je	0x2e0 <vm_radix_lookup+0x10>
The effect on arm64 is similar.
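A speculative C reconstruction of what the two removed instructions correspond to, using a toy node layout that only approximates the real vm_radix internals: the before version must test each fetched child for NULL (the testq/je pair above), while the after version stores a tagged sentinel leaf in empty slots, so only the leaf-tag test survives per descent step:

#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy 4-bit radix node; the layout and names are illustrative only. */
struct tnode {
	int		shift;		/* bits consumed below this node */
	struct tnode	*child[16];
};

/* Leaves are tagged in pointer bit 0, as in the disassembly above. */
#define LEAF_TAG(v)	((struct tnode *)((uintptr_t)(v) | 1))
#define IS_LEAF(p)	(((uintptr_t)(p)) & 1)
#define LEAF_VAL(p)	((void *)((uintptr_t)(p) & ~(uintptr_t)1))

/* Before: every descent step also checks for an empty (NULL) slot. */
static void *
lookup_null_checked(struct tnode *node, uint64_t index)
{
	while (node != NULL && !IS_LEAF(node))
		node = node->child[(index >> node->shift) & 0xf];
	return (node == NULL ? NULL : LEAF_VAL(node));
}

/* After: empty slots hold a tagged sentinel leaf, never NULL, so the
 * per-step NULL test disappears and only the tag test remains. */
static struct tnode null_leaf_storage;
#define NULL_LEAF	LEAF_TAG(&null_leaf_storage)

static void *
lookup_sentinel(struct tnode *node, uint64_t index)
{
	while (!IS_LEAF(node))
		node = node->child[(index >> node->shift) & 0xf];
	return (node == NULL_LEAF ? NULL : LEAF_VAL(node));
}

int
main(void)
{
	int v = 42;
	struct tnode before_root = { .shift = 0 };	/* empty = NULL */
	struct tnode after_root = { .shift = 0 };	/* empty = sentinel */

	for (int i = 0; i < 16; i++)
		after_root.child[i] = NULL_LEAF;
	before_root.child[7] = LEAF_TAG(&v);
	after_root.child[7] = LEAF_TAG(&v);

	assert(lookup_null_checked(&before_root, 7) == &v);
	assert(lookup_null_checked(&before_root, 3) == NULL);
	assert(lookup_sentinel(&after_root, 7) == &v);
	assert(lookup_sentinel(&after_root, 3) == NULL);
	return (0);
}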
Jul 25 2023
I'd like to suggest this patch instead:
diff --git a/sys/vm/vm_map.c b/sys/vm/vm_map.c
index 444e09986d4e..eb607d519247 100644
--- a/sys/vm/vm_map.c
+++ b/sys/vm/vm_map.c
@@ -2255,10 +2255,10 @@ vm_map_find_min(vm_map_t map, vm_object_t object, vm_ooffset_t offset,
 	int rv;
Jul 20 2023
I'm okay with this. I just reread the mmap(2) description of MAP_STACK, and I don't think that it needs to change to mention the behavior implemented here explicitly. This change implements the behavior that I believe a reader would most likely expect.
Jul 19 2023
I think that the overloading of the vm map entry fields should be mentioned where the structure is defined.
Jul 18 2023
In D40936#934985, @dougm wrote:
original timing:
52450.199u 1452.606s 59:09.38 1518.6% 73478+3095k 121089+33120io 110807pf+0w
modified timing:
52599.911u 1454.881s 59:17.88 1519.2% 73479+3096k 121017+34297io 110758pf+0w
Do you have any idea why the timings are so different here? Is the difference consistent across multiple runs?
I ran 3 more buildworld tests on the modified and unchanged kernels immediately after reboots, with no instrumentation added to count performance stats for any particular function. The results:
Modified:
root@108-254-203-203:/usr/src # time make -j16 buildworld >& /dev/null
52357.285u 1448.444s 59:11.33 1515.0% 73477+3095k 120995+33438io 110815pf+0w
root@108-254-203-203:/usr/src # time make -j16 buildworld >& /dev/null
52553.118u 1456.257s 59:19.67 1517.2% 73472+3096k 121217+33609io 110821pf+0w
root@108-254-203-203:/usr/src # time make -j16 buildworld >& /dev/null
52547.476u 1452.889s 59:17.54 1517.9% 73470+3095k 121178+33245io 110870pf+0w
Original:
root@108-254-203-203:/usr/src # time make -j16 buildworld >& /dev/null
52602.552u 1452.130s 59:20.49 1518.1% 73477+3096k 121068+33157io 110855pf+0w
root@108-254-203-203:/usr/src # time make -j16 buildworld >& /dev/null
52656.392u 1451.635s 59:27.48 1516.7% 73471+3096k 121118+33354io 110845pf+0w
root@108-254-203-203:/usr/src # time make -j16 buildworld >& /dev/null
52601.014u 1445.216s 59:23.55 1516.6% 73478+3095k 121121+33611io 110844pf+0w
Jun 29 2023
Address Andrew's comment.
Jun 28 2023
In D40782#928081, @kib wrote:
Does a TLB miss cause a cache miss?
Jun 26 2023
This change reduces the number of L1 DTLB misses on a Ryzen 5900X during a "make buildkernel" by 14.3%, in large part because it makes the processor's TLB coalescing feature more effective.
Jun 24 2023
In D40744#927052, @markj wrote:
Is there any reason not to make the same change on arm64? (Or riscv, for that matter.)