Builds fine and doesn't cause any obvious breakage on my Ryzen. This feature does appear to be detected properly.
Sep 7 2017
Any feedback? I'd like to get this committed before I RMA my Ryzen CPU.
Sep 6 2017
In D12130#252053, @kib wrote:
> In D12130#252045, @truckman wrote:
> I don't think that there is a problem with ithread migration. As a matter of fact, the lack of any obvious problems with ithreads makes me suspect a TLB issue. Ithreads live in kernel memory space, which is going to be the same everywhere. I'm wondering whether a thread that has migrated to a core on the other CCX isn't always getting its TLB fully initialized before returning to userspace.
We do not re-initialize (flush is the common term) a thread's TLB before returning to userspace. We flush the TLB when switching contexts. More precisely, there are two variants of this.
On older CPUs, without the PCID feature, a reload of %cr3 (the page table base pointer) flushes all non-global entries from the TLB of the CPU thread that reloaded %cr3. Kernel-mode TLB entries are typically global (PG_G). The reload most often occurs on context switch; see cpu_switch.S.
On newer Intel CPUs, where the PCID feature is present, each TLB entry is additionally tagged with an address space ID, and on a switch we typically inform the CPU of the new address space ID, avoiding the flush.
One of the reasons I asked you about the verbose dmesg some time ago was to see which TLB switching mechanism is used, and apparently Ryzens do not implement PCID, although AMD CPUs have tagged TLB entries with ASIDs for SVM for a long time.
You could experiment by adding the equivalent of invltlb_glob() right after the %cr3 reload in cpu_switch. This should flush the whole TLB, including the kernel entries.
You already tried disabling superpage promotions, so other known workarounds like erratum383 should not be useful.
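For reference, a minimal sketch of what such a full flush amounts to, modeled on amd64's invltlb_glob(): clearing and restoring CR4.PGE invalidates all TLB entries, global (PG_G) ones included. The helper name here is illustrative, not the kernel's.

    /* Sketch of a full TLB flush including global entries, as done by
     * invltlb_glob() on amd64: toggling CR4.PGE discards everything. */
    #include <sys/param.h>
    #include <machine/cpufunc.h>
    #include <machine/specialreg.h>

    static __inline void
    flush_tlb_all_glob(void)
    {
            u_long cr4;

            cr4 = rcr4();
            load_cr4(cr4 & ~CR4_PGE);   /* drops global entries too */
            load_cr4(cr4);              /* re-enable global pages */
    }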
Sep 4 2017
Mark more members of struct tdq as volatile to prevent accesses
to them from being optimized out by the compiler. They are
written in one context and read in another, sometimes multiple
times in one function, and we want the reader to notice changes
if they happen.
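As a rough sketch of what that means (the field list is assumed from this discussion, not copied from the committed diff): fields written by one CPU and polled by another are marked volatile so each read re-fetches from memory.

    /* Illustrative only; see sched_ule.c for the real struct tdq. */
    struct tdq {
            volatile int    tdq_load;           /* thread count on the queue */
            volatile int    tdq_cpu_idle;       /* owning CPU is in the idle loop */
            volatile int    tdq_transferable;   /* threads eligible to migrate */
            volatile short  tdq_ipipending;     /* preempt IPI not yet handled */
            /* ... remaining members unchanged ... */
    };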
In D12130#253585, @cem wrote:
> In D12130#253523, @avg wrote:
>> In D12130#251884, @jeff wrote:
>> I have another patch that was a 2-3% perf improvement at Isilon that I have been meaning to backport, and that is somewhat at odds with this, although not entirely. It performs a search in the last thread to switch out rather than switching to the idle thread. This saves you a context switch into the idle thread and then back out again when stealing will be immediately productive. The code is relatively straightforward. It helps a lot when there are tons of context switches and lots of short-running threads coming and going.
> I would love to see that change land in the tree. It makes a lot of sense to elide a switch to the idle thread if we have an opportunity to steal a thread and switch to it directly.
https://people.freebsd.org/~cem/0001-Improve-scheduler-performance-and-decision-making.-F.patch if you want to work on adopting it. I've committed the standalone subr_smp.c portion already.
In D12130#253523, @avg wrote:
> In D12130#251884, @jeff wrote:
> I have another patch that was a 2-3% perf improvement at Isilon that I have been meaning to backport, and that is somewhat at odds with this, although not entirely. It performs a search in the last thread to switch out rather than switching to the idle thread. This saves you a context switch into the idle thread and then back out again when stealing will be immediately productive. The code is relatively straightforward. It helps a lot when there are tons of context switches and lots of short-running threads coming and going.
I would love to see that change land in the tree. It makes a lot of sense to elide a switch to the idle thread if we have an opportunity to steal a thread and switch to it directly.
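As a toy illustration of the approach (a self-contained model, not the actual sched_ule.c patch): when the local run queue drains, take a transferable thread from a peer queue directly instead of detouring through the idle thread.

    /* Toy model of "steal before idling". */
    struct toy_rq {
            int load;           /* runnable threads on this queue */
            int transferable;   /* how many of them may migrate */
    };

    /* Returns 1 if we stole work, 0 if local work remains, -1 to go idle. */
    static int
    pick_next(struct toy_rq *self, struct toy_rq *peers, int npeers)
    {
            int i;

            if (self->load > 0)
                    return (0);             /* run our own work */
            for (i = 0; i < npeers; i++) {
                    if (peers[i].transferable > 0) {
                            peers[i].transferable--;
                            peers[i].load--;
                            self->load++;
                            return (1);     /* switch straight to stolen thread */
                    }
            }
            return (-1);                    /* nothing to steal: idle */
    }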
In D12217#253500, @rozhuk.im-gmail.com wrote:
Sep 3 2017
Works on my Ryzen machine, so I'm no longer flying blind in terms of CPU temperature under load. Doing mental math or manually setting the temperature offset is fine for now.
Sep 2 2017
Should the 20C temperature offset for the 1700X and 1800X be set automatically? When I first got my 1700X, the motherboard had the original BIOS which did not account for the offset. It thought the CPU temperature was 50+C at idle and the fan ran at close to full speed even though I've got a massive heat sink on the CPU. A subsequent BIOS upgrade compensated for the offset and told me that the idle temperature was in the low 30C range and the fan speed was much more reasonable at idle.
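If it were done automatically, one plausible shape (purely illustrative; the structure, names, and brand-matching approach are assumptions, not the driver's code) is a per-model table of Tctl offsets, with the existing dev.amdtemp.N.rtc.sensor_offset sysctl still available as an override:

    /* AMD documents a 20 C Tctl offset on the X-series Ryzen parts. */
    struct amdtemp_offset {
            const char      *brand;     /* substring of CPU brand string */
            int             offset;     /* degrees C to subtract from Tctl */
    };

    static const struct amdtemp_offset amdtemp_offsets[] = {
            { "1700X",      20 },
            { "1800X",      20 },
            { NULL,         0 },
    };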
Aug 30 2017
I did some further investigation of the successful steals that happen when steal->tdq_load == 1 && steal->tdq_transferable != 0. That happened 1557 times during a buildworld run. Of those, steal->tdq_cpu_idle was nonzero 1099 times, and steal->tdq_ipipending was nonzero 196 times. It doesn't look like the source CPU being busy handling an interrupt was the issue. On my hardware, CPU 0 handles by far the majority of interrupts, but these events are distributed fairly evenly across all of the CPUs.
Aug 28 2017
As a test case, you might try building lang/ghc in parallel with some other load so that there is some thread migration activity. I'm not sure it is the same problem, but ghc seems to be super sensitive and I pretty much always see SIGBUS failures on my Ryzen machine.
AMD has publicly been very vague about this problem. About the only thing they've said is that there is some sort of "performance marginality" and that Threadripper is not affected. Since Threadripper uses the same die stepping as Ryzen, it looks like they have figured out how to screen for parts with this problem. There is a long thread on the AMD forum here: https://community.amd.com/thread/215773?start=1095&tstart=0.
Basically this started as an investigation of system hangs, reboots, and oddball process failures on Ryzen as documented in this PR: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=219399. It turns out that there are actually multiple issues, and this PR ended up tracking the hang/reboot issue, which was eventually fixed. The remaining problem(s) were moved to this PR: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=221029.

After doing a number of experiments, I came to the conclusion that the majority of the problems were correlated with thread migration, and I discovered that disabling both kern.sched.balance and kern.sched.steal_idle got rid of most or all of the problems. That led me to start hacking on the scheduler code to see if there was any pattern to the problematic thread migrations, and I found that disabling the last loop iteration, which looks at all cores, alleviated the problems. Ryzen apparently has issues with its IRET instruction, so I wasn't sure if the problem was triggered by having interrupts disabled for such a long time or if the problem was due to migrating threads between the two CCXes. It also looked to me like there was an off-by-one bug in sched_highest(), so I asked about that in this email thread: https://docs.freebsd.org/cgi/getmsg.cgi?fetch=12902+0+/usr/local/www/mailindex/archive/2017/freebsd-arch/20170827.freebsd-arch (I don't know why my original message isn't in the archive).
Aug 1 2017
Lower sv_maxuser as well to prevent a user from mapping and loading
code into the page at 0x7ffffffff000. Executing code there can cause
the system to hang or silently reboot.
I just tried an experiment where I mapped the page at 0x7ffffffff000 and loaded it with some trivial code. I was able to execute that code once just fine, but if I loop on it so that the process spends most of its time executing code from that page, I get a nearly instant reboot.
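For anyone who wants to reproduce this, a minimal reconstruction of the experiment might look like the following (the address and the single RET opcode are x86-64 specific, and this is a sketch of the described test, not the exact code used; on affected Ryzens it can reboot the machine):

    /* Map the highest user page on amd64, drop a trivial function into
     * it, and spend most of our time executing from that page. */
    #include <sys/mman.h>
    #include <err.h>
    #include <string.h>

    int
    main(void)
    {
            void *addr = (void *)0x7ffffffff000UL; /* top user page */
            unsigned char code[] = { 0xc3 };       /* x86-64 RET */
            void (*fn)(void);

            if (mmap(addr, 4096, PROT_READ | PROT_WRITE | PROT_EXEC,
                MAP_ANON | MAP_FIXED | MAP_PRIVATE, -1, 0) == MAP_FAILED)
                    err(1, "mmap");
            memcpy(addr, code, sizeof(code));
            fn = (void (*)(void))addr;
            for (;;)                /* loop executing from that page */
                    fn();
            /* NOTREACHED */
    }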
Jul 31 2017
In D11780#244444, @ed wrote:
> Don, thanks for pointing me to this discussion!
> In our case shared pages tend to reside at those addresses, but doesn't this problem apply to simply any mapping there? As in, even if you lowered the mapping of the shared page, people could still use mmap() to place a page at the very top and exploit this issue, right? In other words, shouldn't we lower VM_MAXUSER_ADDRESS instead?
> Right now VM_MAXUSER_ADDRESS is assumed to be a constant, so lowering it is going to be annoying. That said, is it really worth the trouble of distinguishing its value based on the CPU that's currently in use? I personally wouldn't mind losing a single page of virtual memory.
Jul 30 2017
Update patch for linux change r321728.
cloudabi doesn't appear to use the shared page.
Add the amd64_lower_shared_page() prototype to md_var.h.
Renamed elf64_freebsd_sysentvec_fixup() to amd64_lower_shared_page(),
changed it to use decrements, and reused it in the Linux code. Where
should its prototype be declared?
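Based on the description above, the fixup presumably has roughly this shape (a sketch; the exact set of sysentvec fields adjusted is assumed from this discussion):

    /* Pull the shared page, and everything anchored to it, down one
     * page so nothing user-visible is mapped at the top of user VA. */
    #include <sys/param.h>
    #include <sys/sysent.h>

    void
    amd64_lower_shared_page(struct sysentvec *sv)
    {
            sv->sv_maxuser -= PAGE_SIZE;
            sv->sv_shared_page_base -= PAGE_SIZE;
            sv->sv_usrstack -= PAGE_SIZE;
            sv->sv_psstrings -= PAGE_SIZE;
    }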
I can't say that I'm enthusiastic about putting the fixup function in initcpu.c and polluting it with <sys/sysent.h> stuff. All of the other code in this file pretty much sticks to poking at the CPU itself.
One thing that I like about the way the tunable works is that the mode is auto if it is unset by the loader. Setting it in the loader forces the chosen mode. The sysctl value reflects the actual mode chosen by the combination of the tunable and the cpuid test.
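A minimal sketch of that tunable/sysctl pattern (the knob and variable names are illustrative, not necessarily what the patch uses; -1 means auto):

    /* -1 = auto (decide via CPUID), 0/1 = mode forced from the loader.
     * CTLFLAG_RDTUN makes one knob both a loader tunable and a
     * read-only sysctl. */
    #include <sys/param.h>
    #include <sys/kernel.h>
    #include <sys/sysctl.h>

    static int lower_sharedpage = -1;
    SYSCTL_INT(_hw, OID_AUTO, lower_amd64_sharedpage, CTLFLAG_RDTUN,
        &lower_sharedpage, 0,
        "Lower the top page of user memory (Ryzen workaround)");

    /* At init time, "auto" is resolved from the CPU identification, so
     * the sysctl ends up reporting the mode actually in effect: */
    static void
    lower_sharedpage_init(void)
    {
            if (lower_sharedpage == -1)     /* auto: Ryzen is family 17h */
                    lower_sharedpage = (cpu_vendor_id == CPU_VENDOR_AMD &&
                        CPUID_TO_FAMILY(cpu_id) == 0x17);
    }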
About my inline comment, the patch was working, but I was really puzzled by why the trampoline address was changing by more than PAGE_SIZE. Even when I used the tunable to disable lowering of the shared page (and even commented out the code that did the adjustment), the lower bits of the trampoline address were going from 0x190 to 0x000. The behavior depended on whether or not the new SYSINIT was present in the code. It took me quite a while, but I eventually figured out that adding the new SYSINIT was perturbing the ordering of the other SYSINITs. In the original code, the 32-bit sysvec was getting initialized first, so the 32-bit stuff would get loaded into the start of the shared page. When the new SYSINIT is added, the 64-bit sysvec is initialized first and the 64-bit trampoline is the first item in the shared page.
The IS_BSP() check isn't sufficient because that code fragment also gets executed on resume and everything would get decremented again.
May 24 2017
I just got a failure with this patch on 12.0-CURRENT r318776 amd64.
Apr 4 2017
On a system with
CPU: AMD Ryzen 7 1700X Eight-Core Processor (3393.69-MHz K8-class CPU)
I get the following results:
sysctl dev.amdtemp
dev.amdtemp.0.rtc.sensor_offset: 0
dev.amdtemp.0.rtc.PerStepTimeUp: 0
dev.amdtemp.0.rtc.PerStepTimeDn: 0
dev.amdtemp.0.rtc.TmpMaxDiffUp: 0
dev.amdtemp.0.rtc.TmpSlewDnEn: 0
dev.amdtemp.0.rtc.CurTmpTjSel: -49.0C
dev.amdtemp.0.rtc.CurTmp: 0.1C
dev.amdtemp.0.%parent: hostb10
dev.amdtemp.0.%pnpinfo:
dev.amdtemp.0.%location:
dev.amdtemp.0.%driver: amdtemp
dev.amdtemp.0.%desc: AMD CPU On-Die Thermal Sensors
dev.amdtemp.%parent:
Jul 12 2016
Looks fine to me.
Jul 5 2016
In D6928#148146, @ralsaadi_swin.edu.au wrote:
> Should I remove DN_BH_WLOCK()/DN_BH_WUNLOCK() and use atomic_add_int() instead?
> Is the patch reasonable now?
Jul 4 2016
In D6928#148012, @ralsaadi_swin.edu.au wrote:
> Sorry if my thought is incorrect, but just to explain my idea of using DN_BH_WLOCK():
> 1- ref_count (namely pie_desc.ref_count) is updated only by the PIE module and checked (read access) by unload_dn_sched() in ip_dummynet.c.
> 2- ref_count in unload_dn_sched() is accessed with DN_BH_WLOCK() held.
> 3- ref_count in the PIE module is updated while DN_BH_WLOCK() is held (either explicitly, by calling DN_BH_WLOCK() in the PIE module, or implicitly, by Dummynet somewhere before calling the PIE functions).
> So, is the lock I added still insufficient?
> Anyway, using atomic_add_int() is much easier than dealing with deadlocks ;-)
> Regarding the "small race that remains", I agree with you that this race could happen. However, changing the Dummynet lock type is beyond my knowledge.
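For reference, the atomic alternative under discussion would look roughly like this (a sketch; pie_desc.ref_count is from the dummynet PIE code, but the exact call sites are assumed):

    #include <machine/atomic.h>

    /* PIE module, instead of:
     *   DN_BH_WLOCK(); pie_desc.ref_count--; DN_BH_WUNLOCK();  */
    atomic_subtract_int(&pie_desc.ref_count, 1);

    /* ...and unload_dn_sched() can then check the count without
     * taking the lock: */
    if (atomic_load_acq_int(&pie_desc.ref_count) != 0)
            return (EBUSY);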
In D6928#147905, @ralsaadi_swin.edu.au wrote:
> If PIE ref_count is decremented by the user thread after callout_reset_sbt(), how do we guarantee that pie_callout_cleanup() has finished its execution?