Page MenuHomeFreeBSD

Lower shared page for amd64 on Ryzen to work around bug with code near top of user space
ClosedPublic

Authored by truckman on Jul 30 2017, 3:50 AM.
Tags
None
Referenced Files
Unknown Object (File)
Tue, Nov 19, 2:02 PM
Unknown Object (File)
Mon, Nov 18, 12:48 AM
Unknown Object (File)
Thu, Nov 14, 9:00 AM
Unknown Object (File)
Tue, Nov 12, 1:16 PM
Unknown Object (File)
Tue, Nov 12, 11:05 AM
Unknown Object (File)
Mon, Nov 11, 11:41 PM
Unknown Object (File)
Mon, Nov 11, 11:40 PM
Unknown Object (File)
Fri, Nov 8, 9:46 AM

Details

Summary

Ryzen (AMD Family 17h) shows stability issues if code is executed
near the top of user space. In our case that is the signal
trampoline that resides in the amd64 shared page.

Move the shared page down by one page on Ryzen as a workaround.

The Linux changes are untested.

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

sys/amd64/amd64/elf_machdep.c
97–99 ↗(On Diff #31339)

The presence of the SYSINIT causes the address returned by KERN_PROC_SIGTRAMP to change from 0x7ffffffff190 to 0x7ffffffff000, even if hw_lower_amd64_sharedpage is 0, and even if the body of elf64_freebsd_sysentvec_fixup(() is empty! Changing to an earlier subsystem does not make a difference.

I ran into the same problem when I was using SI_SUB_TUNABLES with TUNABLE_INT_FETCH() in identcpu.c even with the changes to elf_machdep.c reverted.

Hmm, I am not sure what do you mean. My original proposal was to do the following:

diff --git a/sys/amd64/amd64/initcpu.c b/sys/amd64/amd64/initcpu.c
index 63c0f20f492..2c95ffa84a9 100644
--- a/sys/amd64/amd64/initcpu.c
+++ b/sys/amd64/amd64/initcpu.c
@@ -35,6 +35,7 @@ __FBSDID("$FreeBSD$");
 #include <sys/param.h>
 #include <sys/kernel.h>
 #include <sys/pcpu.h>
+#include <sys/sysent.h>
 #include <sys/systm.h>
 #include <sys/sysctl.h>
 
@@ -45,6 +46,8 @@ __FBSDID("$FreeBSD$");
 #include <vm/vm.h>
 #include <vm/pmap.h>
 
+extern struct sysentvec elf64_freebsd_sysvec;
+
 static int	hw_instruction_sse;
 SYSCTL_INT(_hw, OID_AUTO, instruction_sse, CTLFLAG_RD,
     &hw_instruction_sse, 0, "SIMD/MMX2 instructions available in CPU");
@@ -122,6 +125,12 @@ init_amd(void)
 			wrmsr(0xc0011020, msr);
 		}
 	}
+
+	if (IS_BSP() && CPUID_TO_FAMILY(cpu_id) == 0x17 /* check stepping */) {
+		elf64_freebsd_sysvec.sv_usrstack -= PAGE_SIZE;
+		elf64_freebsd_sysvec.sv_psstrings -= PAGE_SIZE;
+		elf64_freebsd_sysvec.sv_shared_page_base -= PAGE_SIZE;
+	}
 }
 
 /*

The IS_BSP() check is needed because this function is called for each CPU. I just verified (with a hack to move the fragment in different location) that I get both expected vm map and sigtramp report:

# procstat -v $$
  PID              START                END PRT  RES PRES REF SHD FLAG TP PATH
  24     0x7fffdfffe000     0x7ffffffde000 ---    0    0   0   0 ---- -- 
   24     0x7ffffffde000     0x7fffffffe000 rw-    6    6   1   0 C--D df 
   24     0x7fffffffe000     0x7ffffffff000 r-x    1    1   4   0 ---- ph 
# ./sysctl_sigtramp-64
signal trampoline 0x7fffffffe1e0 0x7fffffffe200

I prefer to explicitly check for the model and stepping rather then creating a variable. Linux module would copy this fragment.

The IS_BSP() check isn't sufficient because that code fragment also gets executed on resume and everything would get decremented again.

In addition to the stepping check, we might also want to check the microcode version.

I added the tunable so that the workaround could be disabled for testing any fixes (microcode or new silicon) without having to recompile
as well as testing for ABI incompatiblity issues on non-Ryzen hardware. Tunables are not set until after the BSP executes this code.

The IS_BSP() check isn't sufficient because that code fragment also gets executed on resume and everything would get decremented again.

Well, there is enough places where to record the fact that the adjustment was already made. E.g. sv_flags can dedicate one bit for this. I.e. I think that this should be local to other workarounds for CPU bugs.

At very least, I do not like that you assign precomputed value and not decrement the existing initialized value.

In addition to the stepping check, we might also want to check the microcode version.

Might be.

I added the tunable so that the workaround could be disabled for testing any fixes (microcode or new silicon) without having to recompile
as well as testing for ABI incompatiblity issues on non-Ryzen hardware. Tunables are not set until after the BSP executes this code.

Also I do not understand your inline comment. Is the behavior you described present with the patch in the revision ? I.e. it does not work ?

About my inline comment,the patch was working, but I was really puzzled by why the trampoline address was changing by more than PAGE_SIZE. Even when I used the tunable to disable lowering of the shared page (and even commented out the code that did the adjustment), the lower bits of the trampoline address were going from 0x190 to 0x000. The behavior depended on whether or not the new SYSINIT was present in the code. It took me quite a while, but I eventually figured out that adding the new SYSINIT was perturbing the ordering of the other SYSINITs. In the original code, the 32-bit sysvec was getting initialized first, so the 32-bit stuff would get loaded into the start of the shared page. When a new SYSINIT is added, the 64-bit sysvec is initialized first and the 64-bit trampoline is the first item in the shared page.

I'm fine with changing the adjustment to a straight decrement. It might be nice to move this to a common function that can be shared by both native and Linux emulation. The cpuid check could be moved there, which would remove the changes from initcpu.c.

It would be nice to avoid the gory details in linux_vdso_install() by using the value of sv_shared_page_base, but we don't know the relative ordering of elf_linux_vdso_init vs the MOD_LOAD handler. Maybe the elf_linux_sysvec changes should be done in linux_vdso_install().

One thing that I like about the way the tunable works is that the mode is auto if it is unset by the loader. Setting it in the loader forces the chosen mode. The sysctl value reflects the actual mode chosen by the combination of the tunable and the cpuid test.

I'm fine with changing the adjustment to a straight decrement. It might be nice to move this to a common function that can be shared by both native and Linux emulation. The cpuid check could be moved there, which would remove the changes from initcpu.c.

I noted this above, this function should live in initcpu.c, near other workarounds.

It would be nice to avoid the gory details in linux_vdso_install() by using the value of sv_shared_page_base, but we don't know the relative ordering of elf_linux_vdso_init vs the MOD_LOAD handler. Maybe the elf_linux_sysvec changes should be done in linux_vdso_install().

linux_vdso uses the SHAREDPAGE constant directly, which I want to avoid. I do not see an obvious reason why cannot it be changed to use sv_shared_page_base. I believe dchagin can give the authentic advice.

I can't say that I'm enthusiastic about putting the fixup function in initcpu.c and polluting it with <sys/syent.h> stuff. All of the other code in this file pretty much sticks to poking at the CPU itself.

One issue of using a shaerd fixup function is that the prototype would seem to belong in md_var.h, but that gets included all over the place and many of those don't know about struct sysentvec.

Renamed elf64_freebsd_sysentvec_fixup() to amd64_lower_shared_page(),
changed it to use decrements, and reused it in the Linux code. Where
should its prototype be declared?

Expanded and clarified a comment in initcpu.c.

Do the Linux sysvec fixup in linux_vdso_install() instead of the MOD_LOAD
handler so that linux_vdso_install() can then use the value of
sv_shared_page_base instead of repeating the same calculation multiple
times.

sys/amd64/amd64/elf_machdep.c
49 ↗(On Diff #31351)

Where should this prototype be declared?

sys/amd64/linux/linux_sysvec.c
88 ↗(On Diff #31351)

Where should this prototype be declared?

sys/amd64/amd64/elf_machdep.c
49 ↗(On Diff #31351)

md_var.h is IMO fine, you need only to provide forward declaration for struct sysentvec.

sys/amd64/amd64/initcpu.c
138 ↗(On Diff #31351)

then then

linux_vdso uses the SHAREDPAGE constant directly, which I want to avoid. I do not see an obvious reason why cannot it be changed to use sv_shared_page_base. I believe dchagin can give the authentic advice.

I'll commit a proper fix soon

sys/amd64/amd64/elf_machdep.c
114 ↗(On Diff #31351)

I don't believe this change is necessary. I think that the only important order is that the shared page is lowered before the INIT_SYSENTVEC, as the latter uses sv_shared_page_base.

sys/amd64/amd64/elf_machdep.c
114 ↗(On Diff #31351)

I agree.

Add the amd64_lower_shared_page() prototype to md_var.h.

Undo the SYSINIT(elf64, ...) order change.

Add a comment noting that the shared page lowering must be done before
INIT_SYSENTVEC() because the latter uses the value of sv_shared_page_base.

Further comment cleanup in initcpu.c.

Note: the Linux changes compile but have not been tested.

dchagin requested changes to this revision.Jul 30 2017, 9:29 PM

Please, adjust Linux code to the r321728. And what about linux32?

This revision now requires changes to proceed.Jul 30 2017, 9:29 PM

I accept the non-linux part of the patch, for the linux emulator code Dmitriy is the authorative source.

To Dmitriy: 32bit processes cannot trigger the supposed issue, because they cannot execute a code near top of the user memory. They are limited to 4G.

BTW, there is also cloudabi which might also require treatment.

cloudabi doesn't appear to use the shared page.

I don't know how the stack address is set. The cloudabi64 code doesn't have any references to USRSTACK or sv_usrstack. I also don't know where the signal trampoline location is set. There are no references to sv_sigcode_base.

truckman edited edge metadata.

Update patch for linux change r321728.

The linux changes can be tested on non-Ryzen hardware by adding

hw.lower_amd64_sharedpage=1

to /boot/loader.conf.

Is it have same effect:

--- /usr/src/sys/amd64/include/vmparam.h	(revision 321734)
+++ /usr/src/sys/amd64/include/vmparam.h	(working copy)
@@ -176,7 +176,7 @@
 
 #define	VM_MAXUSER_ADDRESS	UVADDR(NUPML4E, 0, 0, 0)
 
-#define	SHAREDPAGE		(VM_MAXUSER_ADDRESS - PAGE_SIZE)
+#define	SHAREDPAGE		(VM_MAXUSER_ADDRESS - (2 * PAGE_SIZE))
 #define	USRSTACK		SHAREDPAGE
 
 #define	VM_MAX_ADDRESS		UPT_MAX_ADDRESS

?

Don, thanks for pointing me to this discussion!

In our case shared pages tend to reside at those addresses, but doesn't this problem apply to simply any mapping there? As in, even if you would lower the mapping of the shared page, people could still use mmap() to place a page at the very top and exploit this issue, right? In other words, shouldn't we lower VM_MAXUSER_ADDRESS instead?

Right now VM_MAXUSER_ADDRESS is assumed to be a constant, so lowering that is going to be annoying. That said, is it really worth the trouble distinguishing its value based on the CPU that's currently in use? I personally wouldn't mind to lose a single page of virtual memory.

Also to clarify what CloudABI does:

cloudabi doesn't appear to use the shared page.

I don't know how the stack address is set. The cloudabi64 code doesn't have any references to USRSTACK or sv_usrstack. I also don't know where the signal trampoline location is set. There are no references to sv_sigcode_base.

CloudABI doesn't use a signal trampoline (read: there is no support for UNIX signal handlers). It maps a vDSO at the very top of the address space. On CloudABI, the vDSO contains entry points for all of the individual system calls.

The sv_shared_page_* and sv_usrstack fields are derived from sv_maxuser and the actual size of the vDSO. They are not statically initialized in the sysvec, but set in cloudabi_vdso_init() instead.

In D11780#244444, @ed wrote:

Don, thanks for pointing me to this discussion!

In our case shared pages tend to reside at those addresses, but doesn't this problem apply to simply any mapping there? As in, even if you would lower the mapping of the shared page, people could still use mmap() to place a page at the very top and exploit this issue, right? In other words, shouldn't we lower VM_MAXUSER_ADDRESS instead?

Right now VM_MAXUSER_ADDRESS is assumed to be a constant, so lowering that is going to be annoying. That said, is it really worth the trouble distinguishing its value based on the CPU that's currently in use? I personally wouldn't mind to lose a single page of virtual memory.

My understanding is that the only issue is executing code up there. Good point about the possibility someone mapping that space and executing code up there. Maybe we should adjust sv_maxuser as well.

Changing this is an ABI change that could affect some existing binaries, but things seem to mostly be resilient to this change. The hope is that this is a temporary workaround until AMD releases a microcode fix and/or bug-fixed silicon.

Changing this is an ABI change that could affect some existing binaries

I was under the impression that sv_maxuser could be changed freely without affecting userspace at all. For example, on 32-bit systems sv_maxuser is set to 3 GB, whereas COMPAT_FREEBSD32 on x86-64 uses something larger.

Having sv_maxuser part of the ABI would be pretty bad. If I understand the idea behind x86-64 correctly, future generations of CPUs may have a larger virtual address space. Both halves may grow towards each other until they both become 2^63 bytes in size. Userspace should already prepare for this to happen to ensure forward compatibility.

This revision is now accepted and ready to land.Jul 31 2017, 6:18 AM
In D11780#244448, @ed wrote:

Changing this is an ABI change that could affect some existing binaries

I was under the impression that sv_maxuser could be changed freely without affecting userspace at all. For example, on 32-bit systems sv_maxuser is set to 3 GB, whereas COMPAT_FREEBSD32 on x86-64 uses something larger.

32bit compat gives whole 4G. Amusingly, some programs are sensitive to this.
It is not as bad as Windows, where defaults were to provide only low 2G which caused all addresses have representation of the positive integer, and MS have to maintain the layout unless explicitly allowed by the image.

We have some tests which rely on the possibility to mmap() at the top of the canonical addresses.

So I do not want to change the VA layout without the cause that is very strong. An unconfirmed CPU bug in one stepping of new AMD cores is not enough.

Having sv_maxuser part of the ABI would be pretty bad. If I understand the idea behind x86-64 correctly, future generations of CPUs may have a larger virtual address space. Both halves may grow towards each other until they both become 2^63 bytes in size. Userspace should already prepare for this to happen to ensure forward compatibility.

See above about 32bit. I believe there will be some standard way to mark an ELF image as compatible with the proposed 5-level paging structures, otherwise the current layout will be used.

N.B. To make it clear, I am fine with decrementing sv_maxuser when the workaround is activated, but if the visible consequences of the bug are only seen as a fault delivered to userspace, then it is overreaction. Practically, the page after the sharedpage mapping is almost impossible to allocate unless userspace specifically asks for it.

If I am wrong and the system integrity is compromised, then we have to reduce sv_maxuser for the affected machines.

I just tried an experiment where I mapped the page at 0x7ffffffff000 and loaded it with some trivial code. I was able to execute that code once just fine, but if I loop on it so that the process spends most of its time executing code from that page, I get a nearly instant reboot.

It looks like decrementing sv_maxuser is mandatory to prevent a trival local DoS.

truckman edited edge metadata.

Lower sv_maxuser as well to prevent a user from mapping and loading
code into the page at 0x7ffffffff000. Executing code there can cause
the system to hang or silently reboot.

This revision now requires review to proceed.Aug 1 2017, 4:14 PM
This revision is now accepted and ready to land.Aug 1 2017, 4:16 PM
This revision was automatically updated to reflect the committed changes.