
Untangle TPR shadowing and APIC virtualization
Closed, Public

Authored by yamagi_yamagi.org on Dec 28 2019, 4:55 PM.

Details

Summary

A long-known problem with Bhyve is that Windows guests are rather slow. With Windows 10 1903 this became much worse, to the point that the guest is unusable. This is caused by Windows hammering on the %cr8 control register. For example, Windows 10 1909 on an i7-2620M makes about 68,000 %cr8 accesses per second. Each of them triggers a VM exit.

The most common solution for this is TPR shadowing. Bhyve already implements TPR shadowing. On AMD SVM it just works, but the implementation for Intel VT-x is bound to APIC virtualization. And APIC virtualization is a Xeon feature that is missing on most (all?) desktop CPUs.

The patch separates TPR shadowing from APIC virtualization, so TPR shadowing can be used on desktop CPUs as well. The patch doesn't just give a small speed boost, it's a difference like day and night. As an example, without the patch, the installation of Windows 10 1909 takes about 2280 seconds from start to first reboot. With the patch, it takes only 370 seconds. On an old Thinkpad X220, Windows 10 guests were previously unusable; now they are reasonably fast.

The patch does:

  • Add a new tunable 'hw.vmm.vmx.use_tpr_shadowing' to disable TPR shadowing. Also add 'hw.vmm.vmx.cap.tpr_shadowing' to be able to query if TPR shadowing is used.
  • Detach the initialization of TPR shadowing from the initialization of APIC virtualization. APIC virtualization still needs TPR shadowing, but not vice versa. Any CPU that supports APIC virtualization should also support TPR shadowing.
  • When TPR shadowing is used, the APIC page of each vCPU is written to the VMCS_VIRTUAL_APIC field of the VMCS so that the CPU can write directly to the page without intercept.
  • On vm exit, vlapic_update_ppr() is called to update the PPR.
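
The relationship between %cr8, the TPR in the virtual APIC page and the PPR that vlapic_update_ppr() keeps up to date can be sketched as follows. This is a minimal standalone illustration based on the architectural definitions (%cr8 bits 3:0 mirror TPR bits 7:4; the PPR is the TPR or the class of the highest in-service vector, whichever has the higher priority class). The helper names are hypothetical and are not taken from the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Priority class of a TPR value or an interrupt vector: bits 7:4. */
static inline uint8_t
prio_class(uint8_t v)
{
	return (v >> 4);
}

/* %cr8 carries only the priority class; it maps to TPR bits 7:4. */
static inline uint8_t
cr8_to_tpr(uint64_t cr8)
{
	assert(cr8 <= 0xf);	/* bits 63:4 of %cr8 are reserved */
	return ((uint8_t)(cr8 << 4));
}

static inline uint64_t
tpr_to_cr8(uint8_t tpr)
{
	return (prio_class(tpr));
}

/*
 * PPR as architecturally defined: the TPR if its class is at least that of
 * the highest in-service vector, otherwise that vector's class with the
 * low nibble cleared. isrvec is 0 when no interrupt is in service.
 */
static inline uint8_t
compute_ppr(uint8_t tpr, uint8_t isrvec)
{
	if (prio_class(tpr) >= prio_class(isrvec))
		return (tpr);
	return (isrvec & 0xf0);
}
```

With TPR shadowing the CPU updates the TPR in the virtual APIC page on its own, so a software-maintained PPR is stale until it is recomputed from the current TPR, which is why the update happens on VM exit.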

Test Plan

This patch has seen about 10 hours of testing on a 13-CURRENT host with 12.1-RELEASE and Windows 10 1909 guests. Host CPUs were a Core i7-2620M, a Core i7-6700k and a Xeon Silver 4110. I'll deploy it to several production servers when I'm back in the office in early January. It might be a good idea to do some testing on Linux guests.

I've run the second version of this patch for one week on several production hosts, some with APICv support and some without. Guests were Windows Server 2019, Windows Server 2012 R2, Windows 7, Ubuntu 18.04.3 and FreeBSD 12.1. I didn't observe any problems; both the hosts and the guests were stable. Guest memory consumption was about the same as without the patch.

Diff Detail

Repository
rS FreeBSD src repository - subversion

Event Timeline

I'm not the one to review this whole thing, but to my knowledge it looks good except for a single typo I found.

sys/amd64/vmm/intel/vmx.c
177 ↗(On Diff #66058)

s/shadowin /shadowing /

Typo

@yamagi_yamagi.org thank you for this patch. I tried it against 12.1-RELEASE (applied cleanly) and it is indeed like night and day on my i7-4771. While the disk benchmark numbers cannot convey how smooth and snappy the system feels with TPR shadowing, they are impressive themselves (3x!):

  • without TPR shadowing:
  • with TPR shadowing:

That said, when running this disk benchmark, I got an OOM on my system (32GB RAM) with nothing else running other than that Windows VM (4GB allocated RAM). I noticed significantly higher kernel memory usage even when just booting the Windows 10 guest to an idle state. Here are some rough numbers for Wired memory captured from top with TPR shadowing on (default) and off (loader hint) from the same patched kernel:

| TPR Shadowing | VM off | Windows idle | CDM launched | CDM test completed / still running before aborted |
| --- | --- | --- | --- | --- |
| on | 783M | 3770M | 4326M | 23G, add'l 4G Laundry, starting to use swap space |
| off | 779M | 2759M | 2766M | 23G, no crash, no swap used |
| on with kmem limits (see below) | 1174M | 3119M | 3124M | 7643M, no crash, no swap used |

CDM: Crystal Disk Mark 7.0 64bit

Noteworthy:

  • ~1GB add'l wired memory with TPR shadowing on just after booting Windows
  • ~550MB add'l memory after starting CDM (no tests run yet!)

I was able to get the system running stable by crudely limiting kernel memory:

vm.kmem_size="8G"
vm.kmem_size_max="8G"
vfs.zfs.arc_max="4G"
vfs.zfs.vdev.cache.size="100M"

I am not sure whether the performance improvements provided by TPR shadowing just triggered a memory issue I have on my system or whether it is an issue with the patch. I suspect it is at least related to the patch because:

  • booting to Windows idle alone is already a huge difference in wired memory
  • the test file size of CDM is only 1GB (default settings) and should fully and easily fit into ZFS ARC (ZFS being the usual culprit when systems run out of kernel memory...)
  • ARC was already limited to 8GB (no tuning of kmem_size, kmem_size_max, and vfs.zfs.vdev.cache.size) before applying the even tighter restrictions on kernel memory shown above (so the rows "on" and "off" in the table above were captured with 8GB ARC limit)

While I don't have much spare time and the system is my production home server, I am happy to do more testing, provide additional output, etc. Please let me know if you want me to do something specific.

Thanks again for that great patch! Hope to see it finalized and included in a future release soon as this is huge for Windows guests on non-server Intel platforms!
Michael

Hi Michael,
thank you for your feedback. So far I haven't seen any increased memory consumption and there is nothing in the patch that could explain it. Unless it uncovers another bug somewhere in the Bhyve code. Or I'm overlooking something, of course. I'll look into it.

I'm currently working on implementing TPR Thresholds, that'll save even more VMEXITs. My Github branch is here: https://github.com/Yamagi/freebsd/commits/wip/tpr_shadowing I'll send a new patch when the changes are ready. Maybe next week.

Regards,
Yamagi

I have tested various host CPUs and Windows versions as guests and do not see any increased memory consumption. I think the numbers are misleading: because Windows has many more CPU cycles available with TPR shadowing, it fills up the RAM allocated to the guest faster. For example, on a Win 10 1909 VM with 2GB of RAM it takes about 20 minutes after boot (idle, without any activity) to reach a memory usage of 2GB (on the host, measured by the 'RES' column in top) without the patch, but only 4 minutes with the patch. With CDM it is similar. Without the patch it takes several runs; with the patch, two are enough. But in the end the memory consumption is the same. Do you have the exact bhyve command line for me? I could try to reproduce the problem with it.

I use vm to launch bhyve. As I mentioned, the version is stock 12.1-RELEASE-p1. Please see the attached file for the exact bhyve invocation.

Windows definitely has more cycles now, so this is a theory worth pursuing. I have run CDM a couple of times now on the original setup (without TPR Shadowing) to see if Wired memory would go overboard but so far haven't been able to provoke the system going out of memory.
Please let me know how your additional testing goes and if there's anything I can test or share with you.

yamagi_yamagi.org edited the test plan for this revision.

Here's a second version, incorporating feedback received here and in private mail:

  • Fix the typo pointed out by darkfiberiru_gmail.com
  • Implement TPR thresholds. These save another bunch of unnecessary VMEXITs, giving another modest speedup. About 5 to 8 percent relative to the first patch, depending on what the guest is doing.
  • Ensure that the PPR isn't updated by the hypervisor if virtual_interrupt_delivery is set to 1. This prevents host panics caused by an inconsistent LAPIC state on CPUs that support APICv.
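
The idea behind TPR thresholds can be sketched roughly like this. With TPR shadowing, guest writes to %cr8 update the virtual TPR without a VM exit; the TPR threshold field of the VMCS lets the hypervisor ask for an exit only when a write lowers the TPR class below a chosen value. The policy shown here (use the class of the highest pending vector that is currently masked by the TPR) is an assumption made for illustration, and both function names are hypothetical; the patch itself may choose the threshold differently.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Pick a TPR threshold (a 4-bit priority class). pending_vec is the highest
 * pending vector, or 0 if nothing is pending. Hypothetical helper, not taken
 * from the patch.
 */
static inline uint8_t
choose_tpr_threshold(uint8_t vtpr, uint8_t pending_vec)
{
	uint8_t tpr_class = vtpr >> 4;
	uint8_t pend_class = pending_vec >> 4;

	/* Nothing pending, or the interrupt is not masked by the TPR. */
	if (pending_vec == 0 || pend_class > tpr_class)
		return (0);
	/* Masked: request an exit once the guest lowers the TPR far enough. */
	return (pend_class);
}

/* Models the CPU's check after a guest write to %cr8 with TPR shadowing. */
static inline bool
tpr_below_threshold_exit(uint8_t new_vtpr, uint8_t threshold)
{
	assert(threshold <= 0xf);	/* only bits 3:0 are defined */
	return ((new_vtpr >> 4) < threshold);
}
```

A threshold of 0 never triggers an exit, so a guest that only raises and lowers its TPR while nothing is pending runs without interception.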

Hi,
I've already successfully tested the first version on FreeBSD 12-STABLE and I'm now testing the second one.
If that one is good as well, I intend to commit this version to CURRENT, since it's a huge improvement, especially for Windows, and it didn't harm Linux.

I think this change breaks x2apic mode (and Yamagi might have a patch for that) so I'd hold off committing for a bit.

I have a very experimental patch that deactivates TPR shadowing if x2apic mode is used. I'll clean it up and update the review sometime next week.

Good to know. I'll wait for the improvements then.

When is X2APIC used BTW?
How to simulate/activate?

With this diff, I've gotten repeatable kernel panics on hw.model: Intel(R) Xeon(R) CPU E5-2667 v4 @ 3.20GHz immediately on bhyve invocation (of a linux vm):

Panic String: ISR and isrvec_stk out of sync

A backtrace etc. is available if that's helpful. I *think* this is from Monday Jan 13; my local git ref is d73f9b0ed8c5. I can confirm that later today.

# dmesg
[105] CPU: Intel(R) Xeon(R) CPU E5-2667 v4 @ 3.20GHz (3200.06-MHz K8-class CPU)
[105]   Origin="GenuineIntel"  Id=0x406f1  Family=0x6  Model=0x4f  Stepping=1
[105]   Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE>
[105]   Features2=0x7ffefbff<SSE3,PCLMULQDQ,DTES64,MON,DS_CPL,VMX,SMX,EST,TM2,SSSE3,SDBG,FMA,CX16,xTPR,PDCM,PCID,DCA,SSE4.1,SSE4.2,x2APIC,MOVBE,POPCNT,TSCDLT,AESNI,XSAVE,OSXSAVE,AVX,F16C,RDRAND>
[105]   AMD Features=0x2c100800<SYSCALL,NX,Page1GB,RDTSCP,LM>
[105]   AMD Features2=0x121<LAHF,ABM,Prefetch>
[105]   Structured Extended Features=0x21cbfbb<FSGSBASE,TSCADJ,BMI1,HLE,AVX2,SMEP,BMI2,ERMS,INVPCID,RTM,PQM,NFPUSG,PQE,RDSEED,ADX,SMAP,PROCTRACE>
[105]   Structured Extended Features3=0x9c000400<MD_CLEAR,IBPB,STIBP,L1DFL,SSBD>
[105]   XSAVE Features=0x1<XSAVEOPT>
[105]   VT-x: Basic Features=0xda0400<SMM,INS/OUTS,TRUE>
[105]         Pin-Based Controls=0xff<ExtINT,NMI,VNMI,PreTmr,PostIntr>
[105]         Primary Processor Controls=0xfff9fffe<INTWIN,TSCOff,HLT,INVLPG,MWAIT,RDPMC,RDTSC,CR3-LD,CR3-ST,CR8-LD,CR8-ST,TPR,NMIWIN,MOV-DR,IO,IOmap,MTF,MSRmap,MONITOR,PAUSE>
[105]         Secondary Processor Controls=0x77fff<APIC,EPT,DT,RDTSCP,x2APIC,VPID,WBINVD,UG,APIC-reg,VID,PAUSE-loop,RDRAND,INVPCID,VMFUNC,VMCS,XSAVES>
[105]         Exit Controls=0xda0400<PAT-LD,EFER-SV,PTMR-SV>
[105]         Entry Controls=0xda0400
[105]         EPT Features=0x6334141<XO,PW4,UC,WB,2M,1G,INVEPT,AD,single,all>
[105]         VPID Features=0xf01<INVVPID,individual,single,all,single-globals>
[105]   TSC: P-state invariant, performance statistics
[105] Data TLB: 2 MByte or 4 MByte pages, 4-way set associative, 32 entries and a separate array with 1 GByte pages, 4-way set associative, 4 entries
[105] Data TLB: 4 KB pages, 4-way set associative, 64 entries
[105] Instruction TLB: 2M/4M pages, fully associative, 8 entries
[105] Instruction TLB: 4KByte pages, 8-way set associative, 128 entries
[105] 64-Byte prefetching
[105] Shared 2nd-Level TLB: 4 KByte /2 MByte pages, 6-way associative, 1536 entries. Also 1GBbyte pages, 4-way, 16 entries
[105] L2 cache: 256 kbytes, 8-way associative, 64 bytes/line

A third revision, incorporating all feedback received here and in private mail:

  • Disable TPR Shadowing when x2apic mode is requested. This is currently more or less a no-op and / or precaution, because Windows guests require UEFI to boot and UEFI boot is broken in x2apic mode. All other guests have just a few writes to %cr8 at boot time.
  • Fix the panic reported by @dch. TPR Shadowing must not be used if Virtual Interrupt Delivery aka APICv is enabled.
  • Update the PPR only when interrupts are processed. It saves some CPU cycles and makes the code a little bit clearer.
  • Add a wrapper function vlapic_sync_tpr() around vlapic_update_ppr(). This makes the code clearer and vlapic_update_ppr() can stay private to the vlapic code.

Yamagi has mentioned some small optimizations that could be added to this (e.g. saving the TPR in the vcpu struct and only writing to the VMCS if there has been a change) but this can be done in future commits.
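
The optimization mentioned here (remember the last value and skip the VMCS write when nothing changed) could look roughly like the sketch below. Everything in it is hypothetical: the struct, the field names and the stand-in for the VMCS write are illustrative only and do not correspond to the actual vmm code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-vCPU cache; the real vcpu struct looks different. */
struct vcpu_tpr_cache {
	uint8_t	 last_tpr;	/* value most recently written */
	bool	 valid;		/* false until the first write */
	unsigned writes;	/* counts simulated VMCS writes */
};

/* Stand-in for the comparatively expensive VMCS write. */
static void
fake_vmcs_write_tpr(struct vcpu_tpr_cache *c, uint8_t tpr)
{
	c->last_tpr = tpr;
	c->valid = true;
	c->writes++;
}

/* Write only when the value changed; returns true if a write happened. */
static bool
sync_tpr_cached(struct vcpu_tpr_cache *c, uint8_t tpr)
{
	assert(c != NULL);
	if (c->valid && c->last_tpr == tpr)
		return (false);
	fake_vmcs_write_tpr(c, tpr);
	return (true);
}
```

The explicit valid flag makes sure the very first value is always written, even if it happens to be 0.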

This revision is now accepted and ready to land. Mar 1 2020, 4:53 AM

I'll do some more testing of the new version and commit it if successful.

This revision was automatically updated to reflect the committed changes.

Jason Tubnor has tested the MFC on 11-stable so I'll commit once acked by bz.

head/sys/amd64/vmm/io/vlapic.c
1088

I guess vlapic_update_ppr() should be placed below ops.pending_intr() to avoid usage when Virtual Interrupt Delivery is enabled?

head/sys/amd64/vmm/io/vlapic.c
1088

Yes. It's harmless but does unnecessary work in that case.
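
The ordering suggested in this inline comment can be sketched as follows: check for a hardware-provided pending-interrupt hook first and only fall back to the software PPR update when there is none. This is a heavily simplified, hypothetical model; the struct, its fields and the function names are made up for illustration and are not the real vlapic code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for struct vlapic; all fields are hypothetical. */
struct mini_vlapic {
	/* Set when the CPU evaluates pending interrupts itself (APICv). */
	bool	(*pending_intr_hook)(struct mini_vlapic *, int *);
	uint8_t	tpr, isrvec, ppr;
	uint8_t	highest_irr;	/* highest requested vector, 0 if none */
	unsigned ppr_updates;	/* counts software PPR updates */
};

/* Software PPR update: TPR or in-service class, whichever is higher. */
static void
mini_update_ppr(struct mini_vlapic *v)
{
	if ((v->tpr >> 4) >= (v->isrvec >> 4))
		v->ppr = v->tpr;
	else
		v->ppr = v->isrvec & 0xf0;
	v->ppr_updates++;
}

/* Example hook that reports "nothing pending" without touching the PPR. */
static bool
apicv_stub_hook(struct mini_vlapic *v, int *vecptr)
{
	(void)v;
	(void)vecptr;
	return (false);
}

static bool
mini_pending_intr(struct mini_vlapic *v, int *vecptr)
{
	assert(v != NULL);
	/* Hardware path first: no software PPR update is needed. */
	if (v->pending_intr_hook != NULL)
		return (v->pending_intr_hook(v, vecptr));

	mini_update_ppr(v);
	if ((v->highest_irr >> 4) > (v->ppr >> 4)) {
		if (vecptr != NULL)
			*vecptr = v->highest_irr;
		return (true);
	}
	return (false);
}
```

With the hook checked first, the software PPR update is skipped entirely when Virtual Interrupt Delivery is in use, which avoids the unnecessary work mentioned in the reply.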