Page MenuHomeFreeBSD

Add kvmclock timecounter support to FreeBSD
AbandonedPublic

Authored by jhb on Apr 1 2021, 11:42 AM.
Tags
None
Referenced Files
Unknown Object (File)
Mon, Mar 18, 6:42 PM
Unknown Object (File)
Mon, Mar 18, 1:19 PM
Unknown Object (File)
Tue, Mar 12, 5:15 AM
Unknown Object (File)
Tue, Mar 12, 5:05 AM
Unknown Object (File)
Tue, Mar 12, 5:05 AM
Unknown Object (File)
Tue, Mar 12, 4:59 AM
Unknown Object (File)
Tue, Mar 12, 4:59 AM
Unknown Object (File)
Tue, Mar 12, 4:59 AM

Details

Summary

Hello,

I was debugging what I thought was a zfs performance issue in KVM I had with FreeBSD 13.0-RC[0-4] and it turned out it wasn't a zfs / openzfs problem but a timecounter thing.

I initially looked at all tweakable settings of zfs, tried to enable scan_legacy because it was slightly better, used dtrace to create FlameGraphs and I saw a lot of calls to time counter functions.

In KVM, I ended up enabling hpet but it made it worse: with HPET on and scan_legacy=1, scrubbing my zroot took 2min53sec and with HPET off (FreeBSD would use ACPI-fast instead), it took only 2min19sec.

I got FlameGraphs of that and saw lots of calls to hpet_get_timecount. I then started to read about clock support in KVM and found kvmclock is supposed to be fastest.

That led me to these old patches by Bryan Venteicher and the same thing got submitted again by Stephen J. Kiernan but never progressed.

Adding kvmclock support makes a huge difference (everything I figured out is in this long thread.

The zroot of my KVM machine can be scrubbed in around 10sec instead of 1min35sec (so that's 9.5 faster).

As for my raidz2 pool (it's made of 8 x 4 TB plus cache and log ssds):

  • scrubbing used to take around 23 hours best case (last try took 23:04:17) with kvmclock it's around 12 hours (12:33:11): that's almost twice as fast
  • resilvering one drive took 1 days 10:31:23 and with kvmclock it was around 21 hours (21:21:25): that's 1.7 time faster

This is a simple modification which yields a huge performance boost. The patch applies cleanly on origin/releng/13.0

Regards,
Mathieu

Test Plan

This patch survived on a beefy KVM (16 CPUs with 32 GB of RAM):

  • scrubbing multiples times zroot and a raidz2 at least 3 times now
  • replacing/resilvering 2 drives on my raidz2 pool
  • recompiling the kernel multiple time

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Lint Passed
Unit
No Test Coverage
Build Status
Buildable 38387
Build 35276: arc lint + arc unit

Event Timeline

me_freebsd_mathieu.digital retitled this revision from Added kvmclock to Add kvmclock support to FreeBSD.Apr 1 2021, 12:12 PM
me_freebsd_mathieu.digital edited the summary of this revision. (Show Details)
me_freebsd_mathieu.digital edited the test plan for this revision. (Show Details)
me_freebsd_mathieu.digital edited the summary of this revision. (Show Details)
me_freebsd_mathieu.digital retitled this revision from Add kvmclock support to FreeBSD to Add kvmclock timecounter support to FreeBSD.Apr 1 2021, 12:23 PM

For the records: We are still using the original patch mentioned in D16306 for our FreeBSD 11.4 build without any issues. Although i have to admit that we never did such detailed testing with zpool scrubbing.
It would be nice to see this landing in FreeBSD 13 and might even be back ported to 12.

This looks good, as far as I can tell, modulo minor nits on copyright that I'm asking you to consider.

sys/x86/include/kvm.h
3

2014? 2021 would be more appropriate, imho. Here and below (likewise with all rights reserved comment below)

4

This was the custom at the time, but is no longer. Please consider dropping it.

I just noticed this wasn't submitted by bryanv@, he should make the changes to copyright/license wording (or OK them). If he doesn't then consider this an approval as is if we don't hear from from him in a week or so.

Sounds good, I will wait to hear from him before updating the diff.

My employer has contracted work to bring this driver to a full implementation, including adding VDSO support for kvmclock. Currently, with the kvmclock driver as in the original reviews and this one, user space processes end up needing to do a system call for gettimeofday and clock_gettime. With the VDSO support, the clock will be able to be kept in sync via the shared memory pages and avoid the system call overhead.

That work is actively in progress.

Therefore, we may want to hold off trying to commit something right now.

Also note that there was some issues even with this set of changes for some setups, due to the fact that kvmclock relies on TSC for a reference. If you have an unstable TSC, then there can be issues also with kvmclock.

My employer has contracted work to bring this driver to a full implementation, including adding VDSO support for kvmclock. Currently, with the kvmclock driver as in the original reviews and this one, user space processes end up needing to do a system call for gettimeofday and clock_gettime. With the VDSO support, the clock will be able to be kept in sync via the shared memory pages and avoid the system call overhead.

That work is actively in progress.

Therefore, we may want to hold off trying to commit something right now.

Also note that there was some issues even with this set of changes for some setups, due to the fact that kvmclock relies on TSC for a reference. If you have an unstable TSC, then there can be issues also with kvmclock.

A number of people are reporting problems with zfs scrub w/o this fix that they say is corrected by this, so there's some desirer to have it in the tree. Do you know if these setups are common? Or do you have a notion of why these results have been reported in light of this instability?

In D29531#663896, @imp wrote:

A number of people are reporting problems with zfs scrub w/o this fix that they say is corrected by this, so there's some desirer to have it in the tree. Do you know if these setups are common? Or do you have a notion of why these results have been reported in light of this instability?

It's not so much problems or instabilities with zfs but the fact it seems the standard "ACPI-fast" timecounter makes everything zfs related slow as hell under KVM. Using kvmclock or even TSC-slow gives you at least on order or magnitude more iops.

In my example scrubbing my zroot, TSC-slow makes zfs 21 times faster than while using ACPI-fast (which is picked by default under kvm): from 3min30ish to around 10s... And it's completely reproducible. On a larger raidz2 pool with rusty devices, it's almost twice as fast with kvmclock.

I can even tell the difference just looking at iostat: I get more throughput and iops.

Here's what I replied to someone who asked me if I had tested TSC-slow (https://lists.freebsd.org/pipermail/freebsd-fs/2021-April/028901.html):

On Fri, Apr 02, 2021 at 08:29:43PM +0000, Dave Cottlehuber wrote:
Does the issue also go away if you use TSC-slow ?

I haven't, I can try it out. Yeah TSC-slow is poorly rated (-100) but
scrubbing my zroot is much faster with it compared to ACPI-fast (rated
at 900)..

TSC-slow:
scan: scrub repaired 0B in 00:00:12 with 0 errors on Fri Apr 2 22:37:06 2021
scan: scrub repaired 0B in 00:00:09 with 0 errors on Fri Apr 2 22:41:17 2021
scan: scrub repaired 0B in 00:00:09 with 0 errors on Fri Apr 2 22:41:40 2021

ACPI-fast:
scan: scrub repaired 0B in 00:03:33 with 0 errors on Fri Apr 2 22:40:47 2021
scan: scrub repaired 0B in 00:03:29 with 0 errors on Fri Apr 2 22:46:14 2021
scan: scrub repaired 0B in 00:03:38 with 0 errors on Fri Apr 2 22:49:52 2021

sys/x86/include/kvm.h
3

2014 was when I wrote this that keeping it seems appropriate. I'm fine too with saying there's nothing really copyrightable in this file too.

4

That's fine.

me_freebsd_mathieu.digital edited the summary of this revision. (Show Details)
me_freebsd_mathieu.digital edited the test plan for this revision. (Show Details)
  • Removed headers as requested

That work is actively in progress.

Therefore, we may want to hold off trying to commit something right now.

I updated the diff but clearly if you produce something even better it'd be fantastic.

Also note that there was some issues even with this set of changes for some setups, due to the fact that kvmclock relies on TSC for a reference. If you have an unstable TSC, then there can be issues also with kvmclock.

Given the preferred clock source for Linux VMs under KVM is kvmclock (unless a cpu has "invariant TSC" it seems): "if it works for them, it should work for us" (TM). Plus what Bryan and yourself submitted is between 3 to 6 years old now (July of 2018 and January of 2015): couple of fixes were committed in that regard in Linux since then.

I guess the question is now: what kind of time frame are we looking at? I mean given the huge performance boost this simple and almost non obtrusive patch provides, if we're talking about months before we get the better version, it'd be almost silly not to merge that one now. Specially given it seems people have been fixing their kernels since at least FreeBSD 11.

And as @info_dominicschlegel.ch mentioned, it'd be nice to backport that to 12 as well.

Hello,

@me_freebsd_mathieu.digital, what a CPU model was used for testing?

Intel(R) Xeon(R) CPU E5-2696 v2 @ 2.50GHz emulated as "Intel Xeon E3-12xx v2 (Ivy Bridge, IBRS)" as seen in FreeBSD (<cpu mode='host-model' check='full'/> in libvirt).

In D29531#663896, @imp wrote:

A number of people are reporting problems with zfs scrub w/o this fix that they say is corrected by this, so there's some desirer to have it in the tree. Do you know if these setups are common? Or do you have a notion of why these results have been reported in light of this instability?

Either kvmclock or forcing TSC can work for many. However, if the TSC is being demoted in quality due to failing the quality checks, then there can possibly be issues with kvmclock as well, due to the need for pvclock to use TSC as a reference at times. Note that the TSC typically has taken a VM exit in QEMU (that may not be true any more in newer QEMU, perhaps, but I do not know.)

We had occasional issues with KVM on Wind River Linux with a FreeBSD-based VM guest using kvmclock, where the guest would encounter some occurrences of "calcru: runtime went backwards".
I'm trying to gather more data to see if it was certain combinations of Linux kernel + QEMU version + QEMU settings (which chipset is it saying it is, which chipset is on the host CPU) that may be more susceptible to this. Note these have all Intel based chipsets on the host.

Note: before we built the kvmclock driver into the kernel in our images, we were using it as a kernel module. The only major difference between this patch and the one used for that was adding a MODULE_VERSION() macro to set the module version.

As I mentioned previously, the other thing to consider when using kvmclock in the current patch is the need for a system call for clock_gettime(), gettimeofday(), etc.

When using the TSC, one gets the advantage of timehands structures being updated in the shared memory page, that libc then can use without the need for a system call.
This is also true for HPET (although the HPET is usually a terrible clock source for VM guests) and the reference TSC in Hyper-V deployments.

As I mentioned previously, the other thing to consider when using kvmclock in the current patch is the need for a system call for clock_gettime(), gettimeofday(), etc.

When using the TSC, one gets the advantage of timehands structures being updated in the shared memory page, that libc then can use without the need for a system call.
This is also true for HPET (although the HPET is usually a terrible clock source for VM guests) and the reference TSC in Hyper-V deployments.

It'd be perfect indeed with these system calls added. That said ACPI-fast (the default with KVM) is like the worse performance wise (at least with zfs).

So if we're that unsure of the quality of TSC, we could still have this code in the kernel but with a lower tc_quality so it's present but needs to be enabled manually by the user.

Here's what I get with doing random reads with fio on the device used by zfs (meaning I'm trying to see if the selected timecounter has an impact on the block device layer). Command used was:

fio --filename=/dev/vtbd0 --direct=1 --rw=randread --bs=4k --ioengine=posixaio --iodepth=256 --runtime=120 --numjobs=4 --time_based --group_reporting --name=iops-test-job --eta-newline=1 --readonly

And for reference, here are the available timecounters:

kern.timecounter.choice: i8254(0) ACPI-fast(900) KVMCLOCK(2000) TSC-low(-100) dummy(-1000000)

Results:

kvmclockREAD: bw=239MiB/s (251MB/s), 239MiB/s-239MiB/s (251MB/s-251MB/s), io=28.0GiB (30.1GB), run=120009-120009msec
ACPI-fastREAD: bw=106MiB/s (111MB/s), 106MiB/s-106MiB/s (111MB/s-111MB/s), io=12.4GiB (13.3GB), run=120001-120001msec
i8254READ: bw=103MiB/s (108MB/s), 103MiB/s-103MiB/s (108MB/s-108MB/s), io=12.1GiB (13.0GB), run=120001-120001msec
TSC-lowREAD: bw=317MiB/s (333MB/s), 317MiB/s-317MiB/s (333MB/s-333MB/s), io=37.2GiB (39.9GB), run=120012-120012msec

Clearly, ACPI-fast (which is the default) is at the bottom of the pile: I get less than half the bandwidth from the block device compared to what I get with kvmclock. TSC-low is the best of them all but I assume there's a reason it's rated so low (surprisingly it's better that i8254 which is rated higher).

So the problem is not just zfs specific.

As I mentioned previously, the other thing to consider when using kvmclock in the current patch is the need for a system call for clock_gettime(), gettimeofday(), etc.

As I understand this is about VDSO, but the current patch will improve the work from the kernel side.

So I think it would be nice to commit this patch. Of course, if you can then add the necessary capabilities for VDSO, etc. on its basis.

Is there a spec for kvm clock?

sys/conf/files.x86
338

Can we have an option for this, instead of always adding this timecounter?

sys/x86/x86/kvm_clock.c
102

This is too magic, at least a comment explaining it should be written.

128

What is the role of/needs for this variable?

130

I believe the indent is wrong.

sys/x86/x86/kvm_clock.c
54

It would be nice to use designated initializers here, they are used now for other timecounter definitions.

  • Used designated initializers
  • Added a comment to explain what the magical | 1 means
  • Removed useless variable
me_freebsd_mathieu.digital added inline comments.
sys/x86/x86/kvm_clock.c
128

It was useless indeed, just pass NULL to smp_rendezvous instead.

me_freebsd_mathieu.digital added inline comments.
sys/conf/files.x86
338

Can we have an option for this, instead of always adding this timecounter?

I'll have to brush up on how this whole config thing works before I can make it an option, bear with me...

I think this is how optional features work, let me know if I got it completely wrong. And sadly when arc diff --update really messes up with the notes, not sure if I'm using it wrong...

In addition to the copyright thing mentioned inline, a question: This is for FreeBSD VMs running under Linux Kernel-based Virtual Machine, correct? If that's the case, for clarity, please change the

# KVM support

line in in the GENERIC and NOTES files to something like

# Linux KVM support

There are too many expansions of the "KVM" TLA. :-)

sys/x86/include/kvm.h
3

Please add

* SPDX-License-Identifier: BSD-2-Clause-FreeBSD
*
* Copyright (c) YYYY whoever
*

Right after the /*- line, before the Redistribution line.

That goes for kvm_clock.c as well.

In addition to the copyright thing mentioned inline, a question: This is for FreeBSD VMs running under Linux Kernel-based Virtual Machine, correct?

That is correct, I will make the change.

There are too many expansions of the "KVM" TLA. :-)

Yeah, it's the stupidest TLA they could find I guess...

As for the copyright, it was originally there but got removed after @imp said it was no longer custom.

To whom should I assign the copyright to? I mean most of the thing was written by @bryanv, I just reworked the easy bits based on comments made here...

As for the copyright, it was originally there but got removed after @imp said it was no longer custom.

I think he was questioning the "2014", since it's 2021. 😉 But if that's when @bryanv first wrote this, and it's been languishing since then, then that's fine.

The thing which has fallen out of custom is "All Rights Reserved"; it is no longer necessary to assert copyright.

To whom should I assign the copyright to? I mean most of the thing was written by @bryanv, I just reworked the easy bits based on comments made here...

So then perhaps

* Copyright (c) 2014 Bryan Venteicher <bryanv@freebsd.org>
* Copyright (c) 2021 Mathieu Chouquet-Stringer

Oh fair enough, I'm uploading the newer version of the diff...

Clearly, ACPI-fast (which is the default) is at the bottom of the pile: I get less than half the bandwidth from the block device compared to what I get with kvmclock. TSC-low is the best of them all but I assume there's a reason it's rated so low (surprisingly it's better that i8254 which is rated higher).

Yes, TSC-low (note the "low" part of the name occurs when the TSC frequency is high enough that the value gets shifted down and N number of low bits are discarded) ends up with -100 quality number if it fails the quality checks (see test_tsc() function in sys/x86/x86/tsc.c if you are curious.)

ACPI (slow to read, overflows often) and HPET (can be slow to read, may have high drift, not very fine grained) are usually not great timecounter sources in a VM guest environment.

Although now aged, these papers/websites are fairly good starting points to understand why timekeeping is hard in virtual machines:
https://www.vmware.com/files/pdf/techpaper/Timekeeping-In-VirtualMachines.pdf
http://oliveryang.net/2015/09/pitfalls-of-TSC-usage/#33-tsc-emulation-on-different-hypervisors

As I mentioned previously, my employer has contracted work to bring this driver to a full implementation. An initial draft is under review internally and a Phabricator review is queued up once we give the green light. That is why I suggested holding off on this version of the patch.

As I mentioned previously, my employer has contracted work to bring this driver to a full implementation. An initial draft is under review internally and a Phabricator review is queued up once we give the green light. That is why I suggested holding off on this version of the patch.

Right but as I asked earlier: what kind of time frame are we talking about before you can create a review?

I mean right now we've had a crappy situation for a long time and now, a less than ideal fix. But the less than ideal fix is still better than no fix.

So if you tell us you have no idea how long it'll take before you get the green light, I'd be for merging that one and for your review to update/rewrite it.

If you tell us you'll have something before summer, I'd say let's wait for you.

As for the copyright, it was originally there but got removed after @imp said it was no longer custom.

I think he was questioning the "2014", since it's 2021. 😉 But if that's when @bryanv first wrote this, and it's been languishing since then, then that's fine.

Yes, but he's publishing it now, and there's minor nits corrected, so an update isn't out of line.

The thing which has fallen out of custom is "All Rights Reserved"; it is no longer necessary to assert copyright.

To whom should I assign the copyright to? I mean most of the thing was written by @bryanv, I just reworked the easy bits based on comments made here...

So then perhaps

* Copyright (c) 2014 Bryan Venteicher <bryanv@freebsd.org>
* Copyright (c) 2021 Mathieu Chouquet-Stringer

perhaps, assuming that mathieu did made more than minor tweaks.

This revision is now accepted and ready to land.Apr 9 2021, 1:29 AM

So if you tell us you have no idea how long it'll take before you get the green light, I'd be for merging that one and for your review to update/rewrite it.

If you tell us you'll have something before summer, I'd say let's wait for you.

It should be days, maybe less.

So if you tell us you have no idea how long it'll take before you get the green light, I'd be for merging that one and for your review to update/rewrite it.

If you tell us you'll have something before summer, I'd say let's wait for you.

It should be days, maybe less.

I've just published this work---see D29733.

jhb abandoned this revision.
jhb added a reviewer: me_freebsd_mathieu.digital.

Superseded by D29733.