Page MenuHomeFreeBSD

kvmclock driver with vDSO support
Needs ReviewPublic

Authored by adam_fenn.io on Mon, Apr 12, 6:28 PM.

Details

Summary

This is the work referenced by https://reviews.freebsd.org/D29531#663885, plus changes for the review feedback on D29531 that were relevant here as well (specifically, @kib's commenting input and @rpokala's + @imp's copyright header input).

Test Plan

Thus far, just basic testing on top of QEMU 4.2.1 + kernel 5.8.0 on Ubuntu 20.04 on Intel and AMD HW via:

  • date
  • syscall_timing
  • sysctls debug.clock_show_io, debug.clock_do_io, kern.timecounter.hardware, and kern.timecounter.fast_gettime
  • Using dtrace to sanity-check that sys_gettimeofday()/sys_clock_gettime() were/weren't being called as expected.
  • zzz(8) plus the QEMU system_powerdown and system_wakeup commands (for test-driving the suspend/resume entry points).

I also tried the sysbench invocation in https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=216759#c11 and can confirm similar differences between using the HPET and using the kvmclock or TSC.

An example including some timing info:

user@test:~ % sysctl kern.timecounter
kern.timecounter.tsc_shift: 1
kern.timecounter.smp_tsc_adjust: 0
kern.timecounter.smp_tsc: 0
kern.timecounter.invariant_tsc: 0
kern.timecounter.fast_gettime: 1
kern.timecounter.tick: 1
kern.timecounter.choice: i8254(0) ACPI-fast(900) HPET(950) kvmclock(975) TSC(-100) dummy(-1000000)
kern.timecounter.hardware: kvmclock
kern.timecounter.alloweddeviation: 5
kern.timecounter.timehands_count: 2
kern.timecounter.stepwarnings: 0
kern.timecounter.tc.i8254.quality: 0
kern.timecounter.tc.i8254.frequency: 1193182
kern.timecounter.tc.i8254.counter: 59794
kern.timecounter.tc.i8254.mask: 65535
kern.timecounter.tc.ACPI-fast.quality: 900
kern.timecounter.tc.ACPI-fast.frequency: 3579545
kern.timecounter.tc.ACPI-fast.counter: 3543295
kern.timecounter.tc.ACPI-fast.mask: 16777215
kern.timecounter.tc.HPET.quality: 950
kern.timecounter.tc.HPET.frequency: 100000000
kern.timecounter.tc.HPET.counter: 1966409884
kern.timecounter.tc.HPET.mask: 4294967295
kern.timecounter.tc.kvmclock.quality: 975
kern.timecounter.tc.kvmclock.frequency: 1000000000
kern.timecounter.tc.kvmclock.counter: 3373140415
kern.timecounter.tc.kvmclock.mask: 4294967295
kern.timecounter.tc.TSC.quality: -100
kern.timecounter.tc.TSC.frequency: 1607988965
kern.timecounter.tc.TSC.counter: 513410333
kern.timecounter.tc.TSC.mask: 4294967295
user@test:~ % syscall_timing gettimeofday
Clock resolution: 0.000000002
test    loop    time    iterations      periteration
gettimeofday    0       1.000971171     24698568        0.000000040
gettimeofday    1       1.007769518     24866552        0.000000040
gettimeofday    2       1.001071247     24700412        0.000000040
gettimeofday    3       1.006630289     24832878        0.000000040
gettimeofday    4       1.003202891     24769714        0.000000040
gettimeofday    5       1.004513069     24790162        0.000000040
gettimeofday    6       1.005332074     24833956        0.000000040
gettimeofday    7       1.002384739     24735470        0.000000040
gettimeofday    8       1.007455964     24884169        0.000000040
gettimeofday    9       1.000256110     24708625        0.000000040
user@test:~ % sudo sysctl kern.timecounter.fast_gettime=0
kern.timecounter.fast_gettime: 1 -> 0
user@test:~ % syscall_timing gettimeofday
Clock resolution: 0.000000002
test    loop    time    iterations      periteration
gettimeofday    0       1.062566203     8546464 0.000000124
gettimeofday    1       1.037318670     8305737 0.000000124
gettimeofday    2       1.050000947     8410842 0.000000124
gettimeofday    3       1.049828515     8405755 0.000000124
gettimeofday    4       1.062561943     8522758 0.000000124
gettimeofday    5       1.037252041     8295647 0.000000125
gettimeofday    6       1.009902776     8103514 0.000000124
gettimeofday    7       1.029863208     8271323 0.000000124
gettimeofday    8       1.059883267     8462669 0.000000125
gettimeofday    9       1.009913686     8092699 0.000000124
user@test:~ % sudo sysctl kern.timecounter.fast_gettime=1
kern.timecounter.fast_gettime: 0 -> 1
user@test:~ % sudo sysctl kern.timecounter.hardware=TSC
kern.timecounter.hardware: kvmclock -> TSC
user@test:~ % syscall_timing gettimeofday
Clock resolution: 0.000000001
test    loop    time    iterations      periteration
gettimeofday    0       1.039693055     39156300        0.000000026
gettimeofday    1       1.059332059     39727049        0.000000026
gettimeofday    2       1.040353569     39121632        0.000000026
gettimeofday    3       1.050123063     39481224        0.000000026
gettimeofday    4       1.049670668     39530649        0.000000026
gettimeofday    5       1.062567179     39960000        0.000000026
gettimeofday    6       1.037282737     38908035        0.000000026
gettimeofday    7       1.062564063     39918185        0.000000026
gettimeofday    8       1.037147904     39045152        0.000000026
gettimeofday    9       1.009909895     38041308        0.000000026
user@test:~ % sudo sysctl kern.timecounter.fast_gettime=0
kern.timecounter.fast_gettime: 1 -> 0
user@test:~ % syscall_timing gettimeofday
Clock resolution: 0.000000001
test    loop    time    iterations      periteration
gettimeofday    0       1.000343476     9173210 0.000000109
gettimeofday    1       1.007733227     9233248 0.000000109
gettimeofday    2       1.051675824     9637613 0.000000109
gettimeofday    3       1.003205497     9195701 0.000000109
gettimeofday    4       1.006564792     9226628 0.000000109
gettimeofday    5       1.001095929     9169797 0.000000109
gettimeofday    6       1.007764181     9224751 0.000000109
gettimeofday    7       1.000836593     9169834 0.000000109
gettimeofday    8       1.006850768     9220051 0.000000109
gettimeofday    9       1.002965937     9190431 0.000000109
user@test:~ % sudo sysctl kern.timecounter.fast_gettime=1
kern.timecounter.fast_gettime: 0 -> 1
user@test:~ % sudo sysctl kern.timecounter.hardware=HPET
kern.timecounter.hardware: TSC -> HPET
user@test:~ % syscall_timing gettimeofday
Clock resolution: 0.000000011
test    loop    time    iterations      periteration
gettimeofday    0       1.001976650     361817  0.000002769
gettimeofday    1       1.007771340     363755  0.000002770
gettimeofday    2       1.007759790     364436  0.000002765
gettimeofday    3       1.002167610     361223  0.000002774
gettimeofday    4       1.005514150     363170  0.000002768
gettimeofday    5       1.004283770     363254  0.000002764
gettimeofday    6       1.003406260     361886  0.000002772
gettimeofday    7       1.006412230     362752  0.000002774
gettimeofday    8       1.049863580     378690  0.000002772
gettimeofday    9       1.006341930     363179  0.000002770
user@test:~ % sudo sysctl kern.timecounter.fast_gettime=0
kern.timecounter.fast_gettime: 1 -> 0
user@test:~ % syscall_timing gettimeofday
Clock resolution: 0.000000011
test    loop    time    iterations      periteration
gettimeofday    0       1.005940780     339403  0.000002963
gettimeofday    1       1.062578460     359351  0.000002956
gettimeofday    2       1.037196880     350767  0.000002956
gettimeofday    3       1.062579500     359486  0.000002955
gettimeofday    4       1.037203790     351224  0.000002953
gettimeofday    5       1.009921970     342757  0.000002946
gettimeofday    6       1.029915260     334432  0.000003079
gettimeofday    7       1.059815440     345772  0.000003065
gettimeofday    8       1.009940770     329601  0.000003064
gettimeofday    9       1.062575370     347237  0.000003060
user@test:~ %

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Lint OK
Unit
No Unit Test Coverage
Build Status
Buildable 38951
Build 35840: arc lint + arc unit

Event Timeline

adam_fenn.io edited the test plan for this revision. (Show Details)

Hello,

I applied the patch to my source tree and while compiling and using the kernel works perfectly, compiling libc failed:

--- __vdso_gettc.po ---
/usr/src/lib/libc/x86/sys/__vdso_gettc.c:446:22: error: expected ')'
                fd = _open("/dev/" PVCLOCK_CDEVNAME, O_RDONLY);
                                   ^
/usr/src/lib/libc/x86/sys/__vdso_gettc.c:446:13: note: to match this '('
                fd = _open("/dev/" PVCLOCK_CDEVNAME, O_RDONLY);
                          ^
/usr/src/lib/libc/x86/sys/__vdso_gettc.c:492:7: error: use of undeclared identifier 'VDSO_TH_ALGO_X86_PVCLK'
        case VDSO_TH_ALGO_X86_PVCLK:
             ^
2 errors generated.
*** [__vdso_gettc.po] Error code 1

Looks like it includes /usr/include/x86/{pvclock.h,vdso.h} but not the newly modified files /usr/src/sys/x86/include/{pvclock.h,vdso.h}.

What am I doing wrong here?

[...]
I applied the patch to my source tree and while compiling and using the kernel works perfectly, compiling libc failed:

Thanks a lot for helping to test-drive this!

--- __vdso_gettc.po ---
/usr/src/lib/libc/x86/sys/__vdso_gettc.c:446:22: error: expected ')'
                fd = _open("/dev/" PVCLOCK_CDEVNAME, O_RDONLY);
                                   ^
/usr/src/lib/libc/x86/sys/__vdso_gettc.c:446:13: note: to match this '('
                fd = _open("/dev/" PVCLOCK_CDEVNAME, O_RDONLY);
                          ^
/usr/src/lib/libc/x86/sys/__vdso_gettc.c:492:7: error: use of undeclared identifier 'VDSO_TH_ALGO_X86_PVCLK'
        case VDSO_TH_ALGO_X86_PVCLK:
             ^
2 errors generated.
*** [__vdso_gettc.po] Error code 1

Looks like it includes /usr/include/x86/{pvclock.h,vdso.h} but not the newly modified files /usr/src/sys/x86/include/{pvclock.h,vdso.h}.

What am I doing wrong here?

How are you building libc?

My expectation is that what you're seeing is build-setup-specific.

I typically build from a tree outside of /usr/src, adjusting the appropriate build(7) tunables for my setup/workflow. But, in an attempt to repro something close to your setup, I just also tried a vanilla handbook-style build+install out of /usr/src with this patch on a fresh VM. This build+install succeeded and its kernel+libc appeared to contain working kvmclock+vDSO functionality. I hope this is a helpful data point?

Thanks a lot for helping to test-drive this!

No problem, thanks for having written and shared the whole thing!

Looks like it was just human error... :-D I don't have much specific to my src.conf or make.conf. I was able to build it and it works great.

Here are the difference between fast and slow gettime using kvmclock:

Clock resolution: 0.000000002
test    loop    time    iterations      periteration
gettimeofday    0       1.027505076     20731674        0.000000049
gettimeofday    1       1.009867116     20045145        0.000000050
gettimeofday    2       1.062553539     21441734        0.000000049
gettimeofday    3       1.027334294     21336504        0.000000048
gettimeofday    4       1.010107762     21046278        0.000000047
gettimeofday    5       1.062555510     22715537        0.000000046
gettimeofday    6       1.027237924     21549121        0.000000047
gettimeofday    7       1.010118686     21905685        0.000000046
gettimeofday    8       1.062564009     22368031        0.000000047
gettimeofday    9       1.004777479     19896091        0.000000050


Clock resolution: 0.000000002
test    loop    time    iterations      periteration
gettimeofday    0       1.000889430     3002010 0.000000333
gettimeofday    1       1.009968789     3044332 0.000000331
gettimeofday    2       1.062557496     3202531 0.000000331
gettimeofday    3       1.027428955     2936040 0.000000349
gettimeofday    4       1.062665511     3059360 0.000000347
gettimeofday    5       1.037124814     2917255 0.000000355
gettimeofday    6       1.062592107     2730658 0.000000389
gettimeofday    7       1.037284758     2955163 0.000000351
gettimeofday    8       1.023330766     2823287 0.000000362
gettimeofday    9       1.054137086     3063078 0.000000344

That makes fast gettime between 6 to 8 times faster.

For the record, here's what I get with ACPI-fast (which would have been picked by the kernel if kvmclock wasn't present, I don't have HPET enabled because it's actually worse), it's around 15/18 times slower than the slow gettime using kvmclock, so around 125 slower than the fast gettime using kvmclock...

Clock resolution: 0.000000280
test    loop    time    iterations      periteration
gettimeofday    0       1.015938699     172802  0.000005879
gettimeofday    1       1.062563516     173407  0.000006127
gettimeofday    2       1.021390315     161050  0.000006342
gettimeofday    3       1.062570500     179625  0.000005915
gettimeofday    4       1.037464119     176640  0.000005873
gettimeofday    5       1.062701244     164911  0.000006444
gettimeofday    6       1.037006791     173690  0.000005970
gettimeofday    7       1.062596481     179243  0.000005928
gettimeofday    8       1.037335330     181999  0.000005699
gettimeofday    9       1.062615478     181789  0.000005845

Clock resolution: 0.000000280
test    loop    time    iterations      periteration
gettimeofday    0       1.029130248     170530  0.000006034
gettimeofday    1       1.062564912     180038  0.000005901
gettimeofday    2       1.037346505     184685  0.000005616
gettimeofday    3       1.062615198     183068  0.000005804
gettimeofday    4       1.037335609     182857  0.000005672
gettimeofday    5       1.062568822     178510  0.000005952
gettimeofday    6       1.037444284     175730  0.000005903
gettimeofday    7       1.062656265     172583  0.000006157
gettimeofday    8       1.037069370     173821  0.000005966
gettimeofday    9       1.062579718     181524  0.000005853

All of these are for a KVM vm running under qemu version 5.1.0 on Fedora 33 kernel 5.11.12.

adam_fenn.io retitled this revision from WIP: kvmclock driver with vDSO support to kvmclock driver with vDSO support.Tue, Apr 20, 8:38 PM
adam_fenn.io edited the summary of this revision. (Show Details)

[...]
All of these are for a KVM vm running under qemu version 5.1.0 on Fedora 33 kernel 5.11.12.

Thanks a lot for doing this and providing this data/feedback, @me_freebsd_mathieu.digital!

I'm continuing my own testing, but, I think have reached a point where this is ready for exposure to more real-world test setups like @me_freebsd_mathieu.digital's. So, I've removed the WIP label to encourage additional testing/review/feedback.

Thanks in advance to anyone willing to review this and/or provide any feedback after taking it for a spin!

Thanks in advance to anyone willing to review this and/or provide any feedback after taking it for a spin!

I'm going to recompile world/kernel with the new version of your patch, it's a good stress test and so far it's been rock solid so you have my vote!

LGTM, just one request

sys/dev/kvm_clock/kvm_clock.c
308

Could the cdev creation be abstracted away, so that you only need to provide it a pointer to a struct pvclock_vcpu_time_info to use?

There's no KVM specific logic in there, as it's only purpose is to map the hypervisor agnostic structure into user-space.

sys/dev/kvm_clock/kvm_clock.c
273

Sorry, forgot to comment: why do you need to map all the vCPU time infos in the pager? user-space should only use one of those (ie: from vCPU#0?) which seems to be already what the vDSO implementation does, but the pvclock device would expose all time infos which seems unnecessary.

Might be easier to just use d_mmap here, and only allow mapping the first page (if there are multiple ones) that contains the pvclock mapped?

See HyperV hyperv_tsc_mma which I think it's slightly easier as it doesn't use a pager.

@dim did some testing:

--- sysbench-acpi-fast.txt	2021-04-28 20:40:05.529460000 +0200
+++ sysbench-kvmclock.txt	2021-04-28 20:33:52.083953000 +0200
@@ -21,28 +21,28 @@

 Done.

-Total operations: 524288 (66548.67 per second)
+Total operations: 524288 (2142666.15 per second)

-512.00 MiB transferred (64.99 MiB/sec)
+512.00 MiB transferred (2092.45 MiB/sec)


 General statistics:
-    total time:                          7.8730s
+    total time:                          0.2397s
     total number of events:              524288

 Latency (ms):
          min:                                    0.00
-         avg:                                    0.01
-         max:                                    0.85
-         95th percentile:                        0.01
-         sum:                                 2644.91
+         avg:                                    0.00
+         max:                                    0.03
+         95th percentile:                        0.00
+         sum:                                  101.93

 Threads fairness:
     events (avg/stddev):           524288.0000/0.00
-    execution time (avg/stddev):   2.6449/0.00
+    execution time (avg/stddev):   0.1019/0.00

 DEBUG: Verbose per-thread statistics:

-DEBUG:     thread #  0: min: 0.0000s  avg: 0.0000s  max: 0.0008s  events: 524288
-DEBUG:                  total time taken by event execution: 2.6449s
+DEBUG:     thread #  0: min: 0.0000s  avg: 0.0000s  max: 0.0000s  events: 524288
+DEBUG:                  total time taken by event execution: 0.1019s

Address @royger's review feedback:

  • Switch to using the device pager and only allowing one page to be mmaped by /dev/pvclock.
  • Attempt to identify and push more of the hypervisor-agnostic content into the hypervisor-agnostic pvclock.{c,h} layer.

Hi @royger,

Thanks a lot for your review and feedback; one of the goals is to make it as easy as possible for the XEN PV clock code to piggyback on the vDSO portion of this work, so, your input is greatly appreciated!

As such, and, as per

Could the cdev creation be abstracted away, so that you only need to provide it a pointer to a struct pvclock_vcpu_time_info to use?

There's no KVM specific logic in there, as it's only purpose is to map the hypervisor agnostic structure into user-space.

...in this latest revision, I've attempted to identify and push more of the hypervisor-agnostic content into the hypervisor-agnistic pvclock.{c,h} layer.

Would the way things are abstracted in this version work well for XEN or what further changes would bring it to the point of being cleanly re-usable by XEN? Any other hypervisor-agnostic content I might have missed?

My expectation/understanding is that XEN could:

  • (1) Use VCPUOP_register_vcpu_time_memory_area to create a vDSO-ready vCPU 0 struct pvclock_vcpu_time_info page that it could pass as parameter ti_vcpu0_page of pvclock_softc_init().
  • (2) Share pvclock_tc_get_timecount() with KVM clock.

One point regarding (2)---I noticed that xentimer_get_timecount() deliberately does not disable preemption around xen_fetch_vcpu_time(DPCPU_GET(vcpu_info)), independent of the current state of the PVCLOCK_FLAG_TSC_STABLE flag. My current understanding is that, unless PVCLOCK_FLAG_TSC_STABLE is set, staying on the same vCPU throughout this operation is necessary for trying to ensure that (a) the contents of the current vCPU's struct pvclock_vcpu_time_info and (b) the TSC value that is read and applied to (a) to produce the final timestamp, were both derived from the same underlying physical TSC. And, if this is not the case, in the presence of an "unstable TSC," the chances of seeing non-monotonic/inaccurate timestamps would increase.

What am I missing? I'd thought this would be applicable to the pvclock API and be hypervisor-agnostic, but maybe there's something XEN-specific here?

Thanks again for your input---hope this work ultimately benefits XEN guest support just as much!

pvclock_softc_init feels slightly weird, as it's not a real softc, same with pvclock_attach or detach, as they are not called from the typical bus functions (ie: there's no identify/probe for example, and neither are part of a device_method_t struct). I would just use pvclock_private, and have a single pvclock_init/pvclock_destroy set of helpers. Attempting to make them look like DEVMETHOD is IMO confusing.

And yes, I think the Xen timer is missing a preempt disable around the rdtsc fetch and the vcpu_time_info processing, to assert they are from the same vCPU. In practice I think it's very unlikely to be preempted at that point, but does indeed need fixing.

sys/dev/kvm_clock/kvm_clock.c
152

I don't think you need to strictly use contigmalloc here, as each pCPU will use vtophys to get the physical address of it's vcpu_time_info area?

As the size of a vcpu_time_info is 32bytes, so it should never cross a page boundary provided the first element is properly aligned.

lib/libc/x86/sys/__vdso_gettc.c
347

I wonder whether it would be possible to place the scale_delta helper in the pvclock header and have a single implementation shared between the kernel and libc, having duplicated copies of this helper is cumbersome.