Details

Reviewers

kib
markj
stevek
royger

Commits

rG6fa88a627d5e: kvm_clock: KVM paravirtual clock support
rG6c69c6bb4c7f: kvm_clock: KVM paravirtual clock support

Summary

This is the work referenced by https://reviews.freebsd.org/D29531#663885, plus changes for the review feedback on D29531 that were relevant here as well (specifically, @kib's commenting input and @rpokala's + @imp's copyright header input).

Test Plan

Thus far, just basic testing on top of QEMU 4.2.1 + kernel 5.8.0 on Ubuntu 20.04 on Intel and AMD HW via:

date
syscall_timing
sysctls debug.clock_show_io, debug.clock_do_io, kern.timecounter.hardware, and kern.timecounter.fast_gettime
Using dtrace to sanity-check that sys_gettimeofday()/sys_clock_gettime() were/weren't being called as expected.
zzz(8) plus the QEMU system_powerdown and system_wakeup commands (for test-driving the suspend/resume entry points).

I also tried the sysbench invocation in https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=216759#c11 and can confirm similar differences between using the HPET and using the kvmclock or TSC.

An example including some timing info:

user@test:~ % sysctl kern.timecounter
kern.timecounter.tsc_shift: 1
kern.timecounter.smp_tsc_adjust: 0
kern.timecounter.smp_tsc: 0
kern.timecounter.invariant_tsc: 0
kern.timecounter.fast_gettime: 1
kern.timecounter.tick: 1
kern.timecounter.choice: i8254(0) ACPI-fast(900) HPET(950) kvmclock(975) TSC(-100) dummy(-1000000)
kern.timecounter.hardware: kvmclock
kern.timecounter.alloweddeviation: 5
kern.timecounter.timehands_count: 2
kern.timecounter.stepwarnings: 0
kern.timecounter.tc.i8254.quality: 0
kern.timecounter.tc.i8254.frequency: 1193182
kern.timecounter.tc.i8254.counter: 59794
kern.timecounter.tc.i8254.mask: 65535
kern.timecounter.tc.ACPI-fast.quality: 900
kern.timecounter.tc.ACPI-fast.frequency: 3579545
kern.timecounter.tc.ACPI-fast.counter: 3543295
kern.timecounter.tc.ACPI-fast.mask: 16777215
kern.timecounter.tc.HPET.quality: 950
kern.timecounter.tc.HPET.frequency: 100000000
kern.timecounter.tc.HPET.counter: 1966409884
kern.timecounter.tc.HPET.mask: 4294967295
kern.timecounter.tc.kvmclock.quality: 975
kern.timecounter.tc.kvmclock.frequency: 1000000000
kern.timecounter.tc.kvmclock.counter: 3373140415
kern.timecounter.tc.kvmclock.mask: 4294967295
kern.timecounter.tc.TSC.quality: -100
kern.timecounter.tc.TSC.frequency: 1607988965
kern.timecounter.tc.TSC.counter: 513410333
kern.timecounter.tc.TSC.mask: 4294967295
user@test:~ % syscall_timing gettimeofday
Clock resolution: 0.000000002
test    loop    time    iterations      periteration
gettimeofday    0       1.000971171     24698568        0.000000040
gettimeofday    1       1.007769518     24866552        0.000000040
gettimeofday    2       1.001071247     24700412        0.000000040
gettimeofday    3       1.006630289     24832878        0.000000040
gettimeofday    4       1.003202891     24769714        0.000000040
gettimeofday    5       1.004513069     24790162        0.000000040
gettimeofday    6       1.005332074     24833956        0.000000040
gettimeofday    7       1.002384739     24735470        0.000000040
gettimeofday    8       1.007455964     24884169        0.000000040
gettimeofday    9       1.000256110     24708625        0.000000040
user@test:~ % sudo sysctl kern.timecounter.fast_gettime=0
kern.timecounter.fast_gettime: 1 -> 0
user@test:~ % syscall_timing gettimeofday
Clock resolution: 0.000000002
test    loop    time    iterations      periteration
gettimeofday    0       1.062566203     8546464 0.000000124
gettimeofday    1       1.037318670     8305737 0.000000124
gettimeofday    2       1.050000947     8410842 0.000000124
gettimeofday    3       1.049828515     8405755 0.000000124
gettimeofday    4       1.062561943     8522758 0.000000124
gettimeofday    5       1.037252041     8295647 0.000000125
gettimeofday    6       1.009902776     8103514 0.000000124
gettimeofday    7       1.029863208     8271323 0.000000124
gettimeofday    8       1.059883267     8462669 0.000000125
gettimeofday    9       1.009913686     8092699 0.000000124
user@test:~ % sudo sysctl kern.timecounter.fast_gettime=1
kern.timecounter.fast_gettime: 0 -> 1
user@test:~ % sudo sysctl kern.timecounter.hardware=TSC
kern.timecounter.hardware: kvmclock -> TSC
user@test:~ % syscall_timing gettimeofday
Clock resolution: 0.000000001
test    loop    time    iterations      periteration
gettimeofday    0       1.039693055     39156300        0.000000026
gettimeofday    1       1.059332059     39727049        0.000000026
gettimeofday    2       1.040353569     39121632        0.000000026
gettimeofday    3       1.050123063     39481224        0.000000026
gettimeofday    4       1.049670668     39530649        0.000000026
gettimeofday    5       1.062567179     39960000        0.000000026
gettimeofday    6       1.037282737     38908035        0.000000026
gettimeofday    7       1.062564063     39918185        0.000000026
gettimeofday    8       1.037147904     39045152        0.000000026
gettimeofday    9       1.009909895     38041308        0.000000026
user@test:~ % sudo sysctl kern.timecounter.fast_gettime=0
kern.timecounter.fast_gettime: 1 -> 0
user@test:~ % syscall_timing gettimeofday
Clock resolution: 0.000000001
test    loop    time    iterations      periteration
gettimeofday    0       1.000343476     9173210 0.000000109
gettimeofday    1       1.007733227     9233248 0.000000109
gettimeofday    2       1.051675824     9637613 0.000000109
gettimeofday    3       1.003205497     9195701 0.000000109
gettimeofday    4       1.006564792     9226628 0.000000109
gettimeofday    5       1.001095929     9169797 0.000000109
gettimeofday    6       1.007764181     9224751 0.000000109
gettimeofday    7       1.000836593     9169834 0.000000109
gettimeofday    8       1.006850768     9220051 0.000000109
gettimeofday    9       1.002965937     9190431 0.000000109
user@test:~ % sudo sysctl kern.timecounter.fast_gettime=1
kern.timecounter.fast_gettime: 0 -> 1
user@test:~ % sudo sysctl kern.timecounter.hardware=HPET
kern.timecounter.hardware: TSC -> HPET
user@test:~ % syscall_timing gettimeofday
Clock resolution: 0.000000011
test    loop    time    iterations      periteration
gettimeofday    0       1.001976650     361817  0.000002769
gettimeofday    1       1.007771340     363755  0.000002770
gettimeofday    2       1.007759790     364436  0.000002765
gettimeofday    3       1.002167610     361223  0.000002774
gettimeofday    4       1.005514150     363170  0.000002768
gettimeofday    5       1.004283770     363254  0.000002764
gettimeofday    6       1.003406260     361886  0.000002772
gettimeofday    7       1.006412230     362752  0.000002774
gettimeofday    8       1.049863580     378690  0.000002772
gettimeofday    9       1.006341930     363179  0.000002770
user@test:~ % sudo sysctl kern.timecounter.fast_gettime=0
kern.timecounter.fast_gettime: 1 -> 0
user@test:~ % syscall_timing gettimeofday
Clock resolution: 0.000000011
test    loop    time    iterations      periteration
gettimeofday    0       1.005940780     339403  0.000002963
gettimeofday    1       1.062578460     359351  0.000002956
gettimeofday    2       1.037196880     350767  0.000002956
gettimeofday    3       1.062579500     359486  0.000002955
gettimeofday    4       1.037203790     351224  0.000002953
gettimeofday    5       1.009921970     342757  0.000002946
gettimeofday    6       1.029915260     334432  0.000003079
gettimeofday    7       1.059815440     345772  0.000003065
gettimeofday    8       1.009940770     329601  0.000003064
gettimeofday    9       1.062575370     347237  0.000003060
user@test:~ %

Diff Detail

Repository

rG FreeBSD src repository

Lint

Lint Not Applicable

Unit

Tests Not Applicable

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

In D29733#691518, @dch wrote:

shall I just drop 11.* from the port, or do you want to make a patch for x86 & x64 versions?

Perhaps we drop 11.x from the port what with 11.x soon to be EOL'ed---does that seem reasonable to you, too?

Thanks and sorry for the delay! Mostly nits, I think the only major question is whether you need to use the dma system to allocate the memory for the vcpu time infos. I think it would be fine to just use plain malloc.

lib/libc/x86/sys/__vdso_gettc.c
104 ↗	(On Diff #89085)	I think the inline here is IMO misleading, as you end up taking a pointer to the function when filling the tsc_selector struct, so the inline is just dropped.
358 ↗	(On Diff #89085)	I think it would be enough to just say that the TSC_STABLE is mandatory because TSCs must be synchronized across CPUs or else using vcpu time info of CPU 0 on for all CPUs would be wrong.
sys/dev/kvm_clock/kvm_clock.c
151	Do you really need to use the dma subsystem for this? Isn't it enough to just malloc a region of memory and use it? You could then also use M_WAITOK maybe?
sys/x86/x86/pvclock.c
174 ↗	(On Diff #89085)	I think this should return EACCES as the file only allows opening in read-only mode.
223 ↗	(On Diff #89085)	Couldn't vDSO became functional at some later point (ie: when migrated to a different host) even if the current vcpu_time_info reports non-stable? I wonder whether we should just check if the stable flag is supported instead of also checking if it's currently set.
252 ↗	(On Diff #89085)	Given the list of arguments here I think we should consider creating a structure and passing a pointer to it, or rather fill some of the pvclock struct fields by the caller? You could split the fields of pvclock between the public ones to be filled by the caller and the private ones.
258 ↗	(On Diff #89085)	Nit: I usually try to avoid splitting such log messages because grepping for them afterwards is not possible. I would rather try to place the message starting on the newline as you won't have to split it then.
308 ↗	(On Diff #89085)	Given the pvclock has a resolution of 1ns and that's fixed I think it would be fine to just hardcode this as 1us. It's not like PVCLOCK_RESOLUTION_NS can or will be modified anyway?

Address @royger's review feedback.

Harbormaster completed remote builds in B40153: Diff 91478.Jun 28 2021, 10:46 PM

Thanks a lot for this latest round of feedback, @royger! Revised accordingly and/or discussion inline.

lib/libc/x86/sys/__vdso_gettc.c
358 ↗	(On Diff #89085)	I decided to just remove this comment altogether. It was basically just attempting to point out how this flag can theoretically change at any time, which would explain how we could arrive at this location from `binuptime()` even though `binuptime()`, indirectly via `tk_enabled`, just observed `PVCLOCK_FLAG_TSC_STABLE` as set. But I don't think it's as subtle of a point as I must have when I originally wrote this, and I can see how this comment might confuse if `__vdso_pvclock_tsc()` is read separately from `binuptime()`.
sys/dev/kvm_clock/kvm_clock.c
151	Do you really need to use the dma subsystem for this? Isn't it enough to just malloc a region of memory and use it? I'd decided against `malloc()` since it doesn't provide API-level alignment guarantees; while, implementation-wise, it does look like >= `PAGE_SIZE`-ed `malloc(9)` allocations will be page-aligned, my read of `malloc(9)` (particularly section `IMPLEMENTATION NOTES`) is that this should not be assumed by callers. (FWIW, I had also noted (1) the precedent of hyperv's use of `BUS_DMA(9)` for its corresponding PV clock in-memory data and (2) internally, this draft's use of `BUS_DMA(9)` will ultimately do the same thing that the first draft was doing with its direct use of the physical memory pager). Just recapping, the intention is that `timeinfos` be readily `mmap`-able---the allocation containing `timeinfos` should start and end on a page boundary (with any unused portion of the last page zeroed). Does this choice make sense or am I missing something? You could then also use M_WAITOK maybe? This choice was considering use cases where the driver is loaded during boot (whether compiled in or via `loader.conf(5)`). My thinking was that a failure to satisfy a relatively small allocation like this during boot would be an indication of far greater problems and, as such, it wouldn't be worth adding any blocking+reclaims+retries to the boot process, especially since the system can still function without kvmclock. But I suppose it's also ok to allow the reclaims+retries, so, I've switched `bus_dmamem_alloc()` from `BUS_DMA_NOWAIT` to `BUS_DMA_WAITOK` in this version as per this suggestion.
sys/x86/x86/pvclock.c
223 ↗	(On Diff #89085)	Right. But `pvclock_tc_vdso_timehands{,32}()` is called with each vDSO timehands update, so, such scenarios should be supported, no? I think the `(pvc->ti_vcpu0_page->flags & PVCLOCK_FLAG_TSC_STABLE) != 0` clause could, indeed, be dropped. But I included it thinking that, in the unstable case, we might as well short-circuit the vDSO codepath sooner than later---with this clause, the vDSO codepath decides to fall back to the syscall codepath after it looks at `tk->tk_enabled` in `binuptime()` whereas, without this clause, this decision will happen after the `PVCLOCK_FLAG_TSC_STABLE` flag check in `__vdso_pvclock_tsc()`. I obtained a rough measurement of the delta between these two versions of the unstable TSC codepath by inverting the `PVCLOCK_FLAG_TSC_STABLE` check in `__vdso_pvclock_tsc()` and then looking at `syscall_timing gettimeofday` numbers for (1) `kern.timecounter.fast_gettime=0` and (2) `kern.timecounter.fast_gettime=1`. I figure, on my `PVCLOCK_FLAG_TSC_STABLE` test systems, (1) should roughly simulate the version where the code is left as-is and (2) should roughly simulate the version where this clause is dropped. These `syscall_timing gettimeofday` numbers from (1) to (2) regressed from 116ns -> 146ns = +30ns on my AMD HW and 119ns -> 141ns = +22ns on my Intel HW. So, potentially non-negligible and makes sense to keep this clause in place (or achieve the same outcome in some other way)? What do you think?

adam_fenn.io marked 3 inline comments as done.Jun 29 2021, 4:03 PM

It looks like we need this for EC2 "T3" and "T3A" family instances, where apparently it's possible for a VM to bounce between CPUs running at different clock speeds(!!!) and the TSC-low timecounter is running into problems (https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=256781).

(I don't have anything to contribute to reviewing the change, just wanted to flag the EC2 environment as one where this is relevant.)

Just replied to the most controversial part I think, will try to review the rest tomorrow. Thanks!

sys/dev/kvm_clock/kvm_clock.c
151	Do you really need to use the dma subsystem for this? Isn't it enough to just malloc a region of memory and use it? I'd decided against `malloc()` since it doesn't provide API-level alignment guarantees; while, implementation-wise, it does look like >= `PAGE_SIZE`-ed `malloc(9)` allocations will be page-aligned, my read of `malloc(9)` (particularly section `IMPLEMENTATION NOTES`) is that this should not be assumed by callers. Right, TBH I think malloc returning and address not aligned to a page boundary when allocating a region >= PAGE_SIZE would likely be a bug in the implementation. (FWIW, I had also noted (1) the precedent of hyperv's use of `BUS_DMA(9)` for its corresponding PV clock in-memory data and (2) internally, this draft's use of `BUS_DMA(9)` will ultimately do the same thing that the first draft was doing with its direct use of the physical memory pager). Just recapping, the intention is that `timeinfos` be readily `mmap`-able---the allocation containing `timeinfos` should start and end on a page boundary (with any unused portion of the last page zeroed). Does this choice make sense or am I missing something? Well, I think it's overly complicated (and cumbersome due to all the initialization involved) to use bus_dma just to allocate a memory region that's page aligned. I'm sure there are already instances in-kernel that assume that malloc(PAGE_SIZE) returns a page. I wouldn't be opposed to using: sc->timeinfos = malloc(size, ...); if ((sc->timeinfos & PAGE_MASK) != 0) { device_printf(..., "failed to allocate aligned memory region"); KASSERT(0, ("Failed to allocate page aligned region")); return (ENOMEM); } But I understand others might have different opinions. Maybe @kib could provide some insight on whether there's another way to allocate aligned memory regions without having to resort to bus_dma, I don't have that much insight on the VM subsystem. You could then also use M_WAITOK maybe? This choice was considering use cases where the driver is loaded during boot (whether compiled in or via `loader.conf(5)`). My thinking was that a failure to satisfy a relatively small allocation like this during boot would be an indication of far greater problems and, as such, it wouldn't be worth adding any blocking+reclaims+retries to the boot process, especially since the system can still function without kvmclock. Likely, if we fail to satisfy such small allocation it's likely the kernel won't be able to finish booting anyway. Adding a comment as to why BUS_DMA_NOWAIT is used (ie: not because the context prevents using BUS_DMA_WAITOK) would be OK also. But I suppose it's also ok to allow the reclaims+retries, so, I've switched `bus_dmamem_alloc()` from `BUS_DMA_NOWAIT` to `BUS_DMA_WAITOK` in this version as per this suggestion.

imp added inline comments.Jun 30 2021, 6:14 PM

sys/dev/kvm_clock/kvm_clock.c
151	While malloc(9) can't give these guarantees, uma(8) can. Creating the tag, allocating and loading it are basically creating a uma zone behind the scenes and using them directly would be better than using busdma just to get alignment.

kib added inline comments.Jun 30 2021, 6:59 PM

sys/dev/kvm_clock/kvm_clock.c
151	We have malloc_domainset_aligned(), added exactly to export the alignment guarantees from the malloc subsystem.

royger added inline comments.Jul 1 2021, 11:09 AM

sys/dev/kvm_clock/kvm_clock.c
151	Thanks! That's exactly what's needed here. I think this function has not been added to the malloc(9) man page? Would it make sense to also add some syntactic sugar and introduce a: #define malloc_aligned(s, a, t, f) malloc_domainset_aligned(s, a, t, DOMAINSET_RR(), f) Or similar.

kib added inline comments.Jul 1 2021, 11:22 AM

sys/dev/kvm_clock/kvm_clock.c
151	Of course, feel free to add. There were no users for it until now.

Use malloc_domainset_aligned() as per @kib's suggestion.

Harbormaster completed remote builds in B40221: Diff 91630.Jul 1 2021, 7:04 PM

Yes, thanks a lot, @kib, for pointing out malloc_domainset_aligned()! Much better! (Doh---I think I did see this a while back but misinterpreted its relative lack of use and absence from malloc(9) as discouraging new uses).

I've switched to malloc_domainset_aligned() for now so that this work isn't potentially blocked on separate submission(s) to add a malloc_aligned() and update malloc(9) etc, figuring we can easily switch over to such a malloc_aligned() if/when appropriate.

adam_fenn.io marked 5 inline comments as done.Jul 1 2021, 7:33 PM

kib added inline comments.Jul 1 2021, 7:55 PM

lib/libc/x86/sys/__vdso_gettc.c
363 ↗	(On Diff #91630)	I do not quite follow this. Why do you need to set pvclock_vcpu_info to MAP_FAILED in advance, during init? Basically, this introduces somewhat unpleasant and very hard to diagnose case, where two threads happen to execute e.g. gettimeofday(), and then one of them falls back to syscall. Unpleasant part is that on the next gettimeofday() call this thread would use vdso path, and kernel vs. usermode timecounters might be relatively off. I tried to avoid this with HPET. Also note use of acq/rel atomics to not rely implicitly on x86 TSO.
sys/dev/kvm_clock/kvm_clock.c
96	This blank line is excessive.
227	Excessive blank line.
229	And this one. It would be too much spam to comment about all of them.
253	Why do you need newbus attachment for this thing? Due to the clock kobj interface? But is it required?
sys/x86/x86/pvclock.c
136 ↗	(On Diff #91630)	Please do not use these linuxisms. I suspect you need load_acq(&wc->version) instead of these two lines ...
139 ↗	(On Diff #91630)	.. and then atomic_thread_fence_acq() instead of this rmb(). But where is the writer?

royger added inline comments.Jul 2 2021, 7:26 AM

sys/x86/x86/pvclock.c
139 ↗	(On Diff #91630)	The writer is the hypervisor.

adam_fenn.io mentioned this in D31004: malloc(9): Document/add aligned variants.Jul 2 2021, 9:27 PM

Initial/partial round of addressing @kib's feedback.

Harbormaster completed remote builds in B40371: Diff 91941.Jul 7 2021, 5:50 PM

Thanks a lot for the review/feedback, @kib! I'm thinking about the questions raised by the init-related item; this revision attempts to address the other items in the meantime.

lib/libc/x86/sys/__vdso_gettc.c
363 ↗	(On Diff #91630)	hm... Yeah, this was assuming that it's always "ok"---in that a thread will never see a syscall-based time reading that is less than a previous syscall- or vDSO-based time reading---for a thread to fall back to the syscall path and, based on that assumption, was deliberately allowing the described race in exchange for simplified init code. If it can't be assumed to be safe to switch back and forth between the vDSO and syscall codepaths, then maybe we can't currently support the vDSO codepath for this clock source as the timekeeping code presently stands. (I'll think about this more). But doesn't the hyperv PV TSC vDSO codepath make this same assumption, since a thread can end up switching between the vDSO and syscall codepaths at any time based on whether `TscSequence == 0` (`hyperv_ref_tsc->tsc_seq == 0`)? I want to make sure I have a correct and detailed understanding of the specific vDSO/syscall mismatch problem(s) you're referencing. Would you be able to provide example(s) of the specific way(s) that syscall-based and vDSO-based time values can differ that are being referenced here? I think I can see a case where two calls to `clock_gettime()` for `CLOCK_MONOTONIC_FAST` or `CLOCK_UPTIME_FAST`---which only use `th_offset` without adding the subsequent delta---could lead to the second call's value being less than the first. This would happen if a thread managed to make a syscall-based reading followed by a vDSO-based reading that both occurred after the in-kernel timehands update but before the corresponding vDSO-based timehands update. Is this an example of the sorts of cases you're thinking of? Am I understanding correctly if I'm thinking that this same scenario but for `CLOCK_{MONOTONIC,UPTIME}{,_PRECISE}`---which use `th_offset` plus a `th_offset_count`-relative delta---would not have this problem because the `th_offset_count`-based delta that would be included in each reading would be relative to its respective `th_offset` value?
sys/dev/kvm_clock/kvm_clock.c
253	Right, all of this is as per the clock/RTC subsystem (`sys/clock.h`, `kern/subr_rtc.c`), which does use the clock kobj interface's entry points.

adam_fenn.io marked an inline comment as done.Jul 7 2021, 7:34 PM

Switch the vDSO and syscall codepaths over to a common (and freestanding) version of __vdso_gettc_rdtsc().

(A previous revision accidentally switched the vDSO codepath from __vdso_gettc_rdtsc() to rdtsc(); this fixes that).

Harbormaster completed remote builds in B40421: Diff 92030.Jul 9 2021, 9:10 PM

Can you extract the userspace/kernel code unification with rdtsc_ordered.h into separate patch?

sys/x86/include/_rdtsc_ordered.h
50 ↗	(On Diff #92030)	We do not indent function definitions in macros typically. Please remove these 4 spaces.
77 ↗	(On Diff #92030)	There is absolutely no reason to defined this macro. Both kernel and userspace variants are used once, so you could just put the corresponding code directly into libc and kernel, and avoid obfuscation through the macro. In fact, I think that the better approach is to avoid macros at all. Put the code you want to share (as I understand, this is DEFINE_RDTSC_ORDERED_COMMON) into e.g. sys/x86/include/rdtsc_fenced.h directly, and include it into libc "vdso" and kernel rdtsc.h

Rebase, address @kib's review feedback, and related clean-up.

(Regarding rdtsc_ordered() and rdtscp_aux()---I plan to split these into separate submissions, but, before doing so, figure we might as well wait to see if they are still present in an approved version of this submission).

Harbormaster completed remote builds in B40797: Diff 93092.Aug 2 2021, 1:59 AM

adam_fenn.io marked an inline comment as done.Aug 2 2021, 2:00 AM

adam_fenn.io mentioned this in D31415: x86: cpufunc: Add rdtscp_aux().Aug 5 2021, 2:16 AM

adam_fenn.io mentioned this in D31416: x86: cpufunc: Add rdtsc_ordered().Aug 5 2021, 2:25 AM

adam_fenn.io mentioned this in D31417: x86: files: Make sys/x86/x86/pvclock.c optional.Aug 5 2021, 2:27 AM

adam_fenn.io mentioned this in D31418: pvclock: Add vDSO support.Aug 5 2021, 2:29 AM

No functional changes intended. Since the last revision:

Rebase again to pick up malloc_aligned() now that D31004 has been committed.
kvm_clock_attach(): Use malloc_aligned() instead of malloc_domainset_aligned().
Split out freestanding portions into separate submissions.

Harbormaster completed remote builds in B40863: Diff 93248.Aug 5 2021, 2:41 AM

adam_fenn.io added parent revisions: D31004: malloc(9): Document/add aligned variants, D31415: x86: cpufunc: Add rdtscp_aux(), D31416: x86: cpufunc: Add rdtsc_ordered(), D31417: x86: files: Make sys/x86/x86/pvclock.c optional, D31418: pvclock: Add vDSO support.Aug 5 2021, 2:43 AM

kib added inline comments.Aug 5 2021, 2:34 PM

sys/dev/kvm_clock/kvm_clock.c
113	if ((regs[0] & (KVM_FEATURE_CLOCKSOURCE \| KVM_FEATURE_CLOCKSOURCE2)) == 0)
144	Should this case be assert? Is it possible that _CLOCKSOURCE feature disappears?
149	malloc() for PAGE_SIZE and larger is a strange choice. Why not use kmem_malloc()? For instance, ldt allocation on x86 uses kmem_malloc(), and it is quite similar to kvm clock case.

Address @kib's feedback.

Harbormaster completed remote builds in B40880: Diff 93293.Aug 5 2021, 10:58 PM

adam_fenn.io marked 3 inline comments as done.Aug 5 2021, 11:07 PM

adam_fenn.io added inline comments.

sys/dev/kvm_clock/kvm_clock.c
149	I see that `kmem_malloc()`, internally, always returns page-aligned allocations, but does it formally guarantee this condition at the API level like `malloc_{,domainset_}aligned(9)` do? There also seem to be so few uses of it relative to `malloc(9)`, especially among drivers---should we maintain that precedent here?

kib added inline comments.Aug 5 2021, 11:14 PM

sys/dev/kvm_clock/kvm_clock.c
149	Yes, the kmem_malloc() KPI fundamental property is that it operates on the page granularity. As I said, the intent there (kvmclock) matches the KPI purpose perfectly. It is one-off allocation from the platform-specific driver, which is tightly integrated with the rest of the system.

Address @kib's feedback.

Harbormaster completed remote builds in B40883: Diff 93296.Aug 5 2021, 11:46 PM

adam_fenn.io marked an inline comment as done.Aug 5 2021, 11:48 PM

adam_fenn.io added inline comments.

sys/dev/kvm_clock/kvm_clock.c
149	Ok, sounds good, then. Thanks for the clarification!

adam_fenn.io removed a parent revision: D31004: malloc(9): Document/add aligned variants.Aug 6 2021, 5:28 AM

adam_fenn.io marked an inline comment as done.Aug 6 2021, 5:34 AM

kib accepted this revision.Aug 6 2021, 6:27 PM

kib added inline comments.

sys/dev/kvm_clock/kvm_clock.c
149	Outer () are not needed.
153	round_page() is not needed, kmem_malloc() handles roundup internally

This revision is now accepted and ready to land.Aug 6 2021, 6:27 PM

Address @kib's feedback.

This revision now requires review to proceed.Aug 6 2021, 8:02 PM

Harbormaster completed remote builds in B40898: Diff 93337.Aug 6 2021, 8:02 PM

adam_fenn.io marked 2 inline comments as done.Aug 6 2021, 8:02 PM

This revision was not accepted when it landed; it landed in state Needs Review.Aug 14 2021, 12:59 PM

Closed by commit rG6c69c6bb4c7f: kvm_clock: KVM paravirtual clock support (authored by adam_fenn.io, committed by kib). · Explain Why

This revision was automatically updated to reflect the committed changes.

kib added a commit: rG6c69c6bb4c7f: kvm_clock: KVM paravirtual clock support.

kib added a commit: rG6fa88a627d5e: kvm_clock: KVM paravirtual clock support.Oct 12 2021, 4:04 PM

kvmclock driver with vDSO support
ClosedPublic
Actions

Details

Diff Detail

Event Timeline

Revision Contents
Changeset List

Diff 93677

sys/amd64/conf/GENERIC

sys/amd64/conf/MINIMAL

sys/amd64/conf/NOTES

sys/conf/files.x86

sys/dev/kvm_clock/kvm_clock.c

sys/i386/conf/GENERIC

sys/i386/conf/MINIMAL

sys/i386/conf/NOTES

sys/x86/include/kvm.h

kvmclock driver with vDSO supportClosedPublicActions

Details

Diff Detail

Event Timeline

Revision ContentsChangeset List

Diff 93677

sys/amd64/conf/GENERIC

sys/amd64/conf/MINIMAL

sys/amd64/conf/NOTES

sys/conf/files.x86

sys/dev/kvm_clock/kvm_clock.c

sys/i386/conf/GENERIC

sys/i386/conf/MINIMAL

sys/i386/conf/NOTES

sys/x86/include/kvm.h

kvmclock driver with vDSO support
ClosedPublic
Actions

Revision Contents
Changeset List