
virtio_random: pipeline fetching the data. This hides latencies of up to 500us, during which we would otherwise be busy looping.
ClosedPublic

Authored by jmg on Mar 3 2023, 6:47 PM.

Details

Reviewers
delphij
markm
Group Reviewers
csprng
Summary

This was tested w/ qemu on arm64, where the median latency was previously
over 64us but is now half that (for the calls where data is available);
most of the time there is no data, and in that case the call returns in
1-2us.

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Skipped
Unit
Tests Skipped
Build Status
Buildable 50135
Build 47027: arc lint + arc unit

Event Timeline

jmg requested review of this revision. Mar 3 2023, 6:47 PM

This change was inspired by noticing that CPU usage was significantly increased when running w/ the virtio_random module loaded. W/o it loaded, the harvest thread was consuming ~.02% CPU, but with it loaded, it was consuming ~1% CPU (and there's a bug report of it consuming 100% CPU, https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=269823, which this would partially fix).

This change drops the CPU usage w/ virtio loaded down to ~.1%.
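The core idea can be sketched as a minimal userland model (the "queue" below is a stub standing in for the virtqueue; this is NOT the actual driver code): keep one request always in flight, so the read path either returns data that is already waiting or returns 0 immediately, instead of enqueueing and then busy-looping until the hypervisor responds.

```c
/*
 * Minimal userland model of the pipelining idea.  The queue is a stub
 * standing in for the virtqueue; names and layout are illustrative.
 */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define HARVESTSIZE 2

static unsigned int pending[HARVESTSIZE];  /* buffer handed to the "device" */
static bool inflight;                      /* request outstanding?          */
static bool ready;                         /* device filled the buffer?     */

static void enqueue(void) { inflight = true; ready = false; }

/* Stub for the hypervisor completing the outstanding request. */
static void
device_completes(void)
{
	if (inflight) {
		pending[0] = 0x1234;
		pending[1] = 0x5678;
		ready = true;
	}
}

/* Returns bytes copied, or 0 when nothing is ready yet; the caller
 * just tries again later instead of spinning. */
static size_t
pipelined_read(void *buf, size_t len)
{
	if (!inflight) {		/* first call: prime the pipeline */
		enqueue();
		return (0);
	}
	if (!ready)			/* data not back from the device yet */
		return (0);
	if (len > sizeof(pending))
		len = sizeof(pending);
	memcpy(buf, pending, len);
	memset(pending, 0, sizeof(pending));	/* explicit_bzero stand-in */
	enqueue();			/* keep the next request in flight */
	return (len);
}
```

The 1-2us "no data" path above is the `!ready` early return; the old code would instead have looped until the device responded.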

I ran some tests using dtrace, using the script: https://www.funkthat.com/gitea/jmg/scripts/src/branch/main/dtrace/virtio_random.d

This showed major latency. In one run (below), notice how one call took over 4ms, and during that entire time we were busy-waiting on the CPU, burning energy and CPU cycles to do nothing useful:

  value  count
   2048      0
   4096      4
   8192      6
  16384     50
  32768     99
  65536     96
 131072     53
 262144      8
 524288      6
1048576      4
2097152      1
4194304      1

With this change:

 value  count
   128      0
   256      2
   512     14
  1024    232
  2048    124
  4096      7
  8192      3
 16384      6
 32768     58
 65536     61
131072     11
262144      6
524288      0

As you can see, the longest latency is now less than 500µs: an 8x performance improvement.
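For readers unfamiliar with the output above: these are dtrace quantize() histograms, where each row counts the calls whose latency (in ns) fell into a power-of-two bucket. A toy version of the binning (illustrative only, not dtrace internals):

```c
/*
 * Round a latency (in ns) down to its power-of-two bucket, the way
 * dtrace's quantize() aggregation bins values.  Illustration only.
 */
#include <assert.h>
#include <stdint.h>

static uint64_t
quantize_bucket(uint64_t ns)
{
	uint64_t bucket = 1;

	while (bucket * 2 <= ns)
		bucket *= 2;
	return (bucket);
}
```

So the 4194304 row in the first run is the call that took over 4ms, and the worst populated bucket after the change is 262144ns, i.e. ~262µs.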

This revision is now accepted and ready to land. Mar 3 2023, 7:17 PM

Nice observation and fix.

sys/dev/virtio/random/virtio_random.c
63

Paired with the full-size nominal alignment this is supposed to guarantee we don't straddle a page boundary, I guess? I'm not sure big alignments like (4 * HARVESTSIZE) are actually honored, though. The softc backing memory is allocated by the bus which is unaware of vtrnd_softc. It would probably be safer to have the data heap-allocated and pointed to from the softc, rather than embedded.

jmg marked an inline comment as done. Mar 3 2023, 10:57 PM
jmg added inline comments.
sys/dev/virtio/random/virtio_random.c
63

Paired with the full-size nominal alignment this is supposed to guarantee we don't straddle a page boundary, I guess? I'm not sure big alignments like (4 * HARVESTSIZE) are actually honored, though. The softc backing memory is allocated by the bus which is unaware of vtrnd_softc. It would probably be safer to have the data heap-allocated and pointed to from the softc, rather than embedded.

So, HARVESTSIZE is 2, so this is an alignment of 8, which I'm pretty sure is honored/allowed.

I just ran an experiment to verify that the 8-byte alignment is preserved. I added the following line:
_Static_assert(offsetof(struct vtrnd_softc, vtrnd_value) == (4 * 8 + 8 + 16 + 16), "offset correct");

and compiled the code to confirm the layout (4 pointers, a bool, and two structs of 16 bytes each). Then I added a bool variable immediately before the member and added 8 to the expected offset, and it still compiled. Finally, the expected offset dropped to only +4 when I removed the aligned attribute, so it is working as intended.

If you want to be extra safe, I can add the following:
_Static_assert((offsetof(struct vtrnd_softc, vtrnd_value) % (sizeof(uint32_t) * HARVESTSIZE)) == 0, "may cross page boundary");

which enforces that the alignment is kept, though if we want to be REALLY careful we should also make sure the size stays 8 in case it changes. The absolute smallest alignment is pointer-sized (this seems like it should be fixed to be at least 8 if not 16; see UMA_ALIGN_PTR). At least by my reading of the code, softc allocations are aligned to powers of 2 starting at 16 bytes; as the structure is 80 bytes, the softc will be aligned to 128, the next power of 2.
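The offset experiment above can be reproduced in userland with a toy struct (the field layout is illustrative, NOT the real vtrnd_softc): the aligned(8) attribute forces the member's offset to a multiple of 8 even though a 1-byte bool precedes it.

```c
/*
 * Userland reproduction of the offset experiment: 4 pointers, a bool,
 * then the aligned member.  Layout is illustrative only.
 */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct toy_softc {
	void		*p1, *p2, *p3, *p4;
	bool		 suspended;
	uint32_t	 value[2] __attribute__((aligned(8)));
};

/* Compile-time equivalent of the check discussed above.  Note this only
 * constrains the offset WITHIN the struct; it still assumes the struct
 * itself is allocated with at least 8-byte alignment. */
_Static_assert(offsetof(struct toy_softc, value) % 8 == 0,
    "value not 8-byte aligned within the struct");
```

Removing the aligned attribute drops the member's alignment requirement to that of uint32_t (4), which is the "+4 instead of +8" result described above.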

For another day: add a virtio function that takes a pointer+length and does the necessary work to create the sglist_seg and sglist so we don't have to worry about it. That is, given size / PAGE_SIZE + 2 segs and a pointer, it would do the required splitting so this can never be a problem.
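A sketch of what that splitting could look like (the names and the userland PAGE_SIZE here are hypothetical; the real helper would feed the segments into an sglist): cut (addr, len) at every page boundary, so no segment ever crosses a page, using at most len / PAGE_SIZE + 2 segments.

```c
/*
 * Hypothetical sketch of the pointer+length -> page-bounded segments
 * helper mused about above.  Userland toy; PAGE_SIZE hardcoded.
 */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGE_SIZE	4096UL
#define TOY_PAGE_MASK	(TOY_PAGE_SIZE - 1)

struct toy_seg {
	uintptr_t	addr;
	size_t		len;
};

static size_t
split_at_pages(uintptr_t addr, size_t len, struct toy_seg *segs,
    size_t maxsegs)
{
	size_t n = 0;

	while (len > 0 && n < maxsegs) {
		/* Bytes remaining in the current page. */
		size_t room = TOY_PAGE_SIZE - (addr & TOY_PAGE_MASK);
		size_t take = len < room ? len : room;

		segs[n].addr = addr;
		segs[n].len = take;
		n++;
		addr += take;
		len -= take;
	}
	return (n);
}
```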

rebuilt with this patch, here's the disas:

Dump of assembler code for function vtrnd_read:

0xffffffff82b05340 <+0>:     55      push   %rbp
0xffffffff82b05341 <+1>:     48 89 e5        mov    %rsp,%rbp
0xffffffff82b05344 <+4>:     41 57   push   %r15
0xffffffff82b05346 <+6>:     41 56   push   %r14
0xffffffff82b05348 <+8>:     41 54   push   %r12
0xffffffff82b0534a <+10>:    53      push   %rbx
0xffffffff82b0534b <+11>:    48 83 ec 10     sub    $0x10,%rsp
0xffffffff82b0534f <+15>:    48 8b 1d 3a 1f 00 00    mov    0x1f3a(%rip),%rbx        # 0xffffffff82b07290 <g_vtrnd_softc>
0xffffffff82b05356 <+22>:    45 31 f6        xor    %r14d,%r14d
0xffffffff82b05359 <+25>:    48 85 db        test   %rbx,%rbx
0xffffffff82b0535c <+28>:    74 4e   je     0xffffffff82b053ac <vtrnd_read+108>
0xffffffff82b0535e <+30>:    80 7b 20 00     cmpb   $0x0,0x20(%rbx)
0xffffffff82b05362 <+34>:    75 48   jne    0xffffffff82b053ac <vtrnd_read+108>
0xffffffff82b05364 <+36>:    41 89 f4        mov    %esi,%r12d
0xffffffff82b05367 <+39>:    49 89 ff        mov    %rdi,%r15
0xffffffff82b0536a <+42>:    48 8b 7b 10     mov    0x10(%rbx),%rdi
0xffffffff82b0536e <+46>:    48 8d 75 dc     lea    -0x24(%rbp),%rsi
0xffffffff82b05372 <+50>:    e8 29 be f0 fd  call   0xffffffff80a111a0 <virtqueue_dequeue>
0xffffffff82b05377 <+55>:    48 85 c0        test   %rax,%rax
0xffffffff82b0537a <+58>:    74 30   je     0xffffffff82b053ac <vtrnd_read+108>
0xffffffff82b0537c <+60>:    44 8b 75 dc     mov    -0x24(%rbp),%r14d
0xffffffff82b05380 <+64>:    45 39 f4        cmp    %r14d,%r12d
0xffffffff82b05383 <+67>:    45 0f 42 f4     cmovb  %r12d,%r14d
0xffffffff82b05387 <+71>:    4c 8d 63 48     lea    0x48(%rbx),%r12
0xffffffff82b0538b <+75>:    4c 89 ff        mov    %r15,%rdi
0xffffffff82b0538e <+78>:    4c 89 e6        mov    %r12,%rsi
0xffffffff82b05391 <+81>:    4c 89 f2        mov    %r14,%rdx
0xffffffff82b05394 <+84>:    e8 b7 66 5d fe  call   0xffffffff810dba50 <memcpy_erms>
0xffffffff82b05399 <+89>:    4c 89 e7        mov    %r12,%rdi
0xffffffff82b0539c <+92>:    4c 89 f6        mov    %r14,%rsi
0xffffffff82b0539f <+95>:    e8 4c 93 1f fe  call   0xffffffff80cfe6f0 <explicit_bzero>
0xffffffff82b053a4 <+100>:   48 89 df        mov    %rbx,%rdi
0xffffffff82b053a7 <+103>:   e8 e4 fe ff ff  call   0xffffffff82b05290 <vtrnd_enqueue>
0xffffffff82b053ac <+108>:   44 89 f0        mov    %r14d,%eax
0xffffffff82b053af <+111>:   48 83 c4 10     add    $0x10,%rsp
0xffffffff82b053b3 <+115>:   5b      pop    %rbx
0xffffffff82b053b4 <+116>:   41 5c   pop    %r12
0xffffffff82b053b6 <+118>:   41 5e   pop    %r14
0xffffffff82b053b8 <+120>:   41 5f   pop    %r15
0xffffffff82b053ba <+122>:   5d      pop    %rbp
0xffffffff82b053bb <+123>:   c3      ret

End of assembler dump.

based on the repeating pattern:
CPU ID FUNCTION:NAME

3  73327                     vtrnd_read:0 
3  73328                     vtrnd_read:1 
3  73329                     vtrnd_read:4 
3  73330                     vtrnd_read:6 
3  73331                     vtrnd_read:8 
3  73332                    vtrnd_read:10 
3  73333                    vtrnd_read:11 
3  73334                    vtrnd_read:15 
3  73335                    vtrnd_read:22 
3  73336                    vtrnd_read:25 
3  73337                    vtrnd_read:28 
3  73338                    vtrnd_read:30 
3  73339                    vtrnd_read:34 
3  73340                    vtrnd_read:36 
3  73341                    vtrnd_read:39 
3  73342                    vtrnd_read:42 
3  73343                    vtrnd_read:46 
3  73344                    vtrnd_read:50 
3  73345                    vtrnd_read:55 
3  73346                    vtrnd_read:58 
3  73360                   vtrnd_read:108 
3  73361                   vtrnd_read:111 
3  73362                   vtrnd_read:115 
3  73363                   vtrnd_read:116 
3  73364                   vtrnd_read:118 
3  73365                   vtrnd_read:120 
3  73366                   vtrnd_read:122 
3  73367                   vtrnd_read:123 
3  73327                     vtrnd_read:0


we go from instruction 58 to 108; I'm too tired to understand what that means right now

sys/dev/virtio/random/virtio_random.c
63

I forgot HARVESTSIZE was just 2 and this is only 8 bytes, sorry.

Yeah, the softc will probably be 8-byte aligned on any 64-bit architecture with or without the explicit annotation. I just don't think the annotation is causing that. I'm less sure we always 8-byte align on 32-bit arch, but it's definitely possible. I was more concerned when I was mistakenly picturing a larger vtrnd_value.

Then I changed the "expected" offset to only +4 if I removed the aligned value, so it is working as intended.

This relies on the softc itself being suitably aligned. The member offset could be correct but if the softc is allocated misaligned, the member won't be aligned either.

For another day would be to have a virtio function that takes a pointer+length, and does the necessary work to create the sglist_seg and sglist so we don't have to worry about it.. that is given (size / PAGE_SIZE + 2 segs, and a pointer

sglist_build? https://github.com/freebsd/freebsd-src/blob/1c09320d5833fef8a4b6cc0091883fd47ea1eb1b/sys/kern/subr_sglist.c#L697

jmg marked an inline comment as done. Mar 3 2023, 11:43 PM

rebuilt with this patch, here's the disas:

Dump of assembler code for function vtrnd_read:

0xffffffff82b05340 <+0>:     55      push   %rbp
0xffffffff82b05341 <+1>:     48 89 e5        mov    %rsp,%rbp
0xffffffff82b05344 <+4>:     41 57   push   %r15
0xffffffff82b05346 <+6>:     41 56   push   %r14
0xffffffff82b05348 <+8>:     41 54   push   %r12
0xffffffff82b0534a <+10>:    53      push   %rbx
0xffffffff82b0534b <+11>:    48 83 ec 10     sub    $0x10,%rsp
0xffffffff82b0534f <+15>:    48 8b 1d 3a 1f 00 00    mov    0x1f3a(%rip),%rbx        # 0xffffffff82b07290 <g_vtrnd_softc>
0xffffffff82b05356 <+22>:    45 31 f6        xor    %r14d,%r14d
0xffffffff82b05359 <+25>:    48 85 db        test   %rbx,%rbx
0xffffffff82b0535c <+28>:    74 4e   je     0xffffffff82b053ac <vtrnd_read+108>
0xffffffff82b0535e <+30>:    80 7b 20 00     cmpb   $0x0,0x20(%rbx)
0xffffffff82b05362 <+34>:    75 48   jne    0xffffffff82b053ac <vtrnd_read+108>
0xffffffff82b05364 <+36>:    41 89 f4        mov    %esi,%r12d
0xffffffff82b05367 <+39>:    49 89 ff        mov    %rdi,%r15
0xffffffff82b0536a <+42>:    48 8b 7b 10     mov    0x10(%rbx),%rdi
0xffffffff82b0536e <+46>:    48 8d 75 dc     lea    -0x24(%rbp),%rsi
0xffffffff82b05372 <+50>:    e8 29 be f0 fd  call   0xffffffff80a111a0 <virtqueue_dequeue>
0xffffffff82b05377 <+55>:    48 85 c0        test   %rax,%rax
0xffffffff82b0537a <+58>:    74 30   je     0xffffffff82b053ac <vtrnd_read+108>

[...]

0xffffffff82b053ac <+108>:   44 89 f0        mov    %r14d,%eax
0xffffffff82b053af <+111>:   48 83 c4 10     add    $0x10,%rsp
0xffffffff82b053b3 <+115>:   5b      pop    %rbx
0xffffffff82b053b4 <+116>:   41 5c   pop    %r12
0xffffffff82b053b6 <+118>:   41 5e   pop    %r14
0xffffffff82b053b8 <+120>:   41 5f   pop    %r15
0xffffffff82b053ba <+122>:   5d      pop    %rbp
0xffffffff82b053bb <+123>:   c3      ret

End of assembler dump.

we go from instruction 58 to 108; I'm too tired to understand what that means right now

So, this is simply saying that it attempted to _dequeue (read) the data, the dequeue failed, and so it returned 0... I assume that you're also not seeing 100% CPU usage on the harvest thread anymore?

This is the expected behavior for your case. In mine, where the HV does provide data, w/o https://reviews.freebsd.org/D38897 I'll see the above 3 out of 4 times, because the harvester tries to read the data 4 times in quick succession but there is only data for the last of the 4 polls.

So, I think what might be wrong is that you don't have another virtio driver to "unstick" the virtio queue. FreeBSD only has drivers for balloon, block, console, network, entropy and scsi. I looked at your config, and the only other device you define that is supported by FreeBSD is balloon. In my config, I'm using both network and block so it's possible that it's working here only because there is additional disk/network traffic causing the rng events to get dequeued...

jmg marked an inline comment as done. Mar 3 2023, 11:58 PM
jmg added inline comments.
sys/dev/virtio/random/virtio_random.c
63

you must have skipped over this statement:

softc allocations are aligned to power of 2's starting at 16 bytes, as the structure is 80 bytes, the softc will be aligned to 128, which is the next power of 2.

I looked at subr_bus.c and kern_malloc.c and traced through the code to come to this conclusion. You'll need to provide me w/ stronger evidence that my reading of the code is wrong.

Also, malloc(9) says:

The malloc(), realloc(), and reallocf() functions return a kernel virtual
address that is suitably aligned for storage of any type of object, or
NULL if the request could not be satisfied (implying that M_NOWAIT was
set).

I assume that this includes the _domainset variants, and by "aligned for storage of any type of object" I take it to mean that a 256-byte type will be aligned to 256 bytes, which meshes w/ my reading of the code that it aligns to a power of 2... so, as I said, the softc allocation is 128 bytes and will be aligned to 128 bytes.

And specifically this code: https://cgit.freebsd.org/src/tree/sys/kern/kern_malloc.c#n1250
is what assures that allocations are properly aligned (except that apparently allocations > 256 && <= 384 will only be 4 byte aligned)...
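That reading can be modeled as a toy power-of-two size-class calculation (it deliberately ignores the non-power-of-two buckets, e.g. the >256 && <=384 case mentioned above; the real behavior is an implementation detail of kern_malloc, not a malloc(9) guarantee, as is noted later in this review):

```c
/*
 * Toy model of the power-of-two size-class reading: round a request up
 * to the next class, starting at 16 bytes.  Illustration only; real
 * kern_malloc has additional non-power-of-two buckets.
 */
#include <assert.h>
#include <stddef.h>

static size_t
size_class(size_t size)
{
	size_t bucket = 16;	/* smallest class in this model */

	while (bucket < size)
		bucket *= 2;
	return (bucket);
}
```

Under this model the 80-byte softc comes from the 128-byte class, hence the claimed 128-byte alignment.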

jmg marked an inline comment as done. Mar 4 2023, 12:13 AM
jmg added inline comments.
sys/dev/virtio/random/virtio_random.c
63

sglist_build? https://github.com/freebsd/freebsd-src/blob/1c09320d5833fef8a4b6cc0091883fd47ea1eb1b/sys/kern/subr_sglist.c#L697

nice, it exists. It would be nicer if there were a version that used a static allocation, but now that the buffer is in the softc it could be used, since the address of the softc won't change. Per the above, though, I'm convinced it won't cross a page boundary, so it doesn't matter/doesn't need to be. I've spent way too much time today verifying that the allocation will be properly aligned.

sys/dev/virtio/random/virtio_random.c
63

No, I didn’t miss that part. Your response reminds me why I don’t participate in FreeBSD much anymore, though.

In D38898#885654, @jmg wrote:

So, I think what might be wrong is that you don't have another virtio driver to "unstick" the virtio queue. FreeBSD only has drivers for balloon, block, console, network, entropy and scsi. I looked at your config, and the only other device you define that is supported by FreeBSD is balloon. In my config, I'm using both network and block so it's possible that it's working here only because there is additional disk/network traffic causing the rng events to get dequeued...

you can see (part? of) my config at https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=269874
so there is a network driver, but there might not have been much going on

regardless, the fact that virtio_random requires some other driver to unstick its poll queue sounds like a bug in our virtio implementation.


I didn't see any in your config listed at: https://gist.github.com/igalic/dba54db678ad2f12b08d35504e7909de#file-qemu-conf-L87 which is why I assumed there wasn't one, but maybe a -nic on the command line automatically adds a network adapter. (I'm a qemu newbie, so I didn't even know a config file could be created till you posted yours.)

For my system, I'm using the command:

qemu-system-aarch64 -cpu host -accel hvf -smp 2 -m 4096 -drive file=pflash0.slow.img,format=raw,if=pflash,readonly=on -drive file=pflash1.slow.img,format=raw,if=pflash -M virt -device virtio-gpu-pci -display default,show-cursor=on -device qemu-xhci -device usb-kbd -device usb-tablet -monitor unix:qemu-monitor-socket,server,nowait -device intel-hda -device hda-duplex -device virtio-rng-pci -drive file=vm.raw,format=raw,if=virtio,cache=writethrough -nographic -serial mon:stdio

So we need to enlist someone who knows virtio and/or qemu to debug this further. I don't see anything wrong w/ the virtio_random driver, but I'm also not at all familiar w/ virtio.

In D38898#885682, @jmg wrote:

I didn't see any in your config listed at: https://gist.github.com/igalic/dba54db678ad2f12b08d35504e7909de#file-qemu-conf-L87 which is why I assumed there wasn't one, but maybe a -nic on the command line automatically adds a network adapter. (I'm a qemu newbie, so I didn't even know a config file could be created till you posted yours.)

For my system, I'm using the command:

qemu-system-aarch64 -cpu host -accel hvf -smp 2 -m 4096 -drive file=pflash0.slow.img,format=raw,if=pflash,readonly=on -drive file=pflash1.slow.img,format=raw,if=pflash -M virt -device virtio-gpu-pci -display default,show-cursor=on -device qemu-xhci -device usb-kbd -device usb-tablet -monitor unix:qemu-monitor-socket,server,nowait -device intel-hda -device hda-duplex -device virtio-rng-pci -drive file=vm.raw,format=raw,if=virtio,cache=writethrough -nographic -serial mon:stdio

Here's how LXD starts qemu:

/snap/lxd/24483/bin/qemu-system-x86_64 -S -name test-fbsd14 -uuid 3b4bf603-cf1d-4043-b7ee-ed1ce83aa0c1 -daemonize -cpu host,hv_passthrough -nographic -serial chardev:console -nodefaults -no-user-config -sandbox on,obsolete=deny,elevateprivileges=allow,spawn=allow,resourcecontrol=deny -readconfig /var/snap/lxd/common/lxd/logs/test-fbsd14/qemu.conf -spice unix=on,disable-ticketing=on,addr=/var/snap/lxd/common/lxd/logs/test-fbsd14/qemu.spice -pidfile /var/snap/lxd/common/lxd/logs/test-fbsd14/qemu.pid -D /var/snap/lxd/common/lxd/logs/test-fbsd14/qemu.log -smbios type=2,manufacturer=Canonical Ltd.,product=LXD -runas lxd

The qemu.conf is, indeed https://gist.github.com/igalic/dba54db678ad2f12b08d35504e7909de

So we need to enlist someone who knows virtio and/or qemu to debug this further. I don't see anything wrong w/ the virtio_random driver, but I'm also not at all familiar w/ virtio.

i didn't see anything with it even before this 😅
I still suspect that there's something generally wrong with our virtio subsystem

markm added a subscriber: markm.

LGTM.

any update on this?
is this good enough to be committed?

markj added inline comments.
sys/dev/virtio/random/virtio_random.c
63

In general, the fact that malloc(9) returns a suitably aligned buffer here is an implementation detail, not a guarantee of the interface. It can become false in certain conditions ("options REDZONE" for instance) and might become false in the future (if we add additional malloc UMA zones to try and reduce internal fragmentation, for instance). Hence malloc_aligned(9).

The manual page is basically just saying that you'll get alignment up to the largest primitive C type. Drivers don't control their softc allocation, so yes, Conrad is right, vtrnd_value should in principle be dynamically allocated using malloc_aligned().

Yes, this is all somewhat moot so long as HARVESTSIZE * sizeof(uint32_t) is 8, but presumably drivers shouldn't be relying on that.

sys/dev/virtio/random/virtio_random.c
63

I'd honestly just drop the sizeof assert. It's totally bogus.
I'd also not worry about alignment, except I'd just add a KASSERT in attach that ensures the required condition, e.g. something like:
KASSERT((((uintptr_t)&sc->vtrnd_value) & PAGE_MASK) == ((((uintptr_t)&sc->vtrnd_value) + sizeof(sc->vtrnd_value)) & PAGE_MASK), ("vtrnd_value crosses a page boundary"));
which ensures no page crossing. Then if the malloc details change, it will be flagged if it matters. So we know it works and can't fail today, and this will ensure that we get errors if what we know today turns out to be wrong in the future somehow.

sys/dev/virtio/random/virtio_random.c
63

better:

KASSERT(btoc((vm_offset_t)sc->vtrnd_value) == btoc((vm_offset_t)sc->vtrnd_value + sizeof(sc->vtrnd_value) - 1), ("vtrnd_value crosses a page boundary"));

(forgot the -1 to test the last byte, not the first byte that follows).
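The corrected check can be exercised in userland (PAGE_SIZE is hardcoded here for illustration; this is a sketch, not the kernel macro): a buffer stays within one page iff its first and last byte fall on the same page, which is exactly why the -1 matters — without it, a buffer ending precisely on a page boundary would be falsely flagged as crossing.

```c
/*
 * Userland version of the corrected page-crossing check.  The -1 tests
 * the buffer's LAST byte rather than the first byte that follows it.
 */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGE_SIZE 4096UL

static bool
same_page(uintptr_t addr, size_t len)
{
	return (addr / TOY_PAGE_SIZE == (addr + len - 1) / TOY_PAGE_SIZE);
}
```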

This patch is really required to run anything serving web traffic on Hetzner's virtual machines. Can we please check it in?

This patch is really required to run anything serving web traffic on Hetzner's virtual machines. Can we please check it in?

if we can't, we need to start producing images with virtio_random.ko blocklisted

jhb added inline comments.
sys/dev/virtio/random/virtio_random.c
184

Presumably you could just init the sglist once here and not bother with redoing it for each vtrnd_enqueue call, since it's always enqueueing the same address?

In D38898#885654, @jmg wrote:

...
So, I think what might be wrong is that you don't have another virtio driver to "unstick" the virtio queue. FreeBSD only has drivers for balloon, block, console, network, entropy and scsi. I looked at your config, and the only other device you define that is supported by FreeBSD is balloon. In my config, I'm using both network and block so it's possible that it's working here only because there is additional disk/network traffic causing the rng events to get dequeued...

On what side? Queues are entirely independent in the spec and in FreeBSD. If that's what's going on it must be some dodgy implementation on the host side.

sys/dev/virtio/random/virtio_random.c
291

The prototype is static but this is not

I've merged the updated version from D41656.