bhyve ahci: Improve robustness of TRIM handling
ClosedPublic

Authored by jhb on Oct 21 2024, 4:09 PM.

Details

Summary

The previous fix for a stack buffer leak in the ahci device model
actually broke the handling of TRIM as one of the checks it added
caused TRIM commands to never be completed. This resulted in command
timeouts if a guest OS did a 'newfs -E' of an AHCI disk, for example.
Also, for the invalid case the previous check was handling, the device
model should be failing with an error rather than claiming success.

To resolve this, validate the length of a TRIM request and fail with
an error if it exceeds the maximum number of supported blocks
advertised via IDENTIFY. In addition, if the PRDT does not provide
enough data, fail the command with an error rather than performing a
partial completion.

This is somewhat complicated by the implementation of TRIM in the ahci
device model. A single TRIM request can specify multiple LBA ranges.
The device model handles this by dispatching blockif_delete() requests
one at a time. When a blockif_delete() request completes, the device
model locates the TRIM buffer and searches for the next LBA range to
handle. Previously, the device model would re-read the trim buffer
from guest memory each time. However, this was subject to some
unpleasant races if the guest changed the PRDT entries or CFIS while a
command was in flight. Instead, read the buffer of trim ranges once
and cache it across multiple internal blockif requests.

Fixes: 71fa171c6480 bhyve: Initialize stack buffer in pci_ahci
Sponsored by: The FreeBSD Foundation

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Skipped
Unit
Tests Skipped
Build Status
Buildable 60131
Build 57015: arc lint + arc unit

Event Timeline

jhb requested review of this revision. Oct 21 2024, 4:09 PM

Note that 71fa171c6480 has not been MFC'd due to these outstanding issues.

This fixes a regression from the previous fix. With current main, if you just boot a VM with an AHCI-attached disk backed by a zvol (so it supports TRIM) and run newfs -E /dev/ada0, the guest FreeBSD kernel hangs in a loop of AHCI timeouts, as mav@ worried in the previous review. I hadn't expected that the previous review was actually broken, but my guess is that the added done >= sizeof(buf) - 8 check was wrong. It probably should have been '>' instead: if you get a full 512-byte block, it will break from the loop before the last valid entry and never send a reply, leading to the hang.
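To illustrate the suspected off-by-one, here is a minimal standalone model of a bounds check that runs before each 8-byte entry is read. It is only a sketch: entries_visited() and its constants are hypothetical names for this example and do not reproduce the actual loop in pci_ahci.c.

```c
#include <assert.h>
#include <stddef.h>

#define TRIM_BUF_SIZE   512	/* one full TRIM data block */
#define TRIM_ENTRY_SIZE 8	/* one LBA range entry */

/*
 * Hypothetical model: count how many entries a scan visits before the
 * bounds check stops it.  use_ge selects the ">=" comparison suspected
 * of being wrong; otherwise the ">" comparison is used.
 */
static int
entries_visited(int use_ge)
{
	size_t done = 0;
	size_t limit = TRIM_BUF_SIZE - TRIM_ENTRY_SIZE;	/* offset of last entry */
	int visited = 0;

	assert(TRIM_BUF_SIZE % TRIM_ENTRY_SIZE == 0);
	for (;;) {
		if (use_ge ? (done >= limit) : (done > limit))
			break;
		visited++;		/* entry at offset 'done' is examined */
		done += TRIM_ENTRY_SIZE;
	}
	return (visited);
}
```

In this model the ">=" form stops at offset 504 and so skips the entry occupying bytes 504 through 511, while the ">" form visits all 64 entries of a full block.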

That said, while I could have added validation on each re-read of the CFIS and PRDT after each blockif_delete(), I prefer to avoid the weird TOCTOU-style races and just read it once and validate it once. This does fix the timeouts I see with newfs on main. I have not tested the reported bug though (I was hoping Pierre might have a reproducer guest image he can test against this?)

This approach does also make it easier if we wanted to support multi-block TRIM buffers btw. We probably should have a #define for the number of blocks and the length check I've added should be against that value (maybe before multiplying by 512?) and that value should be what we return in IDENTIFY. However, you could then just change that one knob to the desired number of blocks to support. I'm not sure it matters though? My guess is 1 block is enough for typical workloads?
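A rough sketch of what the single knob and the read-once validation could look like is below. The names TRIM_MAX_BLOCKS, struct trim_range, and trim_parse() are hypothetical and are not taken from the committed change. The entry layout (48-bit LBA followed by a 16-bit length, little-endian) is the standard DATA SET MANAGEMENT range format, and rejecting a block count of zero is a simplification for this sketch only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TRIM_BLOCK_SIZE  512	/* bytes per TRIM data block */
#define TRIM_MAX_BLOCKS  1	/* hypothetical knob; would also feed IDENTIFY */
#define TRIM_ENTRY_SIZE  8
#define TRIM_MAX_ENTRIES (TRIM_MAX_BLOCKS * TRIM_BLOCK_SIZE / TRIM_ENTRY_SIZE)

struct trim_range {		/* hypothetical cached form of one entry */
	uint64_t lba;		/* starting LBA (48 bits) */
	uint16_t count;		/* sectors to trim; never 0 once stored */
};

/*
 * Decode a TRIM payload once into 'out' (room for TRIM_MAX_ENTRIES).
 * nblocks is the block count from the command and is checked before
 * multiplying by the block size; avail is how many bytes the PRDT
 * actually supplied.  Returns the number of non-empty ranges, or -1
 * if the request should be failed with an error.
 */
static int
trim_parse(const uint8_t *buf, size_t avail, uint32_t nblocks,
    struct trim_range *out)
{
	size_t len, off;
	int i, n = 0;

	if (nblocks == 0 || nblocks > TRIM_MAX_BLOCKS)
		return (-1);
	len = (size_t)nblocks * TRIM_BLOCK_SIZE;
	if (avail < len)
		return (-1);	/* short PRDT: fail, no partial completion */

	for (off = 0; off < len; off += TRIM_ENTRY_SIZE) {
		const uint8_t *e = &buf[off];
		uint64_t lba = 0;
		uint16_t count;

		for (i = 5; i >= 0; i--)
			lba = (lba << 8) | e[i];
		count = (uint16_t)(e[6] | (e[7] << 8));
		if (count == 0)
			continue;	/* empty entries are skipped */
		assert(n < TRIM_MAX_ENTRIES);
		out[n].lba = lba;
		out[n].count = count;
		n++;
	}
	return (n);
}
```

With the ranges cached like this, each blockif_delete() completion would only need to advance an index into the array instead of re-reading guest memory.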

There are also various races (I think) with the CFIS being changed by the guest while a request is in flight. We really should be caching the CFIS for the duration of a command. The issue there, though, is that the CFIS is variable-sized. :( We could at least cache the common header, though, which I think would probably handle all of the races I can see.

usr.sbin/bhyve/pci_ahci.c
864

This being conditional in the old code did not make sense to me. I suspect it was a bug in the old code (not related to the SA), but you would only hit it if you had a TRIM buffer that was completely empty (all lengths zero).

usr.sbin/bhyve/pci_ahci.c
864

I don't remember what I was thinking back then, but looking at it now, it seems to break the recursion of ahci_handle_port() -> ahci_handle_slot() -> ahci_handle_cmd() -> ahci_handle_dsm_trim() -> ahci_handle_port().

usr.sbin/bhyve/pci_ahci.c
864

Hmmm, ok. So I should put it back then I guess.

935

Does that mean I should not call this here? This is always "first".

usr.sbin/bhyve/pci_ahci.c
935

I think so. And not only ahci_handle_port(), but I suppose previous two lines also, since the command was never marked pending.

Correct synchronous command completion handling

jhb marked 2 inline comments as done. Oct 21 2024, 7:25 PM

@mav does this version look ok? It still works for me with the basic 'newfs -E' test in a VM.

usr.sbin/bhyve/pci_ahci.c
877

Maybe a KASSERT to document that it must be ATA_SEND_FPDMA_QUEUED?

Looks good to me. Thanks.

This revision is now accepted and ready to land. Oct 23 2024, 2:20 PM
usr.sbin/bhyve/pci_ahci.c
877

Such an assertion can fail if the guest modifies the CFIS while the command is in-progress. If we care about those races then we need a separate change to read and cache the CFIS at the start of command processing and free it after the command completes. Note that if the ncq flag is "wrong" we don't crash, we just write a different result into the FIS. This might confuse the guest, but it shouldn't impact the hypervisor.

This revision was automatically updated to reflect the committed changes.
usr.sbin/bhyve/pci_ahci.c
877

Would else if (cfis[2] == ATA_SEND_FPDMA_QUEUED) make sense?

usr.sbin/bhyve/pci_ahci.c
877

But then what do you do in the third case? Especially given that this is in the continuation phase, where we have already emitted at least one trim. Also, there are many other places that read the CFIS multiple times in this device model. If we do care about such races, we will need to cache the CFIS instead of fixing all these places to fail with errors if the CFIS changed.

usr.sbin/bhyve/pci_ahci.c
877

Since there are only 32 CFIS, and since they are small, it would be better to allocate them into a slot (like real hardware does) and pass that around instead of guest memory. It would be a better emulation of the DMA that's done, since the drive sees only one version of the CFIS, and it's undefined what happens if you change the CFIS after submitting the command.

I'd also be tempted to say ncq = (cfis[2] == ATA_SEND_FPDMA_QUEUED) instead, so we only do ncq completion processing on the relatively rare ncq trim command (though we could avoid this whole mess by not advertising ncq trim support, but that would pessimize some applications that don't want to pay the queueing penalty on latency and the avoided mess is small).
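The suggested ncq assignment can be sketched as a tiny helper. The opcode values below mirror what I believe FreeBSD's sys/ata.h defines and are repeated locally only to keep the example standalone; trim_is_ncq() is a hypothetical name, not a function in pci_ahci.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed opcode values, mirrored locally for this sketch. */
#define ATA_DATA_SET_MANAGEMENT 0x06
#define ATA_SEND_FPDMA_QUEUED   0x64

/*
 * Hypothetical helper: treat the command as NCQ only when the opcode
 * byte of the CFIS is exactly SEND FPDMA QUEUED.  Any other value,
 * including one the guest rewrote mid-command, takes the non-NCQ
 * completion path.
 */
static bool
trim_is_ncq(const uint8_t *cfis)
{
	assert(cfis != NULL);
	return (cfis[2] == ATA_SEND_FPDMA_QUEUED);
}
```

This does not remove the race on the CFIS itself; it only changes which completion path an unexpected opcode falls into.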

usr.sbin/bhyve/pci_ahci.c
877

I'm happy to fix the model to cache the CFIS; that's just an orthogonal change and isn't TRIM-specific. The main thing is I didn't read the SATA (or is it ATA? I had to look at three different specs to try to understand AHCI) spec closely enough to determine what the upper bound on the CFIS size is. We can easily malloc a copy of it that we pass around, though we also still need the original address so that code can read the PRDT for commands that use it. Currently they just read from cfis + 0x80.
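A possible shape for such a cached copy is sketched below. It assumes, based on my reading of the AHCI command table layout, that the CFIS occupies at most the first 64 bytes of the table and that the command header's CFL field gives its length as 2 to 16 dwords; that bound should be checked against the specification before relying on it. struct cached_cmd, cmd_cache_fill(), and cmd_prdt() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define AHCI_CFIS_MAX	64	/* assumed upper bound on CFIS size */
#define AHCI_PRDT_OFF	0x80	/* PRDT offset within the command table */

struct cached_cmd {			/* hypothetical per-slot state */
	uint8_t		cfis[AHCI_CFIS_MAX];	/* private snapshot */
	uint8_t		cfis_len;		/* valid bytes in cfis[] */
	const uint8_t	*table;			/* guest command table */
};

/*
 * Snapshot the CFIS at the start of command processing.  cfl_dwords is
 * the CFL value from the command header.  Returns 0 on success or -1
 * if the length is outside the assumed legal range.
 */
static int
cmd_cache_fill(struct cached_cmd *cc, const uint8_t *table,
    unsigned int cfl_dwords)
{
	size_t len = (size_t)cfl_dwords * 4;

	assert(cc != NULL && table != NULL);
	if (cfl_dwords < 2 || len > AHCI_CFIS_MAX)
		return (-1);
	memcpy(cc->cfis, table, len);
	memset(cc->cfis + len, 0, AHCI_CFIS_MAX - len);
	cc->cfis_len = (uint8_t)len;
	cc->table = table;	/* kept so the PRDT can still be located */
	return (0);
}

/* The PRDT is still read from guest memory via the original address. */
static const uint8_t *
cmd_prdt(const struct cached_cmd *cc)
{
	return (cc->table + AHCI_PRDT_OFF);
}
```

Later guest writes to the command table then no longer change what the device model sees in cc->cfis, although the PRDT itself would still be read from guest memory.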