Page MenuHomeFreeBSD

hyperv/storvsc: storvsc_io_done(): do not use CAM_SEL_TIMEOUT
ClosedPublic

Authored by decui_microsoft.com on Mar 14 2018, 7:07 AM.
Tags
None
Referenced Files
Unknown Object (File)
Sun, Jan 12, 2:37 PM
Unknown Object (File)
Dec 3 2024, 3:37 AM
Unknown Object (File)
Nov 22 2024, 8:14 PM
Unknown Object (File)
Nov 21 2024, 11:33 PM
Unknown Object (File)
Nov 21 2024, 6:43 AM
Unknown Object (File)
Oct 31 2024, 12:01 PM
Unknown Object (File)
Oct 21 2024, 9:45 AM
Unknown Object (File)
Oct 3 2024, 6:53 PM
Subscribers
None

Details

Summary

CAM_SEL_TIMEOUT was introduced in https://reviews.freebsd.org/D7521 (r304251),
which claimed:

"VM shall response to CAM layer with CAM_SEL_TIMEOUT to filter those
invalid LUNs. Never use CAM_DEV_NOT_THERE which will block LUN scan
for LUN number higher than 7."

But it turns out this is not correct:

  1. I think what really filters the invalid LUNs in r304251 is that:

before r304251, we could set the CAM_REQ_CMP without checking
vm_srb->srb_status at all:
ccb->ccb_h.status |= CAM_REQ_CMP.

r304251 checks vm_srb->srb_status and sets ccb->ccb_h.status properly, so
the invalid LUNs are filtered.

  1. I changed my code version to r304251 but replaced the CAM_SEL_TIMEOUT

with CAM_DEV_NOT_THERE, and I confirmed the invalid LUNs can also be
filtered, and I successfully hot-added and hot-removed 8 disks to/from
the VM without any issue.

  1. CAM_SEL_TIMEOUT has an unwanted side effect -- see cam_periph_error():
    • For a selection timeout, we consider all of the LUNs on
    • the target to be gone. If the status is CAM_DEV_NOT_THERE,
    • then we only get rid of the device(s) specified by the
    • path in the original CCB.

This means: for a VM with a valid LUN on 3:0:0:0, when the VM inquires
3:0:0:1 and the host reports 3:0:0:1 doesn't exist and storvsc returns
CAM_SEL_TIMEOUT to the CAM layer, CAM will detech 3:0:0:0 as well: this
is the bug I reported recently:
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=226583

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

I manually tested the patch with releng/{10.3, 10.4, 11.0, 11.1}, stable/{10, 11), and master on local Windows Server 2012 R2, 2016 and Azure, and I don't hit any regression. For Azure testing, I created multiple VMs (Standard D4 v2 (8 vcpus, 28 GB memory) ) in West Central US region and ran a lot of "hot add/remove disks" tests, e.g.
a) hot add and remove a disk on LUN0
b) hot add and remove a disk on LUN1
c) hot add and remove a disk on LUN9
d) hot add and remove a disk on LUN0, and LUN9
e) hot add and remove a disk on LUN1, and LUN9
f) hot add and remove a disk on LUN0~7
g) hot add and remove a disk on LUN0~6, LUN8
h) hot add and remove a disk on LUN1, 9, 10, 11, 12, 13, 14, 15
i) hot add and remove a disk on LUN0, 9, 10, 11, 12, 15, 18, 63
I didn't find any issue.

Hongxiong also helped to test it on 2012 R2 and 2016. No issue. Hongxiong is going to run more test cases on Azure.

CAM_DEV_NOT_THERE is a proper error code for the case of absent LUN. CAM_SEL_TIMEOUT means absent target, which may be too aggressive if used improperly.

This revision is now accepted and ready to land.Mar 14 2018, 5:14 PM

If there is no new comment, I'm going to commit the patch next week.

This revision was automatically updated to reflect the committed changes.