Page MenuHomeFreeBSD

hyperv/storvsc: storvsc_io_done(): do not use CAM_SEL_TIMEOUT
ClosedPublic

Authored by decui_microsoft.com on Mar 14 2018, 7:07 AM.

Details

Summary

CAM_SEL_TIMEOUT was introduced in https://reviews.freebsd.org/D7521 (r304251),
which claimed:

"VM shall response to CAM layer with CAM_SEL_TIMEOUT to filter those
invalid LUNs. Never use CAM_DEV_NOT_THERE which will block LUN scan
for LUN number higher than 7."

But it turns out this is not correct:

  1. I think what really filters the invalid LUNs in r304251 is that:

before r304251, we could set the CAM_REQ_CMP without checking
vm_srb->srb_status at all:
ccb->ccb_h.status |= CAM_REQ_CMP.

r304251 checks vm_srb->srb_status and sets ccb->ccb_h.status properly, so
the invalid LUNs are filtered.

  1. I changed my code version to r304251 but replaced the CAM_SEL_TIMEOUT

with CAM_DEV_NOT_THERE, and I confirmed the invalid LUNs can also be
filtered, and I successfully hot-added and hot-removed 8 disks to/from
the VM without any issue.

  1. CAM_SEL_TIMEOUT has an unwanted side effect -- see cam_periph_error():
    • For a selection timeout, we consider all of the LUNs on
    • the target to be gone. If the status is CAM_DEV_NOT_THERE,
    • then we only get rid of the device(s) specified by the
    • path in the original CCB.

This means: for a VM with a valid LUN on 3:0:0:0, when the VM inquires
3:0:0:1 and the host reports 3:0:0:1 doesn't exist and storvsc returns
CAM_SEL_TIMEOUT to the CAM layer, CAM will detech 3:0:0:0 as well: this
is the bug I reported recently:
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=226583

Diff Detail

Repository
rS FreeBSD src repository
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

I manually tested the patch with releng/{10.3, 10.4, 11.0, 11.1}, stable/{10, 11), and master on local Windows Server 2012 R2, 2016 and Azure, and I don't hit any regression. For Azure testing, I created multiple VMs (Standard D4 v2 (8 vcpus, 28 GB memory) ) in West Central US region and ran a lot of "hot add/remove disks" tests, e.g.
a) hot add and remove a disk on LUN0
b) hot add and remove a disk on LUN1
c) hot add and remove a disk on LUN9
d) hot add and remove a disk on LUN0, and LUN9
e) hot add and remove a disk on LUN1, and LUN9
f) hot add and remove a disk on LUN0~7
g) hot add and remove a disk on LUN0~6, LUN8
h) hot add and remove a disk on LUN1, 9, 10, 11, 12, 13, 14, 15
i) hot add and remove a disk on LUN0, 9, 10, 11, 12, 15, 18, 63
I didn't find any issue.

Hongxiong also helped to test it on 2012 R2 and 2016. No issue. Hongxiong is going to run more test cases on Azure.

mav accepted this revision.Mar 14 2018, 5:14 PM

CAM_DEV_NOT_THERE is a proper error code for the case of absent LUN. CAM_SEL_TIMEOUT means absent target, which may be too aggressive if used improperly.

This revision is now accepted and ready to land.Mar 14 2018, 5:14 PM

If there is no new comment, I'm going to commit the patch next week.

This revision was automatically updated to reflect the committed changes.