There are a number of outstanding issues with device arrival and departure in CAM. These diffs are intended as a starting point for fixing them.
These diffs fix issues in the da(4) driver and CAM in general, and add extra debugging for tracking down missing reference releases in the da(4) driver. (The extra debugging isn't intended to go in the tree as-is.)
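With the debugging in place, you'll see messages like these when a device returns before all references to the old device have been released: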
da33: ref src 0x8, refcount 2, softc refcount 0
cam_periph_alloc: attempt to re-allocate invalid device da33 with new device already found rejected flags 0x21a refcount 2
daasync: Unable to attach to new device due to status 0x6
In this case, 'ref src 0x8' means that GEOM has not called back into
the da(4) driver to tell it that the device has gone away. To determine
whether devfs has failed to call back into GEOM, look at the two debugging
printfs I put in GEOM: one when we call destroy_dev_sched_cb() and one when
we get the callback from devfs. If the first appears without the second,
devfs hasn't called back.
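When many devices are arriving and departing at once, picking these lines out of the console log by hand gets tedious. The following is a minimal sketch of a triage helper, assuming the 'ref src' message format shown above and assuming that the 'ref src' value is a bitmask. The function and table names are hypothetical, and the only bit it names is 0x8 (GEOM), since that is the only one described here.

```python
import re

# Matches the debug line shown above, e.g.
#   da33: ref src 0x8, refcount 2, softc refcount 0
REF_LINE = re.compile(
    r"(?P<dev>da\d+): ref src (?P<src>0x[0-9a-fA-F]+), "
    r"refcount (?P<refs>\d+), softc refcount (?P<softc>\d+)"
)

# Only the 0x8 bit is described in this report; other bits stay unnamed.
KNOWN_SOURCES = {0x8: "GEOM"}


def parse_ref_line(line):
    """Return a dict describing one 'ref src' line, or None if no match."""
    m = REF_LINE.search(line)
    if m is None:
        return None
    mask = int(m.group("src"), 16)
    holders = []
    bit = 1
    # Walk the mask one bit at a time (assumption: 'ref src' is a bitmask).
    while bit <= mask:
        if mask & bit:
            holders.append(KNOWN_SOURCES.get(bit, "unknown(0x%x)" % bit))
        bit <<= 1
    return {
        "device": m.group("dev"),
        "mask": mask,
        "refcount": int(m.group("refs")),
        "softc_refcount": int(m.group("softc")),
        "holders": holders,
    }
```

Feeding each console line through parse_ref_line() and keeping the results whose holders include "GEOM" gives a quick list of devices that are stuck waiting on the GEOM callback.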
The patch against head fixes some arrival and departure issues in CAM.
Without those fixes (especially the reference around allocation), you'll
likely hit other issues before you run into this particular one.
The patch also includes debugging for the da(4) driver to track which
specific references have been acquired and released, and print status so
that when a device returns, we know which reference(s) to the old device
have not been released.
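As a rough illustration of the idea (this is a userland model, not the code in the patch), per-source reference tracking can be thought of as a bitmask kept alongside the ordinary count: acquiring sets the bit for the caller's source, releasing clears it, and whatever is still set when the device reappears names the holders. The class, method and constant names below are hypothetical, SRC_OPEN is an invented value, and the sketch assumes at most one outstanding reference per source.

```python
# Hypothetical source identifiers. 0x8 for GEOM matches the message in this
# report; SRC_OPEN is invented purely for illustration.
SRC_OPEN = 0x1
SRC_GEOM = 0x8


class RefTracker:
    """Userland model of a refcount that remembers who holds each reference."""

    def __init__(self, name):
        self.name = name
        self.refcount = 0   # tracked references only
        self.sources = 0    # bitmask of sources currently holding a reference

    def acquire(self, src):
        # Assumption: each source holds at most one reference at a time.
        if self.sources & src:
            raise ValueError("source 0x%x already holds a reference" % src)
        self.sources |= src
        self.refcount += 1

    def release(self, src):
        # A release from a source that holds nothing is a bug worth flagging.
        if not self.sources & src:
            raise ValueError("source 0x%x holds no reference" % src)
        self.sources &= ~src
        self.refcount -= 1

    def outstanding(self):
        """Return the bitmask of sources that have not released yet."""
        return self.sources
```

In this model, if the open path releases its reference but GEOM never does, outstanding() returns 0x8, which is the same kind of information the 'ref src' message carries.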
To reproduce the problem, I have been using a Supermicro server with a 6Gb
LSI SAS controller, two expanders and 30+ drives. At a minimum, though, the
test requires a SAS controller, an expander and a few drives.
I have not been running the devad2 test under the test framework, but rather with a one-off script that I'll try to upload separately.
I need help figuring out why devfs sometimes fails to call back into GEOM to complete device destruction.