CAM arrival and departure fixes and extra debugging
Needs Review (Public)

Authored by ken on Oct 7 2016, 5:38 PM.

Details

Summary

There are a number of outstanding issues with device arrival and departure in CAM. This is intended to be a starting point towards fixing them.

These diffs fix issues in the da(4) driver and CAM in general, and add extra debugging for tracking down missing reference releases in the da(4) driver. (The extra debugging isn't intended to go in the tree as-is.)

You'll see messages like this:

da33: ref src 0x8, refcount 2, softc refcount 0
cam_periph_alloc: attempt to re-allocate invalid device da33 with new device already found rejected flags 0x21a refcount 2
daasync: Unable to attach to new device due to status 0x6

In this case, 'ref src 0x8' means that GEOM has not called back into
the da(4) driver to tell it that the device has gone away. To determine
whether devfs has failed to call back into GEOM, look at the two debugging
printfs I put in GEOM: one shows when we call destroy_dev_sched_cb(), and
the other shows when we get the callback from devfs. If the first message
appears without the second, devfs hasn't called back.
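
The pairing logic described above can be sketched in plain userland C. This is only a model of the reasoning, not code from the patch: the struct and function names (destroy_trace, trace_scheduled, trace_callback, trace_is_stuck, trace_count_stuck) are hypothetical, and each flag stands in for one of the two debugging printfs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical per-device trace. Each flag models one of the two
 * debugging printfs described in the summary.
 */
struct destroy_trace {
	bool scheduled;		/* saw the "called destroy_dev_sched_cb()" message */
	bool called_back;	/* saw the "callback from devfs" message */
};

/* Record that the destroy was scheduled for this device. */
static void
trace_scheduled(struct destroy_trace *t)
{
	assert(t != NULL);
	t->scheduled = true;
}

/* Record that devfs called back for this device. */
static void
trace_callback(struct destroy_trace *t)
{
	assert(t != NULL);
	t->called_back = true;
}

/* True when the destroy was scheduled but the callback never arrived. */
static bool
trace_is_stuck(const struct destroy_trace *t)
{
	assert(t != NULL);
	return (t->scheduled && !t->called_back);
}

/* Count the devices in the array whose callback is still missing. */
static size_t
trace_count_stuck(const struct destroy_trace *t, size_t n)
{
	size_t i, stuck = 0;

	for (i = 0; i < n; i++) {
		if (trace_is_stuck(&t[i]))
			stuck++;
	}
	return (stuck);
}
```

A device for which only trace_scheduled() was ever recorded is the case of interest here: the first message without the second.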

The patch against head fixes some arrival and departure issues in CAM.
Without those fixes (especially the reference around allocation), you'll
likely hit other issues before you run into this particular issue.

The patch also includes debugging for the da(4) driver to track which
specific references have been acquired and released, and print status so
that when a device returns, we know which reference(s) to the old device
have not been released.
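
As a rough illustration of this kind of per-source reference tracking, here is a minimal userland C sketch. It is not the code in the patch. Only the meaning of 0x8 (GEOM) comes from the summary; the other flag names and values, the ref_track structure, and the helper functions are hypothetical, and the sketch assumes at most one reference per source.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical reference sources. Only 0x8 meaning "GEOM" is taken
 * from the summary; the remaining names and values are illustrative.
 */
enum ref_src {
	REF_SRC_OPEN	= 0x1,	/* hypothetical */
	REF_SRC_SYSCTL	= 0x2,	/* hypothetical */
	REF_SRC_TIMER	= 0x4,	/* hypothetical */
	REF_SRC_GEOM	= 0x8	/* per the summary: held until GEOM calls back */
};

struct ref_track {
	uint32_t src_mask;	/* bit is set while that source holds a reference */
	int	 refcount;	/* number of tracked references held */
};

/* Take a reference on behalf of one source and remember who took it. */
static void
ref_acquire(struct ref_track *rt, enum ref_src src)
{
	assert((rt->src_mask & (uint32_t)src) == 0);	/* one ref per source here */
	rt->src_mask |= (uint32_t)src;
	rt->refcount++;
}

/* Drop the reference held by one source and clear its bit. */
static void
ref_release(struct ref_track *rt, enum ref_src src)
{
	assert((rt->src_mask & (uint32_t)src) != 0);
	rt->src_mask &= ~(uint32_t)src;
	rt->refcount--;
}

/* Format the outstanding sources, loosely modeled on the message above. */
static int
ref_report(const struct ref_track *rt, const char *name, char *buf, size_t len)
{
	return (snprintf(buf, len, "%s: ref src 0x%x, refcount %d",
	    name, (unsigned int)rt->src_mask, rt->refcount));
}
```

With this layout, whatever bits remain set in src_mask when the device returns identify the sources that never released their reference.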

To generate the problem, I have been using a Supermicro server with a 6Gb
LSI SAS controller, two expanders and 30+ drives. At a minimum, though, it
requires a SAS controller and expander and a few drives to run the test.

I have not been running the devad2 test under the test framework, but rather with a one-off script that I'll try to upload separately.

I need help figuring out why devfs isn't calling back into GEOM to complete device destruction at times.

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Lint Skipped
Unit
Tests Skipped
Build Status
Buildable 5497
Build 5722: CI src build (Jenkins)

Event Timeline

ken retitled this revision to CAM arrival and departure fixes and extra debugging.
ken updated this object.
ken edited the test plan for this revision.
ken added reviewers: imp, hselasky, kib, mav, asomers, scottl.
ken set the repository for this revision to rS FreeBSD src repository - subversion.

Updated the patch to add the devad2 test.

I have not yet made sure that devad2 builds in the overall build framework. It does build on its own. As a test, it should be run only on a system with the right hardware and with the right disks selected to be part of the test.

Is the missed callback on destroy the only problem you see?

Could you try to simplify the test? Ideally, you would provide a sample cdev-only driver that demonstrates the supposed destroy_dev_sched_cb(9) problem.

I am asking you to provide the test instead of trying to write it myself because I really do not know CAM and GEOM, and I do not know what patterns of interaction with the cdev config code are used there. BTW, during the previous round of discussion of devad2, pho and I hunted for devfs issues and fixed at least one hard problem in the VFS/devfs interaction (see r294204 and r294205), as well as several bugs in in-tree devfs users. I do not believe this is relevant to your report, but we did not find anything else.