Page MenuHomeFreeBSD

cam_sim: Shift to using refcount(9)
AbandonedPublic

Authored by imp on May 22 2021, 2:00 AM.
Tags
None
Referenced Files
Unknown Object (File)
Dec 24 2023, 4:20 PM
Unknown Object (File)
Nov 17 2023, 8:46 PM
Unknown Object (File)
Nov 17 2023, 8:42 PM
Unknown Object (File)
Nov 17 2023, 8:35 PM
Unknown Object (File)
Jun 26 2023, 11:30 PM
Unknown Object (File)
Jun 16 2023, 3:39 AM
Subscribers
None

Details

Reviewers
mav
scottl
ken
Group Reviewers
cam
Summary

Use refcount(9) to manage references. Partially tear down the sim in
cam_sim_free so that all further calls to sim_action and sim_poll fail.
Set the mtx = NULL so no further locking is done for this sim. Document
the contract with the sim better, especially for pending requests that
come in prior to this call. cam_sim_free() is no longer blocking, but as
far as the client SIM is concerned, "sim" has been disposed of and the
SIM no longer owns "sim" memory. This also eliminates the
cam_sim_free_mtx for those SIMs that have to registered lock.

Sponsored by: Netflix

Test Plan

So far I've only lightly tested this patch. I plan on doing a lot
more testing to ensure that I can fail an nvme controller
under load and have this not only work, but also not hit
a race in that path today that I kludged up a fix for
by running multiple threads, but this might be better.

@mav this is a simpler version of the review I shared
earlier that I'm going to abandon. This has all the other
extra stuff removed.

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Lint Passed
Unit
No Test Coverage
Build Status
Buildable 39353
Build 36242: arc lint + arc unit

Event Timeline

imp requested review of this revision.May 22 2021, 2:00 AM
imp added reviewers: mav, scottl, ken.
imp edited the test plan for this revision. (Show Details)
sys/cam/cam_sim.c
207

What would be the realistic use of free_devq == 0, considering that cam_sim_free() does not wait any more?

209

For SIMs without mutex calls to those functions is not protected. This is racy.

sys/cam/cam_sim.c
207

The devq is still in use until all refs go to 0.

209

Such sims already have to deal with requests that come in while cam_sim_free is running before the recount drops to 0. So the race is smaller in this case.

sys/cam/cam_sim.c
207

If SIM no longer owns SIM memory, how can it track the references?

209

Previously requests could come in while the cam_sim_free() is still running, but with this change they can come in even when it is completed. Mandatory request counting on the SIM side does not make it easier, plus it is also impossible to do right due to lack of locking on sim_action() call. sim_action() will be called with SIM pointer that it is no longer allowed to use.

sys/cam/cam_sim.c
209

Previously requests could come in while the cam_sim_free() is still running, but with this change they can come in even when it is completed.

While I may need some atomic here to make it visible to other threads, I'm changing the action pointers to not call into the sim anymore so once cam_sim_free returns, no more calls into this SIM will happen.

Mandatory request counting on the SIM side does not make it easier, plus it is also impossible to do right due to lack of locking on sim_action() call. sim_action() will be called with SIM pointer that it is no longer allowed to use.

The sim pointer will be valid until all references are gone, so there is no read after free issues. I'll talk to markj to see if we need to worry about xpt_action reading the sim_action, then that thread rescheduling before that call is made. It's an interesting race I'll need to think about.

sys/cam/cam_sim.c
209

We have only one (nvme) driver in the tree that sets mtx to NULL. If it were to set softc == NULL before calling cam_sim_free()
and insta-failing the request if sc == NULL in sim_action() that would close this race completely.

For other drivers that call cam_sim_free() with the mtx locked, I'm not quite sure what to say, though. That seems to be the harder case since today we drop the lock and wait for the ref count to drop to 0 synchronously.

I'm still working through the right way to make sure that the entries in these queues get freed when there's pending I/O when cam_sim_free() is called. I think the answer to that might inform this race too.

So I can make the mtx == NULL case work. nvme is the only one in the tree and I just set a 'dead function' pointer for action and poll. There will be no threads blocked on mtx, and no new entries.

However, the mtx != NULL case can't work w/o the wakeups / waits given the current structure. If there's any thread blocking on the lock we now hold to do something, we'll get odd results. Many of the sims I surveyed will unlock the mtx, then proceed to destroy it (either immediately or eventually). And there's a race now between the holding of the lock and the destroying of the lock, and there's many asserts that seem likely to fire and or have this end in various bad ways. So at least some redesign and/or retaining the wait loop is needed. I could also introduce a cam_sim_free_now_wait, but that troubles me as well in ways I can't quite articulate. The right way to go isn't clear yet. As such, I'm going to abandon this iteration.