Page MenuHomeFreeBSD

cam_periph_ccbwait could return while ccb in progress
ClosedPublic

Authored by rlibby on Sep 23 2016, 11:23 PM.
Tags
None
Referenced Files
Unknown Object (File)
Fri, Nov 22, 3:40 PM
Unknown Object (File)
Mon, Nov 18, 8:33 PM
Unknown Object (File)
Mon, Nov 18, 7:10 PM
Unknown Object (File)
Sun, Nov 17, 4:09 PM
Unknown Object (File)
Sun, Nov 17, 2:13 PM
Unknown Object (File)
Sun, Nov 17, 2:07 PM
Unknown Object (File)
Sun, Nov 17, 12:30 PM
Unknown Object (File)
Sun, Nov 17, 11:04 AM
Subscribers

Details

Summary

In cam_periph_runccb, cam_periph_ccbwait was using the value of the ccb
pinfo.index and status fields to determine whether the ccb was done,
but these fields are updated without a contending lock and could glitch
into states that would be erroneously interpreted as done. Instead,
have cam_periph_ccbwait look for the explicit result of the function
cam_periph_done.

We ran into this with a misconfiguration that caused us to take many
iterations through the cam_periph_runccb ERESTART loop. Eventually we
would hit the race and end up executing both the error routine and
xpt_done_process concurrently for the same ccb.

Test Plan

make buildkernel KERNCONF=GENERIC
kyua test -k /usr/tests/sys/Kyuafile

Additionally I have re-run the scenario that was expressing the race
with this patch applied to Isilon's tree.

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Lint Passed
Unit
No Test Coverage
Build Status
Buildable 5300
Build 5481: arc lint + arc unit

Event Timeline

rlibby retitled this revision from to cam_periph_ccbwait could return while ccb in progress.
rlibby updated this object.
rlibby edited the test plan for this revision. (Show Details)
rlibby added reviewers: markj, mav.
rlibby added a subscriber: jhb.

I agree that there is a problem, though not sure it is not a problem of bad cam_periph_ccbwait() design. Your patch looks acceptable, but I don't very like proposed transient parameter. I think code would look much better if you pass explicit state value instead of meaningless TRUE/FALSE.

In D8020#166009, @mav wrote:

I agree that there is a problem, though not sure it is not a problem of bad cam_periph_ccbwait() design.

Yes, it's awkward due to the many locks involved. The most direct approach would have been to have cam_periph_ccbwait take an appropriate lock--but I don't think there is any such lock: the ccb index and status are not protected by any one lock. In the glitch case addressed here, the other thread holds the sendq lock while updating the index, but in other contexts index is updated under other locks--or none at all. Additionally, I think we are already holding and sleeping with the path/periph lock here.

I also have a concern about whether a similar race can lead us to miss a wakeup with CAM_UNLOCKED. But that would be a separate issue, and I'm not actually sure it's possible.

Your patch looks acceptable, but I don't very like proposed transient parameter. I think code would look much better if you pass explicit state value instead of meaningless TRUE/FALSE.

Will do.

rlibby edited edge metadata.

mav feedback: supply new index directly instead of via bool switch

sys/cam/cam_queue.c
174–177

Shouldn't we be doing writes with the appropriate acquire / release semantics instead?

sys/cam/cam_queue.c
174–177

Yes, I suppose so. It would be "in addition to" rather than "instead", right?

New draft on the way, please let me know what you think.

Added acquire/release semantics around done signalling as mentioned by imp.
I'm on x86-64 so I don't think I can do meaningful testing of this.

sys/cam/cam_periph.c
988 ↗(On Diff #20695)

I'm not sure this is doing much of anything. C already treats calls to functions as sequence points IIRC, so a compiler barrier is pointless here as it is effectively already applied in the caller. Also, this expects xpt_path_mtx (never NULL) to always be locked. Is this function used by drivers that set CAM_UNLOCKED? If not, then the existing locking in xpt_done_process() already protects this with no need for atomics. Also, acq/release only really work when used on the same variable to protect other variables. (And the acquire is generally used in a loop while waiting for the write issued with 'rel' to become visible to the other CPU.) At a minimum you would have to use 'atomic_load_acq_int()' on the read of pinfo.index, and you'd perhaps have to use a loop instead of single if. However, my guess is you probably don't need the barriers at all.

sys/cam/cam_periph.c
988 ↗(On Diff #20695)

Unfortunately I'm afraid I don't know enough about CAM to settle this.

C already treats calls to functions as sequence points IIRC, so a compiler barrier is pointless here as it is effectively already applied in the caller.

Agree, but the atomic_thread_fenc_acq/rel also constrains the runtime memory operation ordering...

Is this function used by drivers that set CAM_UNLOCKED?

Unsure, will have to defer to CAM experts on that. I was operating under that impression, though (xpt_async, dastart).

Also, acq/release only really work when used on the same variable to protect other variables.

Yes, it is only meant to sync with the change of state of the index variable. Is

tmp = g_foo
atomic_thread_fence_acq()

not as strong as

tmp = atomic_load_acq_int(&g_foo)

?

I didn't do the atomic_load_acq_int() in the if condition because as you say, it is not currently a loop, so the acq doesn't happen after the xpt_path_sleep case, and I wasn't sure whether the sleep path would guarantee a read barrier.

At a minimum you would have to use 'atomic_load_acq_int()' on the read of pinfo.index, and you'd perhaps have to use a loop instead of single if.

I agree, this would make more sense to me as a loop on the done condition with an acq for the index variable. However, I wasn't sure that a loop was safe (is it possible some drives send a wakeup without modifying the checked conditions??). My impression is that looping is the right thing to do, but I didn't want to risk a regression there while getting this bug fix in.

(Additionally, the CAM_UNLOCKED case appears to me to be vulnerable to a missed wakeup (again, given as you say that it is actually in use), but that is the same regardless of whether we loop.)

What is your preference? Leave the atomic_store_rel_int in xpt_done_process and delete the atomic_thread_fence_acq here?

Since there's multiple paths through cam_ccbq_remove_ccb, which one causes the problem? When we short-circuit out, or in the depths of camq_remove()? What is the thing that's running concurrently that's causing cam_periph_ccbwait() to be called? Looks like it is only cam_periph_runccb(), which is called only on ccb's that we've allocated or outherwise have control of. How is this CCB both recently allocated and also being removed from a queue by a different thread?

sys/cam/cam_periph.c
988 ↗(On Diff #20695)

I don't think this is needed. If you want to constrain runtime memory reordering by the compiler there are better ways to do that.

Is this function used by drivers that set CAM_UNLOCKED?

Unsure, will have to defer to CAM experts on that. I was operating under that impression, though (xpt_async, dastart).

Shouldn't matter because CAM_UNLOCKED is set under the periph lock everywhere just prior to it being unlocked and submitted. So I'm not sure how this will help. Can you describe the exact scenario when it would be vulnerable to a missed wakeup?

I don't think that any driver sends a wakeup w/o modifying the conditions because the drivers don't modify pinfo.index, just cam_xpt.c and cam_queue.h. There might be a small issue, but if so, you should protect all uses in those files with atomics. Though in general, higher level locks already held are supposed to protect these data structures (we'll want to get Scott Long's, Justin Gibbs's or Ken Merry's second opinion on this, though). Otherwise, all kinds of concurrent mischief can occur. This might be the one field, though that crosses domains.

sys/cam/cam_queue.c
174–177

Instead of. Pick one set of memory semantics, not multiple.

In D8020#166471, @imp wrote:

Since there's multiple paths through cam_ccbq_remove_ccb, which one causes the problem? When we short-circuit out, or in the depths of camq_remove()?

My understanding of the bug condition that we encountered is that it must be the cam_ccbq_remove_ccb/camq_remove case. If we are removing from the extra queue, then it will have been an abort, and I know it wasn't because I added an assertion to the XPT_ABORT case to rule it out. Meanwhile xpt_run_devq pops the CAMQ_HEAD.

What is the thing that's running concurrently that's causing cam_periph_ccbwait() to be called? Looks like it is only cam_periph_runccb(), which is called only on ccb's that we've allocated or outherwise have control of. How is this CCB both recently allocated and also being removed from a queue by a different thread?

One thread may start a ccb and insert it into the ccbq, but another may process the queue and send it to the sim.

In our bug condition...

  • One thread ("X") creates the ccb and calls cam_periph_runccb and inside that cam_periph_ccbwait. In a usual case this may have pushed the ccb all the way to the sim,
  • But it may not, it may just leave it on the ccbq.
  • After X drops the devq->send_mtx another thread ("Y") may pick it up and xpt_run_devq() while X is still on its way to cam_periph_ccbwait (this part is inference, I didn't document it explicitly, but I believe it should be possible to document with ktr logging of ccb touches).
  • As part of xpt_run_devq the ccb has the transient CAM_UNQUEUED_INDEX (as described in summary at top).
  • Thread X observes index in the transient state.
  • Thread X pauses for some reason (interrupt?).
  • Thread Y sends the ccb to the sim, where something happens that changes its status (from CAM_REQ_INPROG to something else, CAM_REQUEUE_REQ in our case).
  • Thread X resumes and sees status is not CAM_REQ_INPROG, and therefore does not wait.
  • The status may also not be CAM_REQ_CMP...
  • Thread X interprets this state as done, and an error occurred, and calls error_routine.
  • But thread Y may still be working on the ccb, or it may be on a doneq, or it may be being processed by the xpt_done_td.
  • Further, the error routine itself may requeue the ccb. Due to another bug on our end, this was occurring and causing many iterations of the cam_periph_runccb ERESTART loop, greatly expanding opportunities to hit this race.

I was able to catch this in two ways, one by setting the following dtrace panic:

dtrace -w -n 'fbt::daerror:entry /args[0]->ccb_h.pinfo.index != -1/ {panic();}'

This found the ccb with an index of CAM_DONEQ_INDEX and being processed concurrently by xpt_done_process. I also added a similar assertion at the bottom of cam_periph_ccbwait with a similar result.

Ultimately this led to cam scribbling on pages backing 2k malloc buckets, causing bizarre panics elsewhere in the system, which was how we noticed it.

sys/cam/cam_periph.c
988 ↗(On Diff #20695)

Shouldn't matter because CAM_UNLOCKED is set under the periph lock everywhere just prior to it being unlocked and submitted. So I'm not sure how this will help.

I don't want to push on this because like I said I haven't put an effort into verifying that this actually happens. But here was my conception. With CAM_UNLOCKED, xpt_done_process does not hold the periph lock and signals "done" to cam_periph_ccbwait via the ccb index variable. This appears to allow memory writes in xpt_done_process (such as those to ccb_h->status which clear the CAM_DEV_QFRZN flag) to be reordered after memory reads in cam_periph_runccb (which read that flag). My understanding is that this won't actually happen on amd64 but is allowed by the generic memory model FreeBSD targets.

Can you describe the exact scenario when it would be vulnerable to a missed wakeup?

The missed wakeup would be

  • Thread X examines done conditions, and decides to go to sleep...
  • Concurrently (since we're CAM_UNLOCKED) xpt_done_process on another thread sets index to CAM_UNQUEUED_INDEX and sends a wakeup.
  • Thread X goes to sleep.

I haven't observed this, this is just code inspection, I'm not sure it actually occurs.

sys/cam/cam_queue.c
174–177

Ah, I think we are perhaps talking about two different problems here.

That cam_periph_ccbwait can see the index as CAM_UNQUEUED_INDEX during xpt_run_devq is an accidental glimpse of a temporary state that wasn't meant to communicate anything at all to another thread. My current understanding is that avoiding the glitch (the temporary visibility of index as CAM_UNQUEUED_INDEX during xpt_run_devq) is necessary in order to fix the bug. Or else cam_periph_ccbwait and xpt_run_devq would need to be made to take contending locks. I don't think acquire/release semantics can address this, that temporary state can still leak.

On the other hand the xpt_done_process -> cam_periph_ccbwait handoff, which is separate, is apparently done without a coordinating lock sometimes (with CAM_UNLOCKED). (This is maybe just a non-understanding on my part..?) What I thought you meant above was that this case should have acq/rel semantics since it is (apparently) sometimes unlocked, and that's what I tried to address in the third diff. I'm perfectly happy to drop this and go back to the second diff if that's your preference, I don't think this change by itself is necessary or sufficient to fix the bug I set out to fix. (It's a nop on amd64 anyway.)

As I have told before, and still think so after some more code looking, the whole approach current approach to cam_periph_ccbwait() implementation is wrong. CAM is event-driven, and from this point operation considered complete not when some status is changed, but when completion callback (ccb_h.cbfcnp) is called. In case of cam_periph_runccb() it supposed to be cam_periph_done() call. At this moment the last does nothing other then wakeup() call, but I think it should. On a good side, both calls to cam_periph_done() and cam_periph_ccbwait() are both running under the lock, and so they could safely check/set some CCB field.

Solving of above would limit pinfo.index use as state to only case of CCB aborting on stage of submission queue, and that queue has own lock (send_mtx), so that path does no need another locking or atomics. Attempt to set/check other statuses when CCB is not in some locked queue IMO can be used only as a hint, but barely as any reliable indicator for serious purposes.

Okay. I think your suggestion sounds good but I don't think I know enough about CAM to rewrite the whole current approach to cam_periph_ccbwait myself. Do you want me just to a file a bug in bugzilla for the scribbler and drop this patch then?

In D8020#166574, @rlibby_gmail.com wrote:

Okay. I think your suggestion sounds good but I don't think I know enough about CAM to rewrite the whole current approach to cam_periph_ccbwait myself. Do you want me just to a file a bug in bugzilla for the scribbler and drop this patch then?

I don't think it is in any way complicated. It is mostly about finding proper field in CCB to store the completion status. I just would not want you to invent extra complications doing some atomics magic, since that is IMO not the right way. I don't see much bad adding more states as you did in your original patch, if you wish, but later changes were IMO unneeded overkill.

In D8020#166581, @mav wrote:

I don't think it is in any way complicated. It is mostly about finding proper field in CCB to store the completion status. I just would not want you to invent extra complications doing some atomics magic, since that is IMO not the right way. I don't see much bad adding more states as you did in your original patch, if you wish, but later changes were IMO unneeded overkill.

I agree the acq/rel operations are not needed. I'd be happy to revert to diff 20667...

It is mostly about finding proper field in CCB to store the completion status.

Yeah, I see.

I guess one option is this: we could use the value of ccb_h->cbfcnp itself, as it is ours to manipulate in cam_periph_runccb. We would set it to NULL or some special value in cam_periph_done and then look for that in cam_periph_ccbwait. It feels a little hacky, but I think it has the desired semantics.

We could also define a new status flag, or a new ccb_hdr field, but those also seems a little hacky.

Any suggestions?

In D8020#166586, @rlibby_gmail.com wrote:
In D8020#166581, @mav wrote:

It is mostly about finding proper field in CCB to store the completion status.

Yeah, I see.

I guess one option is this: we could use the value of ccb_h->cbfcnp itself, as it is ours to manipulate in cam_periph_runccb. We would set it to NULL or some special value in cam_periph_done and then look for that in cam_periph_ccbwait. It feels a little hacky, but I think it has the desired semantics.

Here's an illustration of what that might look like. I haven't tested it at all yet.
https://github.com/rlibby/freebsd/commit/982e0af68273eddbeb3a1c7c1824e4b81cb15470

In D8020#166602, @rlibby_gmail.com wrote:
In D8020#166586, @rlibby_gmail.com wrote:
In D8020#166581, @mav wrote:

It is mostly about finding proper field in CCB to store the completion status.

Yeah, I see.

I guess one option is this: we could use the value of ccb_h->cbfcnp itself, as it is ours to manipulate in cam_periph_runccb. We would set it to NULL or some special value in cam_periph_done and then look for that in cam_periph_ccbwait. It feels a little hacky, but I think it has the desired semantics.

Here's an illustration of what that might look like. I haven't tested it at all yet.
https://github.com/rlibby/freebsd/commit/982e0af68273eddbeb3a1c7c1824e4b81cb15470

Fixed up to account for not calling the cbfcnp for non-XPT_FC_QUEUED ops:
https://github.com/rlibby/freebsd/commit/6c35cccbcef14e830a0758a6edcd34564a5f0d5a

In D8020#166663, @rlibby_gmail.com wrote:

Fixed up to account for not calling the cbfcnp for non-XPT_FC_QUEUED ops:
https://github.com/rlibby/freebsd/commit/6c35cccbcef14e830a0758a6edcd34564a5f0d5a

This would be good if not one more ugliness of CAM: during some error recovery scenarios camperiphscsisenseerror() tends to reuse CCB in place, replacing original CCB with recovery one, and replacing it back on recovery completion. I am not sure it worked properly with existing code, but with the proposed one it even more likely won't. I think this may work as quick workaround: "while (ccb->ccb_h.cbfcnp != cam_periph_done_panic)".

Okay, I will make that change and then update this diff soon. Testing on Isilon hit an issue (that may be on our side) so I'm going to check on that first.

In D8020#166780, @mav wrote:
In D8020#166663, @rlibby_gmail.com wrote:

Fixed up to account for not calling the cbfcnp for non-XPT_FC_QUEUED ops:
https://github.com/rlibby/freebsd/commit/6c35cccbcef14e830a0758a6edcd34564a5f0d5a

This would be good if not one more ugliness of CAM: during some error recovery scenarios camperiphscsisenseerror() tends to reuse CCB in place, replacing original CCB with recovery one, and replacing it back on recovery completion. I am not sure it worked properly with existing code, but with the proposed one it even more likely won't. I think this may work as quick workaround: "while (ccb->ccb_h.cbfcnp != cam_periph_done_panic)".

Perhaps we should outlaw the practice, since it is making strong assumptions that the CCB is large enough for its reuse, which in the future might not be true. We're long past the days where we can't just have a single CCB reserved for this purpose preallocated (error handling is necessarily single threaded, so this should be OK). How hard would it be to fix it to not do that?

Review feedback: have cam_periph_ccbwait check result of cam_periph_done

In D8020#166795, @imp wrote:
In D8020#166780, @mav wrote:

This would be good if not one more ugliness of CAM: during some error recovery scenarios camperiphscsisenseerror() tends to reuse CCB in place, replacing original CCB with recovery one, and replacing it back on recovery completion. I am not sure it worked properly with existing code, but with the proposed one it even more likely won't. I think this may work as quick workaround: "while (ccb->ccb_h.cbfcnp != cam_periph_done_panic)".

Perhaps we should outlaw the practice, since it is making strong assumptions that the CCB is large enough for its reuse, which in the future might not be true. We're long past the days where we can't just have a single CCB reserved for this purpose preallocated (error handling is necessarily single threaded, so this should be OK). How hard would it be to fix it to not do that?

I agree that we should stop doing that. I am not ready to say how easy it is, haven't looked that area much recently, but I am absolutely sure it should be done or at least reinvestigated.

mav edited edge metadata.

This revision looks good to me.

This revision is now accepted and ready to land.Sep 28 2016, 10:44 PM
This revision was automatically updated to reflect the committed changes.