Always stop the scheduler when entering kdb
ClosedPublic
Actions

Authored by vangyzen on Oct 24 2018, 4:57 PM.

Details

Reviewers

kib
emaste
markj
jhb
cem

Commits

rS339917: Always stop the scheduler when entering kdb

Summary

Set curthread->td_stopsched when entering kdb via any vector.
Previously, it was only set when entering via panic, so when
entering kdb another way, mutexes and such were still "live",
and an attempt to lock an already locked mutex would panic.

Test Plan

It compiled, with and without SMP. I haven't tested it yet.

Diff Detail

Lint

Lint Passed

Unit

No Test Coverage

Build Status

Buildable 20396
Build 19836: arc lint + arc unit

Event Timeline

vangyzen created this revision.Oct 24 2018, 4:57 PM

Herald added subscribers: dab, badger. · View Herald TranscriptOct 24 2018, 4:57 PM

Harbormaster completed remote builds in B20396: Diff 49565.Oct 24 2018, 4:57 PM

make unset symmetric; academic, but good principle

Harbormaster completed remote builds in B20397: Diff 49566.Oct 24 2018, 5:01 PM

cem added reviewers: kib, emaste, markj.Oct 24 2018, 5:25 PM

cem added a subscriber: cem.

Well, when you are trying to acquire an already locked mutex, you typically accessing some resource which state is inconsistent. You in fact trade the panic into undefined behavior and potentially corrupting the state, which means that the return from ddb to running system is unsafe.

I understand the motivation of your change. Typical answer is that ddb should not call any function that requires locks.

I very rarely return from ddb into the live system, so typically your change would be an improvement for my uses.

This revision is now accepted and ready to land.Oct 24 2018, 6:50 PM

kib added a reviewer: jhb.Oct 24 2018, 6:51 PM

In D17687#377924, @kib wrote:

Well, when you are trying to acquire an already locked mutex, you typically accessing some resource which state is inconsistent. You in fact trade the panic into undefined behavior and potentially corrupting the state, which means that the return from ddb to running system is unsafe.

I understand the motivation of your change. Typical answer is that ddb should not call any function that requires locks.

I very rarely return from ddb into the live system, so typically your change would be an improvement for my uses.

I totally agree. I raised many of these points in an internal discussion at Isilon. But as you say, it's probably a net-win. Thanks for the review.

zeising added a subscriber: zeising.Oct 25 2018, 12:51 PM

Did a SMP and UP build. Ran a few kdb commands followed by "continue". Tested "dump" with both builds. No problems seen.

Well, it wouldn't panic in ddb itself as ddb catches any panics and reenters. However, it does mean that while you may fail to lock get mutex, I think you can end up unlocking the mutex (e.g. if you panicked while holding a mutex and called a DDB function that tried to recurse and the mutex was not marked as recursive). However, non-panic entries into DDB probably don't hold any locks (at least existing ones).

In D17687#378257, @jhb wrote:

Well, it wouldn't panic in ddb itself as ddb catches any panics and reenters. However, it does mean that while you may fail to lock get mutex, I think you can end up unlocking the mutex (e.g. if you panicked while holding a mutex and called a DDB function that tried to recurse and the mutex was not marked as recursive). However, non-panic entries into DDB probably don't hold any locks (at least existing ones).

sysctl entry locks Giant. I added ddb functions lockgiant and unlockgiant. They worked, but when I left ddb, it panicked because sysctl tried to unlock Giant, but sysctl didn't own the lock.

I still think this is a net-win, but I'm open to argument.

Another option is to add a stopsched ddb command that sets curthread->td_stopsched = 1. The user could set it, as needed, when they are willing to corrupt lock state because they do not plan to exit ddb. The problem is, the user would have to know that it exists.

IMO once you enter ddb at runtime and do arbitrary debugging commands, any expectation that the system must behave as if you hadn’t entered ddb go out the window. So in that sense, I don’t think this introduces a regression.

That said, I am supportive of any reasonable hardening against accidental foot shooting.

I plan to commit this soon, unless there are any objections. (I don't see any above.)

In D17687#378266, @cem wrote:

IMO once you enter ddb at runtime and do arbitrary debugging commands, any expectation that the system must behave as if you hadn’t entered ddb go out the window. So in that sense, I don’t think this introduces a regression.

I don't quite agree with that, btw. I made the ddb 'kill' command use a try-lock on purpose so you have a way on the console to kill a process gracefully if you are so hosed you can't fork new processes. (And I've used this on multiple occasions).

To be clear though, I'm not suggesting to not commit the change, but just breaking into ddb to examine state shouldn't break anything (and I think this doesn't change that, per se).

I'm surprised you got a panic in DDB just trying to run a command though for a non-panic entry. It seems like it is a poorly written command in that case.

It's an Isilon command. It uses non-DDB code that locks a mutex. In that sense, yes, it's poorly written, but I'd rather relax that notion and allow mutexes (and therefore more "normal", non-DDB) code to be used. Ideally, there would be DDB code for everything we need; in practice, of course, there isn't.

Thanks for the feedback, @jhb.

Closed by commit rS339917: Always stop the scheduler when entering kdb (authored by vangyzen). · Explain WhyOct 30 2018, 2:54 PM

This revision was automatically updated to reflect the committed changes.

Herald added a subscriber: imp. · View Herald TranscriptOct 30 2018, 2:54 PM

wulf mentioned this in D15070: evdev - special lock-less mode for keyboard drivers.Oct 31 2018, 6:50 PM