Set curthread->td_stopsched when entering kdb via any vector.
Previously, it was only set when entering via panic, so when
entering kdb another way, mutexes and such were still "live",
and an attempt to lock an already locked mutex would panic.
Details
It compiled, with and without SMP. I haven't tested it yet.
Diff Detail
- Repository
- rS FreeBSD src repository - subversion
- Lint
Lint Not Applicable - Unit
Tests Not Applicable
Event Timeline
Well, when you are trying to acquire an already locked mutex, you typically accessing some resource which state is inconsistent. You in fact trade the panic into undefined behavior and potentially corrupting the state, which means that the return from ddb to running system is unsafe.
I understand the motivation of your change. Typical answer is that ddb should not call any function that requires locks.
I very rarely return from ddb into the live system, so typically your change would be an improvement for my uses.
I totally agree. I raised many of these points in an internal discussion at Isilon. But as you say, it's probably a net-win. Thanks for the review.
Did a SMP and UP build. Ran a few kdb commands followed by "continue". Tested "dump" with both builds. No problems seen.
Well, it wouldn't panic in ddb itself as ddb catches any panics and reenters. However, it does mean that while you may fail to lock get mutex, I think you can end up unlocking the mutex (e.g. if you panicked while holding a mutex and called a DDB function that tried to recurse and the mutex was not marked as recursive). However, non-panic entries into DDB probably don't hold any locks (at least existing ones).
sysctl entry locks Giant. I added ddb functions lockgiant and unlockgiant. They worked, but when I left ddb, it panicked because sysctl tried to unlock Giant, but sysctl didn't own the lock.
I still think this is a net-win, but I'm open to argument.
Another option is to add a stopsched ddb command that sets curthread->td_stopsched = 1. The user could set it, as needed, when they are willing to corrupt lock state because they do not plan to exit ddb. The problem is, the user would have to know that it exists.
IMO once you enter ddb at runtime and do arbitrary debugging commands, any expectation that the system must behave as if you hadn’t entered ddb go out the window. So in that sense, I don’t think this introduces a regression.
That said, I am supportive of any reasonable hardening against accidental foot shooting.
I plan to commit this soon, unless there are any objections. (I don't see any above.)
I don't quite agree with that, btw. I made the ddb 'kill' command use a try-lock on purpose so you have a way on the console to kill a process gracefully if you are so hosed you can't fork new processes. (And I've used this on multiple occasions).
To be clear though, I'm not suggesting to not commit the change, but just breaking into ddb to examine state shouldn't break anything (and I think this doesn't change that, per se).
I'm surprised you got a panic in DDB just trying to run a command though for a non-panic entry. It seems like it is a poorly written command in that case.
It's an Isilon command. It uses non-DDB code that locks a mutex. In that sense, yes, it's poorly written, but I'd rather relax that notion and allow mutexes (and therefore more "normal", non-DDB) code to be used. Ideally, there would be DDB code for everything we need; in practice, of course, there isn't.
Thanks for the feedback, @jhb.