Page MenuHomeFreeBSD

set kdb_why to "trap" when calling kdb_trap from trap_fatal
ClosedPublic

Authored by avg on Apr 16 2018, 1:32 PM.
Tags
None
Referenced Files
Unknown Object (File)
Tue, Dec 10, 5:43 AM
Unknown Object (File)
Fri, Dec 6, 4:02 PM
Unknown Object (File)
Mon, Nov 25, 7:58 PM
Unknown Object (File)
Nov 22 2024, 4:33 PM
Unknown Object (File)
Nov 16 2024, 6:06 AM
Unknown Object (File)
Nov 11 2024, 2:58 PM
Unknown Object (File)
Oct 30 2024, 3:20 PM
Unknown Object (File)
Oct 29 2024, 5:28 PM
Subscribers

Details

Summary

This will allow to hook a ddb script to "kdb.enter.trap" event.
Previously there was no specific name for this event, so it could only
be handled by either "kdb.enter.unknown" or "kdb.enter.default" hooks.
Both are very unspecific.

Having a specific event is useful because the fatal trap condition is
very similar to panic but it has an additional property that the current
stack frame is the frame where the trap occurred. So, both a register
dump and a stack bottom dump have additional information that can help
analyze the problem.

I have added the event only on architectures that have trap_fatal()
function defined. I haven't looked at other architectures.
Their maintainers can add support for the event later.

Sample script:
kdb.enter.trap=bt; show reg; x/aS $rsp,20; x/agx $rsp,20

Test Plan

Tested on amd64.

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Lint Passed
Unit
No Test Coverage
Build Status
Buildable 16213
Build 16164: arc lint + arc unit

Event Timeline

kib added inline comments.
sys/amd64/amd64/trap.c
819

Please move locals at the start of the function (under #ifdef). Also, one line is enough, they are both bool.

This revision is now accepted and ready to land.Apr 16 2018, 1:54 PM

style change suggested by kib

This revision now requires review to proceed.Apr 16 2018, 2:13 PM
This revision is now accepted and ready to land.Apr 16 2018, 2:13 PM

fix up the previous update, new variables should be under KDB

This revision now requires review to proceed.Apr 16 2018, 2:19 PM
sys/powerpc/powerpc/trap.c
460

The renter logic is probably excessive since kdb_trap() just returns immediately. I'd argue that we should fix the calling code here to only call kdb_trap if !kdb_active. That would permit using a simpler code flow:

if (debugger_on_panic && kdb_active == 0) {
    kdb_why = KDB_WHY_TRAP;
    handled = kdb_trap(...);
    kdb_why = KDB_WHY_UNSET;
    if (handled)
        return;
}

BTW, I wonder if this feature is why one can get stuck sometimes where 'c' from kdb keeps reporting the same panic (because handled is true so we retry the same instruction and never get around to actually panic'ing and writing out the dump)? In that case it seems like a bit of a misfeature.

sys/powerpc/powerpc/trap.c
460

Thank you for pointing this out. I didn't pay attention to kdb_trap vs kdb_active.
From a further code inspection it seems that kdb_active should never be set in this code path.
If a trap happens while already in kdb, then a check in trap() should catch that and divert to kdb_reenter. And if we were not in kdb and kdb entry was initiated in a different context, then kdb_active is set only after the world is frozen (via stop_cpus_hard).
So, I do not see how kdb_active could possibly be set when this code runs.
I guess I'll just remove the check?

sys/powerpc/powerpc/trap.c
460

BTW, I wonder if this feature is why one can get stuck sometimes where 'c' from kdb keeps reporting the same panic (because handled is true so we retry the same instruction and never get around to actually panic'ing and writing out the dump)?

I am not sure if that's directly related.
Actually, the current behavior seems reasonable.
What would you expect kdb to do differently in that scenario?

sys/powerpc/powerpc/trap.c
460

Normally if you call panic() from anywhere else, kdb_trap is invoked as a result of a breakpoint() in kdb_enter(). When the trap handler returns after kdb_trap() returns, execution resumes post-breakpoint so it continues inside of panic() and writes out the dump.

However, if you have a fatal trap, then kdb is entered here before panic is called. If you continue from here, the trap handler returns without calling panic and just retries the instruction so you are stuck in an infinite loop with no way to break out unless you manually modify $PC in ddb before continuing or some other evil. You have to explicitly call 'dump' to get a working dump unlike every other panic() in the kernel. That is what I find unreasonable and inconsistent. Just compare what happens if you do 'sysctl debug.kdb.panic=1' and then continue with doing 'sysctl debug.kdb.trap=1'.

sys/powerpc/powerpc/trap.c
460

I see now... But what if, while in kdb, you modified memory and fixed the cause the trap and you want 'continue' to actually continue and not panic? Quite far fetched, I know.

In any case, if we want to change / fix that behavior that would be a separate change.
I'll handle the kdb_active check in this review.

sys/powerpc/powerpc/trap.c
460

Yes, I think whether or not to keep the kdb_trap is a topic for a different review. My thoughts on that are:

I tried having a more generic version of that via 'options RESTARTABLE_PANICS' but I ended up removing it as I didn't really find any compelling use cases. Of course, there is the really gross kassert_panic() hack in the tree instead but it just blindly continues rather than requiring user intervention.

In practice, I don't know that anyone is really going to be fixing fatal faults on the fly in ddb, so I would probably favor just removing it. I think if we keep it we should add a separate knob to enable it's use (debugger_on_trap or the like) instead of overloading debugger_on_panic so that the default behavior out of the box is consistent. If someone wants to try to fix fatal traps in ddb, they can enable the new sysctl and deal with the unusual behavior.

Don't check for kdb reentry in trap_fatal(), it's impossible.
trap() checks for it earlier and calls kdb_reentry().

rebase on the latest head

This revision is now accepted and ready to land.Apr 18 2018, 7:19 PM
This revision was automatically updated to reflect the committed changes.