Page MenuHomeFreeBSD

Defer ptracestop signals that cannot be immediately delivered
ClosedPublic

Authored by badger on Jan 20 2017, 4:51 PM.
Tags
None
Referenced Files
Unknown Object (File)
Thu, Nov 14, 3:06 PM
Unknown Object (File)
Fri, Nov 8, 7:59 AM
Unknown Object (File)
Mon, Oct 28, 2:39 PM
Unknown Object (File)
Oct 16 2024, 10:13 PM
Unknown Object (File)
Oct 10 2024, 6:29 AM
Unknown Object (File)
Oct 1 2024, 8:09 PM
Unknown Object (File)
Sep 30 2024, 11:08 PM
Unknown Object (File)
Sep 26 2024, 7:22 AM

Details

Summary

Related to: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=212607

ptracestop() calls other than the one in kern_sig.c discard any signal
requested by the debugger, such as the SIGKILL from a PT_KILL. In this patch,
I stash that signal and deliver it when we evetually pass through ast().
I haven't tested very much yet, I was just hoping for some feedback on the approach
at this point.

Test Plan

Gdb should not hang when a PT_KILL is requested from a ptracestop() call
other than the one in kern_sig.c (below, we use the syscall enter ptracestop):

$ gdb712 /bin/ls
...
Reading symbols from /bin/ls...Reading symbols from /usr/lib/debug//bin/ls.debug...done.
done.
(gdb) catch syscall read
Catchpoint 1 (syscall 'read' [3])
(gdb) r
Starting program: /bin/ls 

Catchpoint 1 (call to syscall read), 0x000000080061e3fa in _read () from /libexec/ld-elf.so.1
(gdb) q
A debugging session is active.

    Inferior 1 [process 2981] will be killed.

    Quit anyway? (y or n) y
$

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

badger retitled this revision from to Defer ptracestop signals that cannot be immediately delivered.
badger updated this object.
badger edited the test plan for this revision. (Show Details)
badger edited the test plan for this revision. (Show Details)
badger edited the test plan for this revision. (Show Details)

Mmm, I had instead been toying with a version that fixed the other callers to post the signal instead to ignoring the return value. (I had also written up some PT_KILL tests). I'm not tied to that solution (though if the tests are correct I'd like to add those). One other area that is "wrong" with PT_KILL is that we can continue to stop the process to report events that occur before the process is truly killed. That is why I haven't finalized my patch as I am still trying to convince myself the tests are correct and that PT_KILL is fully fixed now. In particular, I wouldn't want PT_KILL of a multithreaded application to report thread exit events for the other LWPs as they all exit as each of those need a PT_CONTINUE. Instead, I think PT_KILL probably needs to clear the ptrace() event mask. I also wasn't sure if we needed some of the PT_KILL cleanup to happen as a result of posting the signal so that, e.g. 'kill -9 <pid>' doesn't get hung if there are suspended threads.

My WIP is here: https://github.com/freebsd/freebsd/compare/master...bsdjhb:PT_KILL

I guess I hadn't actually started posting the signal yet, but instead was trying to figure out how the logic should work (e.g. does it need to explicitly ignore SIGTRAP and not just 0?)

jhb added a subscriber: kib.

(Added kib@ to get his input on this as well)

So in the places where there's a printf now ("Dropping signal from ptracestop") you'd tdsignal() (or some other variant) the return value of ptracestop()? Won't that cause a second, unexpected stop when the time comes to actually act on the signal? (the ptracestop() in kern_sig.c, which, in this review's patch, I skip if there's already a pending signal from the debugger)

I think that I have some preference to the patch posted in this review, over the solution where callers of ptracestop() handle further signals. But then, shouldn't ptracestop() clear p_xsig is the signal is posted ?

Also, what is the reason to return td_xsig() from ptracestop() after the patch ?

As I said previously, I'm not strongly tied to any particular solution (and was still thinking about what the right solution is). I'm quite happy to defer to kib@'s judgement on not losing signals from ptracestop(). That said, I do think the PT_KILL tests I've added are still relevant and it would be good to verify that those are fixed by this change. (I have some other tests that I need to add to ensure we don't get extra stops for thread exit events if we PT_KILL a multithreaded process as well, but that is orthogonal to this change)

In D9260#192516, @kib wrote:

I think that I have some preference to the patch posted in this review, over the solution where callers of ptracestop() handle further signals. But then, shouldn't ptracestop() clear p_xsig is the signal is posted ?

Hmm, because the signal has been deferred, but someone might examine p_xsig before it is delivered and act as if it had already been delivered? I'd thought that p_xsig would be overwritten before anything would use it to make decisions, but I may be wrong. I will look further.

Also, what is the reason to return td_xsig() from ptracestop() after the patch ?

td_xsig being set is used from issignal() to determine if ptracestop() previously wanted to deliver a signal but couldn't; we want to skip the stop in this case since the signal originated from the debugger. (Did I understand that question correctly?)

In D9260#193435, @jhb wrote:

As I said previously, I'm not strongly tied to any particular solution (and was still thinking about what the right solution is). I'm quite happy to defer to kib@'s judgement on not losing signals from ptracestop(). That said, I do think the PT_KILL tests I've added are still relevant and it would be good to verify that those are fixed by this change. (I have some other tests that I need to add to ensure we don't get extra stops for thread exit events if we PT_KILL a multithreaded process as well, but that is orthogonal to this change)

Okay. I will try your tests against this change.

In testing, I find that sending a signal from multiple ptracestop() calls before having a chance to deliver any signals does not work as I'd expect. I now think it would be best, if a ptracestop() call wants to send a new signal to the process but there is already a deferred signal, it should be replaced by the new one.

E.g., if you

  1. Hit a syscall entry hook
  2. Try to send a signal (it will be deferred)
  3. Hit a syscall return hook
  4. Try to send another signal (also will be deferred)

I expect only the signal from step 4 to sent to the process when it is continued. Any thoughts otherwise?

In D9260#192516, @kib wrote:

I think that I have some preference to the patch posted in this review, over the solution where callers of ptracestop() handle further signals. But then, shouldn't ptracestop() clear p_xsig is the signal is posted ?

Hmm, because the signal has been deferred, but someone might examine p_xsig before it is delivered and act as if it had already been delivered? I'd thought that p_xsig would be overwritten before anything would use it to make decisions, but I may be wrong. I will look further.

Also, what is the reason to return td_xsig() from ptracestop() after the patch ?

td_xsig being set is used from issignal() to determine if ptracestop() previously wanted to deliver a signal but couldn't; we want to skip the stop in this case since the signal originated from the debugger. (Did I understand that question correctly?)

I used bad wording in my questions, it seems. I mean, as a vague idea, to move the post-processing after ptracestop() in issignal() into ptracestop() itself, repeating it until td_xsig is cleared. The code which adds signal to the queue, and possibly calling tdnotify().

If ptracestop() enqueues the signals, then different signal requests from different ptrace(2) requests will be queued as well, answering your later question.

I did not experimented with this, so it may very well be not possible to do.

Take a different direction

  • Add ptrace requested signals to the signal queue, flagging them with a

new flag KSI_PTRACE to indicate they originated from ptrace. We won't
stop again for such a signal.

  • Write several new test cases to exercise the signal replacement code.
  • Put in some special logic to ensure PT_KILL is handled correctly.

John - your PT_KILL tests are in this diff since I've been running with them, but I'll remove them before commit to preserve provenance (unless you ask me otherwise).

Please include the tests in your change.

The extra case I am referring to is if the debugger sends a PT_KILL while other events like a thread birth event is still pending. Also, another instance of that is if you PT_KILL a process with two threads, you shouldn't get a stopevent for the first thread exiting before the final thread really exits due to sigexit(). I think we can add tests for that and fix those as a followup though.

sys/kern/kern_sig.c
2509 ↗(On Diff #25065)

s/interested/interesting/

2604 ↗(On Diff #25065)

Can probably drop this blank line.

sys/kern/sys_process.c
1136 ↗(On Diff #25065)

Perhaps this should be in the PT_KILL case before the 'goto sendsig'? I think we might end up needing other logic here such as clearing the ptrace event mask, though I'm fine with fixing that as a separate followup if so.

sys/sys/signalvar.h
240 ↗(On Diff #25065)

Tab after #define.

tests/sys/kern/ptrace_test.c
2420 ↗(On Diff #25065)

Perhaps a blank line here before the function definition?

sys/kern/sys_process.c
1136 ↗(On Diff #25065)

I was aiming to avoid differences between, e.g., PT_CONTINUE with SIGKILL and PT_KILL (the latter being described in the manpage as behaving "as if PT_CONTINUE had been used with SIGKILL given as the signal"). But now that you ask the question, I'm thinking there may be requests where this isn't appropriate.

Re: clearing ptrace event mask: tdsendsignal will do this once the thread in ptracestop() wakes up and runs it, but perhaps it could race with another thread that stops incorrectly before the mask is cleared. In fact, the more I think about the mask clearing in tdsendsignal, the less certain I am that it is correct at all. It will clear the ptrace mask if SIGKILL is sent (from anywhere), but in the case the signal was sent from some source other than ptrace, the debugger may later catch and discard the SIGKILL. I'm not sure how the mask would be restored in that case.

  • Style/spelling corrections
  • Style/spelling corrections
sys/kern/kern_sig.c
405 ↗(On Diff #25114)

Why did you added this condition ? What would happen in sigqueue_add() was unable to add the ksi from ptracestop() to the queue ?

2530 ↗(On Diff #25114)
if (si != NULL && ...
2593 ↗(On Diff #25114)

Again, please use si != NULL way to check for non-null pointer.

2598 ↗(On Diff #25114)

if (td->td_xsig != 0)

2608 ↗(On Diff #25114)

Shouldn't td_xsig cleared there and in previous if () body ?

sys/kern/kern_sig.c
405 ↗(On Diff #25114)

A signal in that case would be dropped (unless it's SIGKILL), the same as if you tried to sigqueue() on a full signal queue. If it can't be queued, the signal can't be identified as ptrace generated. This is to prevent ptracestops from signals that themselves were genereated by ptracestop.

2608 ↗(On Diff #25114)

I don't think there is any impact to not zero it; next time it's needed it will be reset to the appropriate value before another thread can look at it. That is, I don't see any path where td_xsig is not owned by this thread exclusively except when performing signal exchange.

  • Make comparisons against NULL/0 explicit
sys/kern/sys_process.c
1136 ↗(On Diff #25065)

I think perhaps something like this is actually desirable:

diff
diff --git a/sys/kern/kern_sig.c b/sys/kern/kern_sig.c
index ddde0dc..f03a36b 100644
--- a/sys/kern/kern_sig.c
+++ b/sys/kern/kern_sig.c
@@ -2527,7 +2527,7 @@ ptracestop(struct thread *td, int sig, ksiginfo_t *si)
 
        td->td_xsig = sig;
 
-       if (si != NULL && P_KILLED(p)) {
+       if (P_KILLED(p)) {
                /*
                 * Ensure that, if we've been PT_KILLed, the exit status
                 * reflects that. Another thread may also be in ptracestop(),
@@ -2535,6 +2535,7 @@ ptracestop(struct thread *td, int sig, ksiginfo_t *si)
                 * unsuspended first.
                 */
                td->td_xsig = SIGKILL;
+               p->p_ptevents = 0;
        } else if (si == NULL || (si->ksi_flags & KSI_PTRACE) == 0) {
                td->td_dbgflags |= TDB_XSIG;
                CTR4(KTR_PTRACE, "ptracestop: tid %d (pid %d) flags %#x sig %d",

That is, whichever thread hits the next ptracestop(), for whatever reason, will ensure a SIGKILL is queued up and then clear the stop mask so it can hurry off to handle the SIGKILL. The reason I think this is better than clearing p_ptevents in sys_process.c is that it should guarantee that the process exits with SIGKILL, which I think is useful for tests, if nothing else (I don't think gdb cares).

If p_ptevents is cleared here in sys_process.c, there is still a race between the thread that accepted the SIGKILL and exit(). Another thread may exit() normally before the SIGKILL is handled. The above diff should prevent it.

sys/kern/kern_sig.c
405 ↗(On Diff #25114)

Ideally, the ptrace(2) user should be informed about this condition. I understand that this is hard.

Another option would be to add something like sq_ptrace, where ptrace-requested signals can be safely added to bitmask in case memory allocation fails.

sys/kern/kern_sig.c
405 ↗(On Diff #25114)

Hmm. I think adding an sq_ptrace is probably correct. The only drawback I see is some added complexity. Informing the ptrace(2) user is not only hard, but I think adds a FreeBSD-only maintenance burden for debugger developers; I'm not aware of another OS that provides a mechanism for that.

sys/kern/kern_sig.c
405 ↗(On Diff #25114)

The mechanism to inform would be just ENOMEM error returned from the ptrace(2) call. I mentioned that it is hard because there is not an easy way to propagate the error back and to provide a way to correlate it with the failed call.

sq_ptrace does not look that complicated. It should by copied/pasted from sq_kill everywhere except one place.

  • Add sq_ptrace
  • check for P_KILLED inside while loop
  • Add tests for full sigqueue and competing stop events
sys/kern/kern_sig.c
405 ↗(On Diff #25114)

Right, the propagation bit is the part where I imagined some different mechanism would need to be employed. Regardless, after looking more closely, I agree that sq_ptrace doesn't really add much complexity and seems like a better solution, so I've gone forward with it.

sys/kern/sys_process.c
1136 ↗(On Diff #25065)

I looked again and can't see any issue with setting the P_WKILLED flag here instead of only for PT_KILL. If you've spotted something, however, please point me to it.

I added a test for the race I described above which passes. I think it still may be possible for a thread to exit() after a PT_KILL, but this is a narrow corner case which I think is safe to address separately. I'd also prefer to address the clearing of p_ptevents in tdsendsignal separately, mainly because I want to go get a clue about procfs events too.

kib added a reviewer: kib.
kib added inline comments.
sys/kern/kern_sig.c
418 ↗(On Diff #25318)
} else if (

i.e. merge with the next line

This revision is now accepted and ready to land.Feb 17 2017, 10:04 PM
badger edited edge metadata.
  • style fix
This revision now requires review to proceed.Feb 18 2017, 3:14 AM
This revision was automatically updated to reflect the committed changes.