Page MenuHomeFreeBSD

ptrace: Add PT_COREDUMP request (WIP)
AbandonedPublic

Authored by mgorny_gentoo.org on Apr 10 2021, 11:40 AM.
Tags
None
Referenced Files
Unknown Object (File)
Fri, Mar 8, 1:20 PM
Unknown Object (File)
Fri, Mar 8, 1:20 PM
Unknown Object (File)
Fri, Mar 8, 1:20 PM
Unknown Object (File)
Fri, Mar 8, 1:20 PM
Unknown Object (File)
Fri, Mar 8, 1:20 PM
Unknown Object (File)
Fri, Mar 8, 1:20 PM
Unknown Object (File)
Fri, Mar 8, 1:16 PM
Unknown Object (File)
Fri, Mar 1, 6:50 AM
Subscribers
None

Details

Reviewers
emaste
kib
jhb
Test Plan

The test program I'm using right now:

#include <sys/types.h>
#include <sys/ptrace.h>
#include <sys/wait.h>

#include <assert.h>
#include <signal.h>
#include <spawn.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <pthread.h>
#include <unistd.h>

volatile int g_val = 0;

void* thread1(void* data) {
    while (1)
        sleep(1);
    return NULL;
}

void* thread2(void* data) {
    sleep(1);
    int ret = raise(SIGSTOP);
    assert(ret != -1);
    return NULL;
}

int main() {
    int ret;
    pid_t pid = fork();
    assert(pid != -1);

    if (pid == 0) {
        /* child -- debugged program */
        printf("child pid: %ld\n", getpid());

        /* request tracing */
        ret = ptrace(PT_TRACE_ME, 0, NULL, 0);
        assert(ret != -1);

        /* start threads */
        pthread_t t1, t2;
        ret = pthread_create(&t1, NULL, thread1, NULL);
        assert(ret == 0);
        ret = pthread_create(&t2, NULL, thread2, NULL);
        assert(ret == 0);

        /* join threads */
        ret = pthread_join(t1, NULL);
        assert(ret == 0);
        ret = pthread_join(t2, NULL);
        assert(ret == 0);

        _exit(0);
    }

    /* parent -- the debugger */
    printf("parent pid: %ld\n", getpid());

    pid_t waited;

    /* SIGSTOP for the process */
    waited = waitpid(pid, &ret, 0);
    assert(waited == pid);
    assert(WIFSTOPPED(ret));
    assert(WSTOPSIG(ret) == SIGSTOP);

    /* get thread list */
    lwpid_t lwps[4];
    ret = ptrace(PT_GETLWPLIST, pid, lwps, 4);
    for (int i = 0; i < ret; ++i)
        printf("lwp[%d] = %d\n", i, lwps[i]);

    /* request coredump */
    struct ptrace_coredump cd = {};
    cd.pc_fd = open("/tmp/mycore", O_WRONLY|O_CREAT, 0666);
    cd.pc_limit = 0;
    ret = ptrace(PT_COREDUMP, lwps[0], (void*)&cd, sizeof(cd));
    assert(ret == 0);
    close(cd.pc_fd);

    /* resume the process */
#if 0 /* TODO: make it stop after coredump */
    ret = ptrace(PT_CONTINUE, waited, (void*)1, 0);
    assert(ret == 0);
#endif

    /* wait for exit */
    waited = waitpid(pid, &ret, 0);
    assert(waited == pid);
    assert(WIFEXITED(ret));
    assert(WEXITSTATUS(ret) == 0);

    return 0;
}

Diff Detail

Lint
Lint Skipped
Unit
Tests Skipped

Event Timeline

mgorny_gentoo.org created this revision.
mgorny_gentoo.org added inline comments.
sys/kern/kern_sig.c
106 ↗(On Diff #87195)

I suppose we'd want to move this to a header file. Could you suggest which one would be good for it?

3271 ↗(On Diff #87195)

This is moved unchanged from below, to make it available in sigexit().

coredump(9) cannot work this way. It assumes that td == curthread, and that all other threads are stopped.

To restate, to make it work, you need to;

  1. send ast to the target process (perhaps by setting some new flag, together with TDF_ASTPENDING)
  2. in ast handler, do thread_single(SINGLE_BOUNDARY), then coredump(), then thread_single_end(SINGLE_BOUNDARY)
sys/kern/kern_sig.c
106 ↗(On Diff #87195)

sys/signalvar.h I suppose

3274 ↗(On Diff #87195)

Don't do this, spell allproc_lock directly.

sys/sys/ptrace.h
77

Why not PT_COREDUMP?

In D29691#665860, @kib wrote:

coredump(9) cannot work this way. It assumes that td == curthread, and that all other threads are stopped.

Well, the latter should already be the case with ptrace(2), though I may have missed some extra check for that.

As for the other part, do you think it would be relatively easy and worth the effort to make it work on arbitrary thread? If not, them I'm going to figure out what you meant the next week ;-).

sys/kern/kern_sig.c
3274 ↗(On Diff #87195)

I didn't touch this part but I suppose I can change it, sure. I'm wondering why it was done this way originally, though.

sys/sys/ptrace.h
77

I'm using the same name as NetBSD already does. And yes, I would prefer PT_DUMPCORE. If we don't care about NetBSD compatibility, then I suppose it's fine to change it.

In D29691#665860, @kib wrote:

coredump(9) cannot work this way. It assumes that td == curthread, and that all other threads are stopped.

Well, the latter should already be the case with ptrace(2), though I may have missed some extra check for that.

Actually yes, we have a blanket check that the target process is stopped on the kern_ptrace(9) entry, except for several commands where it should be not.

Hmm, but we do not have any means to ensure that the target process stays stopped for the duration of the command. So for instance PT_IO, which needs to drop the debugee lock, could end up with the process suddenly running. For PT_IO/proc_rwmem() it is probably not too big deal, but for coredumping having threads created/destroyed, or mapping changed, could be fatal in the sense of allowing access to the freed memory etc.

Another issue I see there is wrong attribution of the used resources and applied limits.

Also, who should be the owner of the core dump, what limits should be applicable to it, for instance RLIMIT_SIZE and RLIMIT_CORE? Might be, it is time to think about the interface, e.g. you might pass the file descriptor where to write the core, instead of path.
Etc.

As for the other part, do you think it would be relatively easy and worth the effort to make it work on arbitrary thread? If not, them I'm going to figure out what you meant the next week ;-).

It is relatively simple to schedule AST and then single-thread the process. After we agree on approach, I am sure will help you with the proper use of our kernel infrastructure for the task.

Passing of a file descriptor was something I've been considering initially but using the path was an easier way to get a working prototype. I suppose we can switch to using an fd (passed via data, I guess?) to make things simpler and more robust. Since we lose any compatibility with NetBSD, we can also rename it to PT_COREDUMP.

Hmm, I would definitely rather pass an fd. If we supported a path you'd probably have to deal with Capsicum restrictions on the path lookup as well perhaps (maybe that would fail appropriately by default?)

@kib, so if I agree with the FD approach, is there anything left to decide? If not, could you start giving me some pointers?

I suppose the first thing to do would be splitting the 'core' of coredump() function dealing with the fd, correct?

@kib, so if I agree with the FD approach, is there anything left to decide? If not, could you start giving me some pointers?

I suppose the first thing to do would be splitting the 'core' of coredump() function dealing with the fd, correct?

I do not think that passing fd as data is a wise idea. It probably makes sense to define a structure that gets a pack of arguments for the call, one of them fd.

From the API point of view, there are some more questions:

  • kernel obtains adv lock on the core file. Should we do the same for PT_COREDUMP, or delegate it to userspace?
  • dump could be compressed, should be allow to control it from PT_COREDUMP op, or global settings must be used?

Perhaps ACORE accounting flag should not be set for the process.

I suspect there would be not much to split from coredump() itself, after we have the vnode, it is just a call to p_sysent->sv_coredump().

In D29691#667524, @kib wrote:

@kib, so if I agree with the FD approach, is there anything left to decide? If not, could you start giving me some pointers?

I suppose the first thing to do would be splitting the 'core' of coredump() function dealing with the fd, correct?

I do not think that passing fd as data is a wise idea. It probably makes sense to define a structure that gets a pack of arguments for the call, one of them fd.

Makes sense.

From the API point of view, there are some more questions:

  • kernel obtains adv lock on the core file. Should we do the same for PT_COREDUMP, or delegate it to userspace?
  • dump could be compressed, should be allow to control it from PT_COREDUMP op, or global settings must be used?

I'm personally not a fan of doing more in kernel space than is absolutely necessary, so I'd say the ptrace req should just write() uncompressed data into the fd and let userspace decide the rest. In either case, we could include some kind of flags field that we could use to control behavior in the future if need be. Or just rely on size of the structure changing (and being passed as data).

But that's just my opinion. You're in better position to make that kind of judgment than I am ;-).

I'm personally not a fan of doing more in kernel space than is absolutely necessary

Agreed, but we already have the ability to compress core dumps in the kernel so it is probably straightforward to plumb it through into this interface.

I'm personally not a fan of doing more in kernel space than is absolutely necessary

Agreed, but we already have the ability to compress core dumps in the kernel so it is probably straightforward to plumb it through into this interface.

It does not require any plumbing, compression just happens if elf sv_coredump is called and compression is enabled. It indeed should be connected to some argument if we want to manage it manually for PT_COREDUMP.

May be more interesting option is the control over errors when dumping segments. Right now code exits the dump if some segment cannot be read (this could happen if e.g. file was mapped and then truncated). It is reasonable to add a tweak to skip the rest of the segment and move ahead.

In D29691#667586, @kib wrote:

I'm personally not a fan of doing more in kernel space than is absolutely necessary

Agreed, but we already have the ability to compress core dumps in the kernel so it is probably straightforward to plumb it through into this interface.

It does not require any plumbing, compression just happens if elf sv_coredump is called and compression is enabled. It indeed should be connected to some argument if we want to manage it manually for PT_COREDUMP.

May be more interesting option is the control over errors when dumping segments. Right now code exits the dump if some segment cannot be read (this could happen if e.g. file was mapped and then truncated). It is reasonable to add a tweak to skip the rest of the segment and move ahead.

I see that sv_coredump() has a unused flags argument. I suppose the latter could be cleanly plugged into that. Not sure if it fits general 'compression' idea — I guess it depends on how much control we want. For my use, having a flag o force disable compression would be good enough, I guess.

mgorny_gentoo.org marked 2 inline comments as done.
mgorny_gentoo.org retitled this revision from ptrace: Add PT_COREDUMP request to ptrace: Add PT_COREDUMP request (WIP).
mgorny_gentoo.org edited the summary of this revision. (Show Details)

Ok, so I was trying to reimplement it to write into a fd. I've managed to make it successfully dump the core... of the debugger ;-). However, when I tried to make it dump the right core, my guesswork lead me to this version… which unfortunately panics:

db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe012d12ae30  
vpanic() at vpanic+0x181/frame 0xfffffe012d12ae80
panic() at panic+0x43/frame 0xfffffe012d12aee0
uiomove_faultflag() at uiomove_faultflag+0x1bb/frame 0xfffffe012d12af20         
vn_io_fault_uiomove() at vn_io_fault_uiomove+0x31/frame 0xfffffe012d12af90      
dmu_write_uio_dnode() at dmu_write_uio_dnode+0x104/frame 0xfffffe012d12b010     
dmu_write_uio_dbuf() at dmu_write_uio_dbuf+0x42/frame 0xfffffe012d12b040        
zfs_write() at zfs_write+0x67c/frame 0xfffffe012d12b1f0
zfs_freebsd_write() at zfs_freebsd_write+0x44/frame 0xfffffe012d12b210          
VOP_WRITE_APV() at VOP_WRITE_APV+0xa7/frame 0xfffffe012d12b320
vn_io_fault_doio() at vn_io_fault_doio+0xb5/frame 0xfffffe012d12b380            
vn_io_fault1() at vn_io_fault1+0x16c/frame 0xfffffe012d12b4d0
vn_rdwr() at vn_rdwr+0x23b/frame 0xfffffe012d12b590
vn_rdwr_inchunks() at vn_rdwr_inchunks+0x90/frame 0xfffffe012d12b610            
elf64_coredump() at elf64_coredump+0x1129/frame 0xfffffe012d12b730
kern_ptrace() at kern_ptrace+0x954/frame 0xfffffe012d12b890
sys_ptrace() at sys_ptrace+0x1af/frame 0xfffffe012d12bac0
amd64_syscall() at amd64_syscall+0x12e/frame 0xfffffe012d12bbf0
fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe012d12bbf0
--- syscall (26, FreeBSD ELF64, sys_ptrace), rip = 0x8003816ba, rsp = 0x7fffffff
da68, rbp = 0x7fffffffdaa0 ---
KDB: enter: panic
[ thread pid 799 tid 100669 ]
Stopped at      kdb_enter+0x37: movq    $0,0x1287d2e(%rip)

Could give me a pointer of a few things I'm doing wrong here?

Ok, so I was trying to reimplement it to write into a fd. I've managed to make it successfully dump the core... of the debugger ;-). However, when I tried to make it dump the right core, my guesswork lead me to this version… which unfortunately panics:

db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe012d12ae30  
vpanic() at vpanic+0x181/frame 0xfffffe012d12ae80
panic() at panic+0x43/frame 0xfffffe012d12aee0
uiomove_faultflag() at uiomove_faultflag+0x1bb/frame 0xfffffe012d12af20         
vn_io_fault_uiomove() at vn_io_fault_uiomove+0x31/frame 0xfffffe012d12af90      
dmu_write_uio_dnode() at dmu_write_uio_dnode+0x104/frame 0xfffffe012d12b010     
dmu_write_uio_dbuf() at dmu_write_uio_dbuf+0x42/frame 0xfffffe012d12b040        
zfs_write() at zfs_write+0x67c/frame 0xfffffe012d12b1f0
zfs_freebsd_write() at zfs_freebsd_write+0x44/frame 0xfffffe012d12b210          
VOP_WRITE_APV() at VOP_WRITE_APV+0xa7/frame 0xfffffe012d12b320
vn_io_fault_doio() at vn_io_fault_doio+0xb5/frame 0xfffffe012d12b380            
vn_io_fault1() at vn_io_fault1+0x16c/frame 0xfffffe012d12b4d0
vn_rdwr() at vn_rdwr+0x23b/frame 0xfffffe012d12b590
vn_rdwr_inchunks() at vn_rdwr_inchunks+0x90/frame 0xfffffe012d12b610            
elf64_coredump() at elf64_coredump+0x1129/frame 0xfffffe012d12b730
kern_ptrace() at kern_ptrace+0x954/frame 0xfffffe012d12b890
sys_ptrace() at sys_ptrace+0x1af/frame 0xfffffe012d12bac0
amd64_syscall() at amd64_syscall+0x12e/frame 0xfffffe012d12bbf0
fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe012d12bbf0
--- syscall (26, FreeBSD ELF64, sys_ptrace), rip = 0x8003816ba, rsp = 0x7fffffff
da68, rbp = 0x7fffffffdaa0 ---
KDB: enter: panic
[ thread pid 799 tid 100669 ]
Stopped at      kdb_enter+0x37: movq    $0,0x1287d2e(%rip)

Could give me a pointer of a few things I'm doing wrong here?

And what is the panic message? I have to guess?

So let me guess. I imagine that the message was panic: uiomove proc, which was caused by the following assert

KASSERT(uio->uio_segflg != UIO_USERSPACE || uio->uio_td == curthread,
	    ("uiomove proc"));

This is exactly what I worried about from the beginning, core dumping code cannot deal with dump of non-current process. You need to switch context to the target and have it call sv_coredump(). My recommendation is to send AST and handle it by starting single-threaded region, where to do the actual call to sv_coredump(). If you need help with this, I sure will provide sample code and look at the issues.

In D29691#668838, @kib wrote:

And what is the panic message? I have to guess?

I'm sorry, it scrolled out of the screen.

db> show panic 
panic: uiomove proc

So let me guess. I imagine that the message was panic: uiomove proc, which was caused by the following assert

KASSERT(uio->uio_segflg != UIO_USERSPACE || uio->uio_td == curthread,
	    ("uiomove proc"));

This is exactly what I worried about from the beginning, core dumping code cannot deal with dump of non-current process. You need to switch context to the target and have it call sv_coredump(). My recommendation is to send AST and handle it by starting single-threaded region, where to do the actual call to sv_coredump(). If you need help with this, I sure will provide sample code and look at the issues.

Ah, sorry, I've read your comment wrong. I thought the problem will happen when a different thread of tracee was current. In that case, I know what to try next, thanks. I'll let you know if I can't figure out how to do it then.

Ok. I'd use a few more pointers. I think I'm supposed to add a flag to td_flags (reusing one of the values marked as unused?), add new fields to struct proc to pass the necessary data, and then do the work inside ast() function. What I can't figure out is how to block the ptrace(2) call and trigger the context switch.

Very rought draft is below. While thinking about the structure, I realized that PT_COREDUMP must be like PT_CONTINUE, i.e. we should continue the execution of the process and request the dump. It might be that you want to stop right after the dump, which should be signified by a flag to the op.

Code below uses P2_COREDUMP proc flag as an indicator, but I do not think it can survive to the production-ready patch. In particular, if more that one PT_COREDUMP request is in flight, it is broken. Also it might be better to only unsuspend one thread, but lets handle this later.

diff --git a/sys/kern/subr_trap.c b/sys/kern/subr_trap.c
index 8981091b50ed..82a26fd21729 100644
--- a/sys/kern/subr_trap.c
+++ b/sys/kern/subr_trap.c
@@ -355,6 +355,22 @@ ast(struct trapframe *framep)
 		kern_sigprocmask(td, SIG_SETMASK, &td->td_oldsigmask, NULL, 0);
 	}
 
+	if ((p->p_flag2 & P2_COREDUMP) != 0) {
+		PROC_LOCK(p);
+		if ((p->p_flag2 & P2_COREDUMP) != 0) {
+			if (p->p_sysent->sv_coredump != NULL) {
+				if (thread_single(p, SINGLE_BOUNDARY) != 0)
+					error = EBUSY;
+				else {
+					error = p->p_sysent->sv_coredump(td, vp, limit, 0);
+					thread_single_end(p, SINGLE_BOUNDARY);
+				}
+			p->p_flag2 &= ~P2_COREDUMP;
+			wakeup(p);
+		}
+		PROC_UNLOCK(p);
+	}
+
 #ifdef RACCT
 	if (__predict_false(racct_enable && p->p_throttled != 0))
 		racct_proc_throttled(p);
diff --git a/sys/kern/sys_process.c b/sys/kern/sys_process.c
index 39749848b3d6..b32dcf3dc350 100644
--- a/sys/kern/sys_process.c
+++ b/sys/kern/sys_process.c
@@ -983,6 +983,7 @@ kern_ptrace(struct thread *td, int req, pid_t pid, void *addr, int data)
 		case PT_TO_SCE:
 		case PT_TO_SCX:
 		case PT_SYSCALL:
+		case PT_COREDUMP:
 			if (addr != (void *)1) {
 				error = ptrace_set_pc(td2,
 				    (u_long)(uintfptr_t)addr);
@@ -1015,6 +1016,10 @@ kern_ptrace(struct thread *td, int req, pid_t pid, void *addr, int data)
 				CTR3(KTR_PTRACE,
 				    "PT_CONTINUE: pid %d, PC = %#lx, sig = %d",
 				    p->p_pid, (u_long)(uintfptr_t)addr, data);
+			case PT_COREDUMP:
+				CTR3(KTR_PTRACE,
+				    "PT_COREDUMP: pid %d, PC = %#lx, sig = %d",
+				    p->p_pid, (u_long)(uintfptr_t)addr, data);
 				break;
 			}
 			break;
@@ -1073,6 +1078,11 @@ kern_ptrace(struct thread *td, int req, pid_t pid, void *addr, int data)
 	sendsig:
 		MPASS(proctree_locked == 0);
 
+		if ((p->p_flag2 & P2_COREDUMP) != 0) {
+			/* XXXKIB do something reasonable */
+		}
+		p->p_flag2 |= P2_COREDUMP;
+
 		/*
 		 * Clear the pending event for the thread that just
 		 * reported its event (p_xthread).  This may not be
@@ -1328,6 +1338,12 @@ kern_ptrace(struct thread *td, int req, pid_t pid, void *addr, int data)
 	}
 
 out:
+	if (req == PT_COREDUMP) {
+		while ((p->p_flag2 & P2_COREDUMP) != 0)
+			msleep(p, &p->p_mtx, PPAUSE, "crdmp", 0);
+		/* get the results etc */
+	}
+
 	/* Drop our hold on this process now that the request has completed. */
 	_PRELE(p);
 fail:
diff --git a/sys/sys/proc.h b/sys/sys/proc.h
index 238f3bbf4afe..4847de1fd6d8 100644
--- a/sys/sys/proc.h
+++ b/sys/sys/proc.h
@@ -830,6 +830,7 @@ struct proc {
 						   after exec */
 #define	P2_NEWLY_JAILED		0x00002000	/* Jailed, not yet execed */
 #define	P2_ITSTOPPED		0x00004000
+#define	P2_COREDUMP		0x00008000
 
 /* Flags protected by proctree_lock, kept in p_treeflags. */
 #define	P_TREE_ORPHANED		0x00000001	/* Reparented, on orphan list */

Thanks for your suggestions. I've had to make a few modifications, particularly regarding addr and data use.

Right now I'm hitting another panic. I've commented out some code to rule it out, and it's really weird — it looks like thread_single_end() immediately after thread_single() doesn't work:

panic: thread_single_end from other thread 0xfffffe00f57b9300 0
cpuid = 3
time = 1618858995
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe01291e3ac0  
vpanic() at vpanic+0x181/frame 0xfffffe01291e3b10
panic() at panic+0x43/frame 0xfffffe01291e3b70
thread_single_end() at thread_single_end+0x1e8/frame 0xfffffe01291e3bb0         
ast() at ast+0x4bb/frame 0xfffffe01291e3bf0
fast_syscall_common() at fast_syscall_common+0x1a5/frame 0xfffffe01291e3bf0     
--- syscall (0, FreeBSD ELF64, nosys), rip = 0x8003803da, rsp = 0x7fffffffda38, 
rbp = 0x7fffffffda60 ---
KDB: enter: panic
[ thread pid 9471 tid 100667 ]
Stopped at      kdb_enter+0x37: movq    $0,0x1286b8e(%rip)

Ok, so apparently I'm hitting this condition:

	if ((p->p_flag & P_HADTHREADS) == 0 && mode != SINGLE_ALLPROC)
		return (0);

FWICS kern_fork.c runs thread_single* conditionally to P_HADTHREADS, so I suppose I should do the same.

That said, wouldn't it make more sense to make the API more consistent, i.e. either have thread_single_end handle this gracefully or have thread_single throw an assert as well? I can make a patch if you agree.

Ok, so apparently I'm hitting this condition:

	if ((p->p_flag & P_HADTHREADS) == 0 && mode != SINGLE_ALLPROC)
		return (0);

FWICS kern_fork.c runs thread_single* conditionally to P_HADTHREADS, so I suppose I should do the same.

That said, wouldn't it make more sense to make the API more consistent, i.e. either have thread_single_end handle this gracefully or have thread_single throw an assert as well? I can make a patch if you agree.

I suspect that my code might cause more than one thread to observe P2_COREDUMP, BTW. I believe something more involved needs to be done: either the flag should be set for specific thread, or the first thread that observes the flag in AST should clear it, and then you need some other flag (?) to sync this thread with debugger.

There might be an interesting corner case with the per-thread instead of per-process XXX_COREDUMP flag. Namely, after you selected the thread, nothing ensures that the thread does not terminate before ever leaving to userspace and having chance to execute AST handler. Process is held in kern_ptrace(), that makes it impossible for it to terminate.

No, I do not think that thread_single_end() can handle P_HADTHREADS, you need to add this check to the call site.

First working version, with ability to set coredump size limit.

In D29691#670121, @kib wrote:

I suspect that my code might cause more than one thread to observe P2_COREDUMP, BTW. I believe something more involved needs to be done: either the flag should be set for specific thread, or the first thread that observes the flag in AST should clear it, and then you need some other flag (?) to sync this thread with debugger.

Yes, I think this should be per-thread, to let the user request a coredump for any specific thread. I'll try to tackle this next.

There might be an interesting corner case with the per-thread instead of per-process XXX_COREDUMP flag. Namely, after you selected the thread, nothing ensures that the thread does not terminate before ever leaving to userspace and having chance to execute AST handler. Process is held in kern_ptrace(), that makes it impossible for it to terminate.

What would happen to td2 then? Would it become invalid, or can we use to check if thread's still alive?

No, I do not think that thread_single_end() can handle P_HADTHREADS, you need to add this check to the call site.

Ok. But shouldn't then thread_single fail as well, instead of returning 0?

mgorny_gentoo.org edited the test plan for this revision. (Show Details)

Enable enabling coredump per LWP. It mostly works — when passed PID, or LWPIDs of two spawned threads. However, if I pass the LWPID of the initial thread, the test program hangs (haven't verified if it's hanging in ptrace() yet or somewhere else).

sys/kern/sys_process.c
1046

Indent should be +4.

1098

E.g. return EBUSY.

1100

I think you need to inform the victim about AST pending.

sys/sys/proc.h
723 ↗(On Diff #87799)

I suggest to not add these fields to the struct proc, they are almost never used, but have to be allocated for each process. Instead, add single pointer, which points to the structure containing these members.

Structure needs to be malloc-ed by the debugger thread, it should be not allocated on stack, because sleeping thread waiting for the coredump to finish, might get it stack swapped out.

Also this pointer != NULL may serve better than P2_COREDUMP. The structure should also include an indicator of the finished dump, which solves another use of P2_COREDUMP.

sys/sys/ptrace.h
186

Note that you continue execution of the target process to get the coredump. Two notes.

First, I believe this action should be similar to PT_CONTINUE, ie. caller should be able to specify $pc or signal to deliver.

Second issue is more delicate. Currently we continue all threads in the process, but the first thread that notices P2_COREDUMP on exit to userspace, and should do thread_single(). But it is racy, other threads might not notice AST/P2_COREDUMP, continue execution, return to userspace and do something that would break consistency of the coredump (comparing with the moment when the PT_COREDUMP request was issued). I am not sure yet how to handle it.

mgorny_gentoo.org added inline comments.
sys/sys/ptrace.h
186

Note that you continue execution of the target process to get the coredump.

I know but I presume we want to eventually fix the patch to stop it after taking the coredump.

First, I believe this action should be similar to PT_CONTINUE, ie. caller should be able to specify $pc or signal to deliver.

To be honest, I don't see the purpose of that, given the above.

Second issue is more delicate. Currently we continue all threads in the process, but the first thread that notices P2_COREDUMP on exit to userspace, and should do thread_single(). But it is racy, other threads might not notice AST/P2_COREDUMP, continue execution, return to userspace and do something that would break consistency of the coredump (comparing with the moment when the PT_COREDUMP request was issued). I am not sure yet how to handle it.

Yes, it doesn't sound that we're doing it right. I suppose one option would be to temporarily suspend all remaining threads. Or maybe even all threads if that's legal; I suppose that could prevent the program from continuing.

I thought about different ways to fix the race with run-away threads, and I do not like any of them. I think that the best is to not allow the race in first place. Basically, this can be done by only unsuspending single thread, but the victim must be selected wisely, this is somewhat non-trivial. Perhaps the candidate must be parked in ptracestop().

Please handle all issues we already determined in the existing patch, except the race. I will take over the diff then and handle the race in the proper way.

lib/libc/sys/ptrace.2
822

of the
.Vt struct ptrace_coredump

sys/kern/sys_process.c
1097

I do not see why this assert must held.

sys/sys/proc.h
382 ↗(On Diff #87801)

Please move the fields into dedicated structure, only pointer to which is stored in struct proc. See my previous note for more elaboration.

I'm sorry, I've had a newer version locally but I've forgotten to update the diff here. Lemme implement the newest comments, retest and update.

Update keep state per thread, on the thread explicitly requested via ptrace(2) and malloc-ed rather than inline.

sys/kern/sys_process.c
1066

Note that somebody else might have detached the target while we unlocked the process. Same holds for the wait for coredump to finish. I will handle this.

1070

coredump_data cannot be NULL due to M_WAITOK

1077

Style: no need for (), '?' should be on this line, not on the continuation one.

1080

no need for ()

td_flags is locked by thread lock.

1360

Style: empty line after locals declaration.

Again, td_flags is synchronized by thread lock. This makes me think that td_dbgflags and TDB_COREDUMP would be better choice.

1362

MPASS(td2->td_coredump != NULL);

1367

I think locking would be still right if you just free() under the process lock.

Thanks for the input, I'll do that in a minute. In the meantime, I've been thinking how to unsuspend one thread only. A primitive solution I can think of is using both process and thread flags, and assuming that if P2_COREDUMP is set, then only threads having TDF_COREDUMP are resumed.

Hmm, this makes me wonder if we're not going to hang if we're trying to get a coredump from a thread that's PT_SUSPEND-ed.

mgorny_gentoo.org marked 3 inline comments as done.

Switched to td_dbgflags. Removed unnecessary OOM handling. Added locking for TDF_ASTPENDING update.

Disable coredump compression for the time being.

TODO: add a PT_COREDUMP flag for this?

Some minor notes, actually for me. Do not bother fixing them, I am working on updated patch handling the two races.

sys/kern/sys_process.c
552

This needs COMPAT32 handling

1076

It should be td2, not td.