
WIP: try to make libthr destroy_after_cancel test less racy
AbandonedPublic

Authored by arichardson on Dec 11 2020, 4:59 PM.

Details

Summary

This test almost always deadlocks in the CheriBSD Jenkins. Since
FreeBSD CI is green, I thought I'd try upstream HEAD, and it turns out this
test also deadlocks sometimes if you run it in a loop. This produces the
following truss output:

<new thread 100160>
sigfastblock(0x1,0x801812538)                    = 0 (0x0)
_umtx_op(0x8014cf008,UMTX_OP_MUTEX_WAKE2,0x0,0x0,0x0) = 0 (0x0)
_umtx_op(0x8018127c8,UMTX_OP_NWAKE_PRIVATE,0x1,0x0,0x0) = 0 (0x0)
_umtx_op(0x8010a6f38,UMTX_OP_WAIT_UINT_PRIVATE,0x0,0x0,0x0) = 0 (0x0)
thr_kill(100160,SIGTHR)                          = 0 (0x0)
SIGNAL 32 (SIGTHR) code=SI_LWP pid=913 uid=0
sigreturn(0x7fffdfffdab0)                        EJUSTRETURN
thr_wake(0x18740)                                = 0 (0x0)
thr_wake(0x18740)                                = 0 (0x0)
^C_umtx_op(0x801812500,UMTX_OP_WAIT,0x18740,0x0,0x0) ERR#4 'Interrupted system call'
SIGNAL 2 (SIGINT) code=SI_KERNEL

Attaching GDB reveals that thread 1 is blocked in pthread_join and thread 2
is inside pthread_cond_wait:

`
Thread 2 (LWP 100132 of process 892):
#0  _umtx_op_err () at /local/scratch/alr48/cheri/cheribsd/lib/libthr/arch/amd64/amd64/_umtx_op_err.S:40
#1  0x00000008010a2360 in _thr_umtx_timedwait_uint (mtx=0x8014f2008, id=id@entry=0, clockid=<optimized out>, abstime=<optimized out>, shared=<optimized out>, shared@entry=0) at /local/scratch/alr48/cheri/cheribsd/lib/libthr/thread/thr_umtx.c:247
#2  0x0000000801099319 in _thr_sleep (curthread=curthread@entry=0x801812500, clockid=0, abstime=abstime@entry=0x0) at /local/scratch/alr48/cheri/cheribsd/lib/libthr/thread/thr_kern.c:199
#3  0x000000080109498f in cond_wait_user (cvp=0x801829100, mp=0x8014d1008, abstime=0x0, cancel=1) at /local/scratch/alr48/cheri/cheribsd/lib/libthr/thread/thr_cond.c:320
#4  cond_wait_common (cond=<optimized out>, cond@entry=0x102a270 <cond>, mutex=<optimized out>, mutex@entry=0x102a268 <mutex>, abstime=abstime@entry=0x0, cancel=cancel@entry=1) at /local/scratch/alr48/cheri/cheribsd/lib/libthr/thread/thr_cond.c:380
#5  0x0000000801094c01 in __thr_cond_wait (cond=0x8014f2008, cond@entry=0x102a270 <cond>, mutex=0xf, mutex@entry=0x102a268 <mutex>) at /local/scratch/alr48/cheri/cheribsd/lib/libthr/thread/thr_cond.c:395
#6  0x00000000010272d4 in destroy_after_cancel_threadfunc (arg=<optimized out>) at /local/scratch/alr48/cheri/cheribsd/contrib/netbsd-tests/lib/libpthread/t_cond.c:569
#7  0x0000000801095b1b in thread_start (curthread=0x801812500) at /local/scratch/alr48/cheri/cheribsd/lib/libthr/thread/thr_create.c:309
#8  0x0000000000000000 in ?? ()
Backtrace stopped: Cannot access memory at address 0x7fffdfffe000

Thread 1 (LWP 100073 of process 892):
#0  _umtx_op_err () at /local/scratch/alr48/cheri/cheribsd/lib/libthr/arch/amd64/amd64/_umtx_op_err.S:40
#1  0x0000000801097853 in join_common (pthread=0x801812500, thread_return=thread_return@entry=0x0, abstime=abstime@entry=0x0, peek=<optimized out>) at /local/scratch/alr48/cheri/cheribsd/lib/libthr/thread/thr_join.c:147
#2  0x000000080109758b in _thr_join (pthread=0x801812500, thread_return=0x2, thread_return@entry=0x0) at /local/scratch/alr48/cheri/cheribsd/lib/libthr/thread/thr_join.c:62
#3  0x0000000001026f5b in atfu_destroy_after_cancel_body (tc=<optimized out>) at /local/scratch/alr48/cheri/cheribsd/contrib/netbsd-tests/lib/libpthread/t_cond.c:614
#4  0x000000080107d11c in atf_tc_run (tc=0x102a250 <atfu_destroy_after_cancel_tc>, tc@entry=0x801819040, resfile=resfile@entry=0x801070b59 "%s: WARNING: %s\n") at /local/scratch/alr48/cheri/cheribsd/contrib/atf/atf-c/tc.c:1020
#5  0x000000080107f19e in atf_tp_run (tp=tp@entry=0x7fffffffda28, tcname=tcname@entry=0x801819040 "destroy_after_cancel", resfile=<optimized out>) at /local/scratch/alr48/cheri/cheribsd/contrib/atf/atf-c/tp.c:201
#6  0x000000080107fbc1 in run_tc (tp=0x7fffffffda28, p=0x7fffffffda40, exitcode=<optimized out>) at /local/scratch/alr48/cheri/cheribsd/contrib/atf/atf-c/detail/tp_main.c:504
#7  controlled_main (argc=<optimized out>, argv=0x7fffffffead0, add_tcs_hook=0x1024d00 <atfu_tp_add_tcs>, exitcode=<optimized out>) at /local/scratch/alr48/cheri/cheribsd/contrib/atf/atf-c/detail/tp_main.c:574
#8  atf_tp_main (argc=<optimized out>, argc@entry=2, argv=argv@entry=0x7fffffffead0, add_tcs_hook=0x1024d00 <atfu_tp_add_tcs>) at /local/scratch/alr48/cheri/cheribsd/contrib/atf/atf-c/detail/tp_main.c:604
#9  0x0000000001024cf1 in main (argc=25240832, argc@entry=2, argv=0x2, argv@entry=0x7fffffffead0) at /local/scratch/alr48/cheri/cheribsd/contrib/netbsd-tests/lib/libpthread/t_cond.c:684
#10 0x0000000001024ab2 in _start (ap=<optimized out>, cleanup=<optimized out>) at /local/scratch/alr48/cheri/cheribsd/lib/csu/amd64/crt1_c.c:75
`

With the current version of the test I get an EBUSY error 2/3 times (it
seems the condvar still has waiters when cancellation happens). If I
uncomment the fprintf statements, it passes most of the time, but also
sometimes gives me an EBUSY.

Note: I added the pthread_mutex_isowned_np((pthread_mutex_t *)arg) check
in the cleanup callback since it seems like the mutex is usually not
owned when the thread is cancelled.
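
For readers without the diff at hand, here is a rough sketch of a cleanup callback that only unlocks a mutex the calling thread owns. It is not the code from this revision: pthread_mutex_isowned_np is FreeBSD-specific, so this portable variant swaps in an error-checking mutex (PTHREAD_MUTEX_ERRORCHECK), for which unlocking an unowned mutex returns EPERM. The names unlock_if_owned, init_errorcheck_mutex and last_unlock_result are made up for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <pthread.h>

/* Hypothetical: result of the last unlock attempt, kept for inspection. */
static int last_unlock_result;

/*
 * Cleanup callback sketch. Assumes arg points to a mutex created with
 * PTHREAD_MUTEX_ERRORCHECK, so unlocking a mutex the caller does not
 * own fails with EPERM instead of being undefined behaviour.
 */
static void
unlock_if_owned(void *arg)
{
	last_unlock_result = pthread_mutex_unlock((pthread_mutex_t *)arg);
}

/* Hypothetical helper: initialise an error-checking mutex. */
static int
init_errorcheck_mutex(pthread_mutex_t *m)
{
	pthread_mutexattr_t attr;
	int error;

	error = pthread_mutexattr_init(&attr);
	if (error != 0)
		return (error);
	error = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
	if (error == 0)
		error = pthread_mutex_init(m, &attr);
	pthread_mutexattr_destroy(&attr);
	return (error);
}
```

With this variant the callback is safe to run whether or not the cancelled thread holds the mutex, at the cost of requiring the test to create the mutex with the error-checking type.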

I'm not sure if the test is broken, or if the libthr implementation should
ensure that there can't be a lost wakeup and/or having waiters on the
condvar after cancel.
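
To make the expectation concrete, the pattern under test can be sketched roughly as follows. This is a simplified illustration, not the actual t_cond.c code: the waiting flag, the polling handshake and the function names are assumptions. A thread blocks in pthread_cond_wait, the main thread cancels and joins it, and pthread_cond_destroy is then expected to return 0 rather than EBUSY.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <sched.h>
#include <stddef.h>

static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
static int waiting;	/* protected by mutex; hypothetical handshake flag */

/* POSIX reacquires the mutex before cleanup handlers run on cancel. */
static void
unlock_mutex(void *arg)
{
	pthread_mutex_unlock((pthread_mutex_t *)arg);
}

static void *
waiter(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&mutex);
	pthread_cleanup_push(unlock_mutex, &mutex);
	waiting = 1;
	/* Never signalled: the thread only leaves via cancellation. */
	while (waiting)
		pthread_cond_wait(&cond, &mutex);
	pthread_cleanup_pop(1);
	return (NULL);
}

/* Returns 0 if the condvar could be destroyed after the cancel. */
static int
destroy_after_cancel(void)
{
	pthread_t thr;
	void *result;
	int error;

	error = pthread_create(&thr, NULL, waiter, NULL);
	if (error != 0)
		return (error);
	/*
	 * Once we hold the mutex and see waiting set, the waiter must
	 * have released the mutex inside pthread_cond_wait.
	 */
	for (;;) {
		pthread_mutex_lock(&mutex);
		if (waiting)
			break;
		pthread_mutex_unlock(&mutex);
		sched_yield();
	}
	error = pthread_cancel(thr);
	pthread_mutex_unlock(&mutex);
	if (error != 0)
		return (error);
	error = pthread_join(thr, &result);
	if (error != 0)
		return (error);
	if (result != PTHREAD_CANCELED)
		return (-1);
	return (pthread_cond_destroy(&cond));
}
```

In the hang described above, the equivalent of pthread_join never returns because the waiter goes back to sleep after the cancel signal. In the EBUSY case, the equivalent of the final pthread_cond_destroy fails.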

Test Plan

No more 300s timeout with this version, but it fails 2/3 times.

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
No Lint Coverage
Unit
No Test Coverage
Build Status
Buildable 35350
Build 32277: arc lint + arc unit

Event Timeline

I believe that libpthread must ensure that the cancellation request is not lost. If it is, as your data indicates, perhaps we have a bug in libthr.

I do not see a race in the test case itself, but I was also unable to identify a problem in thr_cond.c + cancellation handling from inspection. I also tried several hundred runs of the distilled test locally, and I did not get a hang or EBUSY.

Is your data, i.e. truss output + gdb traces, consistent? I am asking because I do not see any actions in the victim thread after the cancel signal caused the umtx op to return EINTR.
Perhaps it is better to use ktrace/kdump -H and then attach gdb. After that I might ask for runs with some debugging patches.

I wonder if the problem is that I'm running this on a single-CPU QEMU instance, so only one of the threads can be scheduled at a time.

Here is what I get on latest HEAD (r368578):

[Switching to LWP 100076 of process 970]
_umtx_op_err () at /local/scratch/alr48/cheri/freebsd/lib/libthr/arch/amd64/amd64/_umtx_op_err.S:40
40      RSYSCALL_ERR(_umtx_op)
(gdb) thread apply all bt

Thread 2 (LWP 100091 of process 970):
#0  _umtx_op_err () at /local/scratch/alr48/cheri/freebsd/lib/libthr/arch/amd64/amd64/_umtx_op_err.S:40
#1  0x0000000800282350 in _thr_umtx_timedwait_uint (mtx=0x8006dd008, id=<optimized out>, clockid=<optimized out>, abstime=<optimized out>, shared=<optimized out>) at /local/scratch/alr48/cheri/freebsd/lib/libthr/thread/thr_umtx.c:236
#2  0x0000000800278fb9 in _thr_sleep (curthread=<optimized out>, clockid=0, abstime=0x0) at /local/scratch/alr48/cheri/freebsd/lib/libthr/thread/thr_kern.c:199
#3  0x000000080027463b in cond_wait_user (cvp=0x800a2c100, mp=0x8006bc008, abstime=0x0, cancel=1) at /local/scratch/alr48/cheri/freebsd/lib/libthr/thread/thr_cond.c:320
#4  cond_wait_common (cond=<optimized out>, mutex=<optimized out>, abstime=0x0, cancel=1) at /local/scratch/alr48/cheri/freebsd/lib/libthr/thread/thr_cond.c:380
#5  0x0000000800274921 in __thr_cond_wait (cond=0x8006dd008, mutex=0xf) at /local/scratch/alr48/cheri/freebsd/lib/libthr/thread/thr_cond.c:395
#6  0x00000000002056d4 in destroy_after_cancel_threadfunc (arg=<optimized out>) at /local/scratch/alr48/cheri/freebsd/contrib/netbsd-tests/lib/libpthread/t_cond.c:557
#7  0x000000080027583b in thread_start (curthread=0x800a12500) at /local/scratch/alr48/cheri/freebsd/lib/libthr/thread/thr_create.c:292
#8  0x0000000000000000 in ?? ()
Backtrace stopped: Cannot access memory at address 0x7fffdfffe000

Thread 1 (LWP 100076 of process 970):
#0  _umtx_op_err () at /local/scratch/alr48/cheri/freebsd/lib/libthr/arch/amd64/amd64/_umtx_op_err.S:40
#1  0x0000000800277573 in join_common (pthread=0x800a12500, thread_return=0x0, abstime=0x0, peek=<optimized out>) at /local/scratch/alr48/cheri/freebsd/lib/libthr/thread/thr_join.c:147
#2  0x00000008002772ab in _thr_join (pthread=0x800a12500, thread_return=0x2) at /local/scratch/alr48/cheri/freebsd/lib/libthr/thread/thr_join.c:62
#3  0x000000000020550f in atfu_destroy_after_cancel_body (tc=<optimized out>) at /local/scratch/alr48/cheri/freebsd/contrib/netbsd-tests/lib/libpthread/t_cond.c:589
#4  0x000000080025d0cc in atf_tc_run (tc=0x208270 <atfu_destroy_after_cancel_tc>, resfile=<optimized out>) at /local/scratch/alr48/cheri/freebsd/contrib/atf/atf-c/tc.c:1008
#5  0x000000080025f14e in atf_tp_run (tp=<optimized out>, tcname=<optimized out>, resfile=<optimized out>) at /local/scratch/alr48/cheri/freebsd/contrib/atf/atf-c/tp.c:201
#6  0x000000080025fb71 in run_tc (tp=0x7fffffffda68, p=0x7fffffffda80, exitcode=<optimized out>) at /local/scratch/alr48/cheri/freebsd/contrib/atf/atf-c/detail/tp_main.c:504
#7  controlled_main (argc=<optimized out>, argv=0x7fffffffeb10, add_tcs_hook=0x2037f0 <atfu_tp_add_tcs>, exitcode=<optimized out>) at /local/scratch/alr48/cheri/freebsd/contrib/atf/atf-c/detail/tp_main.c:574
#8  atf_tp_main (argc=<optimized out>, argv=0x7fffffffeb10, add_tcs_hook=0x2037f0 <atfu_tp_add_tcs>) at /local/scratch/alr48/cheri/freebsd/contrib/atf/atf-c/detail/tp_main.c:604
#9  0x00000000002037ef in main (argc=10560768, argv=0x2) at /local/scratch/alr48/cheri/freebsd/contrib/netbsd-tests/lib/libpthread/t_cond.c:658
#10 0x0000000000203600 in _start (ap=<optimized out>, cleanup=<optimized out>) at /local/scratch/alr48/cheri/freebsd/lib/csu/amd64/crt1_c.c:75

kdump -H:

970 100076 cond_test RET   write 73/0x49
970 100076 cond_test CALL  write(0x2,0x7fffffffd300,0x7d)
970 100076 cond_test GIO   fd 2 wrote 125 bytes
    "cond_test: WARNING: No isolation nor timeout control is being applied; you may get unexpected failures; see atf-test-case(4)
    "
970 100076 cond_test RET   write 125/0x7d
970 100076 cond_test CALL  mmap(0,0x21000,0x3<PROT_READ|PROT_WRITE>,0x1002<MAP_PRIVATE|MAP_ANON>,0xffffffff,0)
970 100076 cond_test RET   mmap 34366799872/0x8006bc000
970 100076 cond_test CALL  mmap(0,0x1000,0x3<PROT_READ|PROT_WRITE>,0x1002<MAP_PRIVATE|MAP_ANON>,0xffffffff,0)
970 100076 cond_test RET   mmap 34366935040/0x8006dd000
970 100076 cond_test CALL  mmap(0x7fffdfdfd000,0x201000,0x3<PROT_READ|PROT_WRITE>,0x400<MAP_STACK>,0xffffffff,0)
970 100076 cond_test RET   mmap 140736949374976/0x7fffdfdfd000
970 100076 cond_test CALL  mprotect(0x7fffdfdfd000,0x1000,0<PROT_NONE>)
970 100076 cond_test RET   mprotect 0
970 100076 cond_test CALL  thr_new(0x7fffffffd4d0,0x68)
970 100076 cond_test RET   thr_new 0
970 100076 cond_test CALL  _umtx_op(0x8002864a8,0xf<UMTX_OP_WAIT_UINT_PRIVATE>,0,0,0)
970 100091 cond_test RET   fork 0
970 100091 cond_test CALL  sigfastblock(0x1,0x800a12538)
970 100091 cond_test RET   sigfastblock 0
970 100091 cond_test CALL  _umtx_op(0x8006bc008,0x16<UMTX_OP_MUTEX_WAKE2>,0,0,0)
970 100091 cond_test RET   _umtx_op 0
970 100091 cond_test CALL  _umtx_op(0x800a127c0,0x15<UMTX_OP_NWAKE_PRIVATE>,0x1,0,0)
970 100076 cond_test RET   _umtx_op 0
970 100076 cond_test CALL  thr_kill(0x186fb,SIGTHR)
970 100076 cond_test RET   thr_kill 0
970 100076 cond_test CALL  _umtx_op(0x800a12500,0x2<UMTX_OP_WAIT>,0x186fb,0,0)
970 100091 cond_test RET   _umtx_op 0
970 100091 cond_test PSIG  SIGTHR caught handler=0x80027e1b0 mask=0x0 code=SI_LWP
970 100091 cond_test CALL  sigreturn(0x7fffdfffdac0)
970 100091 cond_test RET   sigreturn JUSTRETURN
970 100091 cond_test CALL  thr_wake(0x186fb)
970 100091 cond_test RET   thr_wake 0
970 100091 cond_test CALL  thr_wake(0x186fb)
970 100091 cond_test RET   thr_wake 0
970 100091 cond_test CALL  _umtx_op(0x8006dd008,0xf<UMTX_OP_WAIT_UINT_PRIVATE>,0,0,0)

Indeed, if I start QEMU with -smp 4, I only get the freeze after a few thousand runs of the test, whereas with a single CPU it happens (almost?) every time.

Could you please apply the following debugging patch and provide the same data as before (ktrace + gdb backtraces) and, hopefully, the kernel printf output with the TDP_WAKEUP headline?

diff --git a/sys/kern/subr_sleepqueue.c b/sys/kern/subr_sleepqueue.c
index 3fe407926c9..5b7a1b75bd2 100644
--- a/sys/kern/subr_sleepqueue.c
+++ b/sys/kern/subr_sleepqueue.c
@@ -433,6 +433,7 @@ sleepq_sleepcnt(const void *wchan, int queue)
 	return (sq->sq_blockedcnt[queue]);
 }
 
+#include <sys/stack.h>
 static int
 sleepq_check_ast_sc_locked(struct thread *td, struct sleepqueue_chain *sc)
 {
@@ -443,6 +444,10 @@ sleepq_check_ast_sc_locked(struct thread *td, struct sleepqueue_chain *sc)
 
 	ret = 0;
 	if ((td->td_pflags & TDP_WAKEUP) != 0) {
+struct stack st;
+stack_save(&st);
+printf("TDP_WAKEUP cleanup: tid %d pid %d comm td->td_proc->p_comm %s\n", td->td_tid, td->td_proc->p_pid, td->td_proc->p_comm);
+stack_print_ddb(&st);
 		td->td_pflags &= ~TDP_WAKEUP;
 		ret = EINTR;
 		thread_lock(td);

I don't see the printf.

    "cond_test: WARNING: No isolation nor timeout control is being applied; you may get unexpected failures; see atf-test-case(4)
    "
792 100073 cond_test RET   write 125/0x7d
792 100073 cond_test CALL  mmap(0,0x21000,0x3<PROT_READ|PROT_WRITE>,0x1002<MAP_PRIVATE|MAP_ANON>,0xffffffff,0)
792 100073 cond_test RET   mmap 34366799872/0x8006bc000
792 100073 cond_test CALL  mmap(0,0x1000,0x3<PROT_READ|PROT_WRITE>,0x1002<MAP_PRIVATE|MAP_ANON>,0xffffffff,0)
792 100073 cond_test RET   mmap 34366935040/0x8006dd000
792 100073 cond_test CALL  mmap(0x7fffdfdfd000,0x201000,0x3<PROT_READ|PROT_WRITE>,0x400<MAP_STACK>,0xffffffff,0)
792 100073 cond_test RET   mmap 140736949374976/0x7fffdfdfd000
792 100073 cond_test CALL  mprotect(0x7fffdfdfd000,0x1000,0<PROT_NONE>)
792 100073 cond_test RET   mprotect 0
792 100073 cond_test CALL  thr_new(0x7fffffffd4c0,0x68)
792 100073 cond_test RET   thr_new 0
792 100073 cond_test CALL  _umtx_op(0x8002864a8,0xf<UMTX_OP_WAIT_UINT_PRIVATE>,0,0,0)
792 100090 cond_test RET   fork 0
792 100090 cond_test CALL  sigfastblock(0x1,0x800a12538)
792 100090 cond_test RET   sigfastblock 0
792 100090 cond_test CALL  _umtx_op(0x8006bc008,0x16<UMTX_OP_MUTEX_WAKE2>,0,0,0)
792 100090 cond_test RET   _umtx_op 0
792 100090 cond_test CALL  _umtx_op(0x800a127c0,0x15<UMTX_OP_NWAKE_PRIVATE>,0x1,0,0)
792 100073 cond_test RET   _umtx_op 0
792 100073 cond_test CALL  thr_kill(0x186fa,SIGTHR)
792 100073 cond_test RET   thr_kill 0
792 100073 cond_test CALL  _umtx_op(0x800a12500,0x2<UMTX_OP_WAIT>,0x186fa,0,0)
792 100090 cond_test RET   _umtx_op 0
792 100090 cond_test PSIG  SIGTHR caught handler=0x80027e1b0 mask=0x0 code=SI_LWP
792 100090 cond_test CALL  sigreturn(0x7fffdfffdac0)
792 100090 cond_test RET   sigreturn JUSTRETURN
792 100090 cond_test CALL  thr_wake(0x186fa)
792 100090 cond_test RET   thr_wake 0
792 100090 cond_test CALL  thr_wake(0x186fa)
792 100090 cond_test RET   thr_wake 0
792 100090 cond_test CALL  _umtx_op(0x8006dd008,0xf<UMTX_OP_WAIT_UINT_PRIVATE>,0,0,0)

GDB

Thread 2 (LWP 100090 of process 792):
#0  _umtx_op_err () at /local/scratch/alr48/cheri/freebsd/lib/libthr/arch/amd64/amd64/_umtx_op_err.S:40
#1  0x0000000800282350 in _thr_umtx_timedwait_uint (mtx=0x8006dd008, id=<optimized out>, clockid=<optimized out>, abstime=<optimized out>, shared=<optimized out>) at /local/scratch/alr48/cheri/freebsd/lib/libthr/thread/thr_umtx.c:236
#2  0x0000000800278fb9 in _thr_sleep (curthread=<optimized out>, clockid=0, abstime=0x0) at /local/scratch/alr48/cheri/freebsd/lib/libthr/thread/thr_kern.c:199
#3  0x000000080027463b in cond_wait_user (cvp=0x800a29120, mp=0x8006bc008, abstime=0x0, cancel=1) at /local/scratch/alr48/cheri/freebsd/lib/libthr/thread/thr_cond.c:320
#4  cond_wait_common (cond=<optimized out>, mutex=<optimized out>, abstime=0x0, cancel=1) at /local/scratch/alr48/cheri/freebsd/lib/libthr/thread/thr_cond.c:380
#5  0x0000000800274921 in __thr_cond_wait (cond=0x8006dd008, mutex=0xf) at /local/scratch/alr48/cheri/freebsd/lib/libthr/thread/thr_cond.c:395
#6  0x00000000002056d4 in destroy_after_cancel_threadfunc (arg=<optimized out>) at /local/scratch/alr48/cheri/freebsd/contrib/netbsd-tests/lib/libpthread/t_cond.c:557
#7  0x000000080027583b in thread_start (curthread=0x800a12500) at /local/scratch/alr48/cheri/freebsd/lib/libthr/thread/thr_create.c:292
#8  0x0000000000000000 in ?? ()
Backtrace stopped: Cannot access memory at address 0x7fffdfffe000

Thread 1 (LWP 100073 of process 792):
#0  _umtx_op_err () at /local/scratch/alr48/cheri/freebsd/lib/libthr/arch/amd64/amd64/_umtx_op_err.S:40
#1  0x0000000800277573 in join_common (pthread=0x800a12500, thread_return=0x0, abstime=0x0, peek=<optimized out>) at /local/scratch/alr48/cheri/freebsd/lib/libthr/thread/thr_join.c:147
#2  0x00000008002772ab in _thr_join (pthread=0x800a12500, thread_return=0x2) at /local/scratch/alr48/cheri/freebsd/lib/libthr/thread/thr_join.c:62
#3  0x000000000020550f in atfu_destroy_after_cancel_body (tc=<optimized out>) at /local/scratch/alr48/cheri/freebsd/contrib/netbsd-tests/lib/libpthread/t_cond.c:589
#4  0x000000080025d0cc in atf_tc_run (tc=0x208270 <atfu_destroy_after_cancel_tc>, resfile=<optimized out>) at /local/scratch/alr48/cheri/freebsd/contrib/atf/atf-c/tc.c:1008
#5  0x000000080025f14e in atf_tp_run (tp=<optimized out>, tcname=<optimized out>, resfile=<optimized out>) at /local/scratch/alr48/cheri/freebsd/contrib/atf/atf-c/tp.c:201
#6  0x000000080025fb71 in run_tc (tp=0x7fffffffda58, p=0x7fffffffda70, exitcode=<optimized out>) at /local/scratch/alr48/cheri/freebsd/contrib/atf/atf-c/detail/tp_main.c:504
#7  controlled_main (argc=<optimized out>, argv=0x7fffffffeb08, add_tcs_hook=0x2037f0 <atfu_tp_add_tcs>, exitcode=<optimized out>) at /local/scratch/alr48/cheri/freebsd/contrib/atf/atf-c/detail/tp_main.c:574
#8  atf_tp_main (argc=<optimized out>, argv=0x7fffffffeb08, add_tcs_hook=0x2037f0 <atfu_tp_add_tcs>) at /local/scratch/alr48/cheri/freebsd/contrib/atf/atf-c/detail/tp_main.c:604
#9  0x00000000002037ef in main (argc=10560768, argv=0x2) at /local/scratch/alr48/cheri/freebsd/contrib/netbsd-tests/lib/libpthread/t_cond.c:658
#10 0x0000000000203600 in _start (ap=<optimized out>, cleanup=<optimized out>) at /local/scratch/alr48/cheri/freebsd/lib/csu/amd64/crt1_c.c:75

I can reproduce the hang on an SMP system using cpuset -l 0 ./cond_test destroy_after_cancel, whereas cpuset -l 0,1 ./cond_test destroy_after_cancel seems to pass more than 90% of the time.

cpuset -l 0 worked for me, thanks. It is strange that the debugging patch did not print anything for you; that might indicate yet another issue.

But for me it did print, and I see the problem. D27597 fixed the hang on my machine.