x86: fix a race during running thread stack capture resulting in unclaimed NMI
AbandonedPublic
Actions

Authored by mjg on Jan 17 2020, 10:30 PM.

Details

Reviewers

kib
jeff
jhb
markj

Summary

Stack saving handler will abort if curthread does not match the thread set for capture. However, no attempt is made to keep said thread on cpu and by the time NMI is received it could have been preempted. Claim the NMI even if the thread does not match and instead let the issuer know about the failure by setting stack depth to 0.

Another fix would to be thread lock around this.

Also note there is still a hypothetical bug where the pointer matches, but the actual thread exited and the area got reallocated.

Test Plan

to be validated yet, ran into by pho in https://people.freebsd.org/~pho/stress/log/mjguzik022.txt

NMI/cpu0 ... going to debugger
timeout stopping cpus
[ thread pid 97046 tid 101977 ]
Stopped at      link_elf_search_symbol+0x31:    nopw
db> bt
Tracing pid 97046 tid 101977 td 0xfffff801618b9000
link_elf_search_symbol() at link_elf_search_symbol+0x31/frame 0xfffffe017c6776b0
linker_debug_search_symbol_name() at linker_debug_search_symbol_name+0x93/frame 0xfffffe017c677720
linker_search_symbol_name_flags() at linker_search_symbol_name_flags+0x76/frame 0xfffffe017c677750
stack_sbuf_print_flags() at stack_sbuf_print_flags+0xb5/frame 0xfffffe017c6777e0
sysctl_kern_proc_kstack() at sysctl_kern_proc_kstack+0x1c2/frame 0xfffffe017c6778a0
sysctl_root_handler_locked() at sysctl_root_handler_locked+0x7b/frame 0xfffffe017c6778e0
sysctl_root() at sysctl_root+0x20a/frame 0xfffffe017c677960
userland_sysctl() at userland_sysctl+0x17b/frame 0xfffffe017c677a10
sys___sysctl() at sys___sysctl+0x5f/frame 0xfffffe017c677ac0
amd64_syscall() at amd64_syscall+0x2f1/frame 0xfffffe017c677bf0
fast_syscall_common() at fast_syscall_common+0x101/frame 0xfffffe017c677bf0
--- syscall (202, FreeBSD ELF64, sys___sysctl), rip = 0x80041c53a, rsp = 0x7fffffffcd28, rbp = 0x7fffffffcd60 ---

Diff Detail

Lint

Lint Skipped

Unit

Tests Skipped

Event Timeline

mjg created this revision.Jan 17 2020, 10:30 PM

Herald added a subscriber: imp. · View Herald TranscriptJan 17 2020, 10:30 PM

Harbormaster completed remote builds in B28746: Diff 66939.Jan 17 2020, 10:30 PM

mjg edited the summary of this revision. (Show Details)Jan 17 2020, 10:31 PM

mjg edited the summary of this revision. (Show Details)

Yes, this should fix the problem.

On the other hand, why do we need NMI delivery mode for IPI_TRACE, after all ? The handler ignores the IPI if interrupts were disabled, which is reasonable since we could have seen the context switch. But then we could use normal interrupt as well ?

This revision is now accepted and ready to land.Jan 18 2020, 12:24 PM

kib added a reviewer: markj.Jan 18 2020, 12:24 PM

I don't quite understand the problem: in stack_save_td_running() we hold the thread's lock and we assert TD_IS_RUNNING(). To switch off-CPU, the target thread must acquire the thread lock and call mi_switch(). So, it should spin until the NMI is received.

The problem is that td_oncpu updates are no longer synchronized by the thread lock in sched_switch() since r355784. It is possible for stack_save_td_running() to see TD_IS_RUNNING(td) && td->td_oncpu == NOCPU. In particular, there is nothing validating that td_oncpu is a valid CPU ID.

this will be spilt into different commits. based on the observation by markj:

prevent re-reading of td_oncpu in various places
assert passed cpu is in range in ipi_send_cpu
validate cpu number in stack_save_td_running

This revision now requires review to proceed.Jan 19 2020, 5:22 AM

kib added inline comments.Jan 19 2020, 7:35 PM

sys/x86/x86/stack_machdep.c
163	I do not believe there is any stack(9) consumer that handles EAGAIN by retry. I looked at procstat(1) kstack and it also give up on the thread when an error is returning, printing <running>. I think either stack_save_td_running() should pause (with unlock) and retry, or its kernel callers should do the same upon EAGAIN.

In D23243#509976, @mjg wrote:

this will be spilt into different commits. based on the observation by markj:

prevent re-reading of td_oncpu in various places

assert passed cpu is in range in ipi_send_cpu

validate cpu number in stack_save_td_running

Why not just set newtd->td_oncpu before calling cpu_switch() and cpu_throw() in ULE?

In D23243#510157, @markj wrote:

In D23243#509976, @mjg wrote:

this will be spilt into different commits. based on the observation by markj:

prevent re-reading of td_oncpu in various places

assert passed cpu is in range in ipi_send_cpu

validate cpu number in stack_save_td_running

Why not just set newtd->td_oncpu before calling cpu_switch() and cpu_throw() in ULE?

To be clear, like https://reviews.freebsd.org/D23270 , which fixes the bug for me.

I think adding an assert to ipi_send_cpu() is still worth doing, but IMO it is not really useful to have a lockless td_oncpu, and we would need to audit all other accesses.

I am fine with changing stack_save_td_running() to use a regular IPI but that is a separate issue.

I committed the assert in r356897. If changed assignment is really legitimate I'm more than happy to abandon this rev.

D23270