Fixes: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=231793
Refactor sample ring buffer handling to make it more robust to long-running callchain collection
r338112 introduced a regression that exposed a number of race conditions within the management of the sample buffers. This change simplifies the handling and moves the decision to overwrite a callchain sample that has taken too long out of the NMI handler and into the hardclock handler. With this change the problem no longer shows up as ring corruption but as the code spending all of its time in callchain collection.
- Makes the producer and consumer indices increase monotonically, making them easier (for me at least) to reason about.
- Moves the decision to overwrite a sample from NMI context to interrupt context where we can enforce serialization.
- Puts a time limit on waiting to collect a user callchain, bounding the head-of-line blocking that causes samples to be dropped.
- Removes the flush routine, which was previously needed to purge dangling references to the pmc from the sample buffers but is now only a source of a race condition on unload.
Currently one can lock up or crash HEAD by running:
    pmcstat -S inst_retired.any_p -T
and then hitting ^C.
After this change that is no longer possible.