Page MenuHomeFreeBSD

hwpmc: Refactor sample ring buffer handling
AbandonedPublic

Authored by mmacy on Sep 4 2018, 8:05 AM.
Tags
None
Referenced Files
Unknown Object (File)
Sun, Jan 12, 9:20 AM
Unknown Object (File)
Sat, Jan 11, 9:15 AM
Unknown Object (File)
Nov 27 2024, 12:30 PM
Unknown Object (File)
Nov 22 2024, 8:24 AM
Unknown Object (File)
Nov 16 2024, 8:04 AM
Unknown Object (File)
Nov 13 2024, 6:20 PM
Unknown Object (File)
Nov 13 2024, 5:05 PM
Unknown Object (File)
Nov 8 2024, 7:17 AM

Details

Summary

Fixes: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=231793

Refactor sample ring buffer ring handling to make it more robust to long running callchain collection handling

r338112 introduced a regression that exposed a number of race conditions within the management of the sample buffers. This simplifies the handling and moves the decision to overwrite a callchain sample that has taken too long out of the NMI in to the hardlock handler. With this change the problem no longer shows up as a ring corruption but as the code spending all of its time in callchain collection.

  • Makes the producer / consumer index incrementing monotonic, making it easier (for me at least) to reason about.
  • Moves the decision to overwrite a sample from NMI context to interrupt context where we can enforce serialization.
  • Puts a time limit on waiting to collect a user callchain - putting a bound on head-of-line blocking causing samples to be dropped
  • Removes the flush routine which was previously needed to purge dangling references to the pmc from the sample buffers but now is only a source of a race condition on unload.

Currently one can lock up or crash HEAD by running:
pmcstat -S inst_retired.any_p -T and then hitting ^C

After this change it is no longer possible.

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Lint Skipped
Unit
Tests Skipped
Build Status
Buildable 19913

Event Timeline

mmacy retitled this revision from hwpmc: Refactor sample ring buffer handilng to hwpmc: Refactor sample ring buffer handling.

I see a "panic: [pmc,4980] pm=0x9e81380 runcount 0" on i386 with this patch. If I compile the hwpmc.ko module with "-O0" I do not get the panic.
https://people.freebsd.org/~pho/stress/log/mmacy032.txt
https://people.freebsd.org/~pho/kernel+vmcore.1106-x4.txz

Can you add an explanation what exactly changes intend to fix, and how they achieve robustness improvements.

sys/dev/hwpmc/hwpmc_mod.c
4878

If you need to synchronize between the interrupt context and normal kernel code, then dedicating normal spinlock would be better then abusing spinlock implementation bits.

I mean, if you do this for NMI interrupting userspace, spinlock would work. If this is called from NMI handler when the kernel was interrupted, you do not need neither spinlock nor spinlock_enter() (but I believe that this code is only used for the first case).

4907

This assert validity actually depends on the speed and load of the disk device or even network/NFS response time. I think it might be useful as a debugging tool on the specific machine, but not suitable for commit. Logging some message, as above, is useful.

  • no need to flush any more
  • remove invalid assert
  • remove unused force parameter

remove invalid assert

mmacy marked an inline comment as done.Oct 2 2018, 8:43 PM
mmacy added inline comments.
sys/dev/hwpmc/hwpmc_mod.c
4878

It's to serialize access between hardclock and non interrupt contexts manipulating samples. spinlock_{enter, exit} are a lot cheaper than atomics. I don't know that there is otherwise an architecture agnostic intr disable.

In D17011#363656, @kib wrote:

Can you add an explanation what exactly changes intend to fix, and how they achieve robustness improvements.

  • Makes the producer / consumer index incrementing monotonic, making it easier (for me at least) to reason about.
  • Moves the decision to overwrite a sample from NMI context to interrupt context where we can enforce serialization.
  • Puts a time limit on waiting to collect a user callchain - putting a bound on head-of-line blocking causing samples to be dropped
  • Removes the flush routine which was previously needed to purge dangling references to the pmc from the sample buffers but now is only a source of a race condition on unload.

Currently one can lock up or crash HEAD by running:
pmcstat -S inst_retired.any_p -T and then hitting ^C

After this change it is no longer possible.

@kib @markj Any further comments before I send this to RE?

sys/dev/hwpmc/hwpmc_mod.c
4685

ps_considx is a 64-bit value, which can't be updated atomically on all platforms. I think this handler could observe a torn write by the interrupted thread, though it seems that might be harmless for now.

Why did you change this away from ps_read/_write? Does it fix a bug?

4817

The indentation of the continuation line is wrong.

4837

The indentation looks wrong here.

4882

I know it's not new to this change, but PMC_SAMPLE_INUSE is a pretty bad name - any non-zero value for ps_nsamples means that the sample is "in use." You might call it PMC_SAMPLE_PENDING or so instead.

4960

Add a pm_stats entry for this?

4997–4999

Style: indentation is wrong.

5184

The cast should be to uint64_t.

5570

You don't need the else, just write pmc_sample_mask = pmc_nsamples - 1; after the block.

mmacy marked an inline comment as done.Oct 3 2018, 8:01 PM

I’ll update the patch with the other feedback.

sys/dev/hwpmc/hwpmc_mod.c
4685

I’ll look and see if a partial update could cause problems.

The change makes the code for looking at the delta simpler as I no longer have to deal with wrap. Primarily it makes it easier for me to reason about.

sys/dev/hwpmc/hwpmc_mod.c
4685

I suspect a partial update will not cause problems with the current code, it just seems fragile. I don't object to that part of the change, just noting that it could lead to bugs.

I had the same impression, but now you need to add these considx != prodix checks instead.

mmacy marked an inline comment as done.Oct 3 2018, 9:07 PM
mmacy added inline comments.
sys/dev/hwpmc/hwpmc_mod.c
4685

Its merits are subjective. Coming from having spent a lot of time in NIC drivers, index wrap was one of the things that annoyed me most. I'm quite happy with adding considx != prodidx.

mmacy marked an inline comment as done.Oct 3 2018, 9:25 PM
mmacy added inline comments.
sys/dev/hwpmc/hwpmc_mod.c
5184

_int64_t_ we don't want it to go negative.

This revision is now accepted and ready to land.Oct 4 2018, 9:40 PM
This revision was automatically updated to reflect the committed changes.
mmacy added a subscriber: mjg.

After this change sampling under load is truly terrible - rejecting 99+% of samples:

+ pmcstat -S unhalted_core_cycles -O ppid.pmcstat sleep 10
+ pmcstat -R ppid.pmcstat -z100 -G ppid.stacks
CONVERSION STATISTICS:
 #exec/elf                                1
 #samples/total                           566445
 #callchain/dubious-frames                565995
rajeeb.barman_amd.com added inline comments.
head/sys/dev/hwpmc/hwpmc_mod.c
4837 ↗(On Diff #48765)

Shouldn't be ps->ps_nsamples != PMC_USER_CALLCHAIN_PENDING

instead of ps->ps_nsamples == PMC_USER_CALLCHAIN_PENDING

??

head/sys/dev/hwpmc/hwpmc_mod.c
4837 ↗(On Diff #48765)

Yes, I think you're right.

head/sys/dev/hwpmc/hwpmc_mod.c
4837 ↗(On Diff #48765)

Hello there. I can confirm making this change takes care of the majority of lost samples, thank you for noting it!

We still need to verify validity of samples collected but it does look promising.

shreyank.amartya_amd.com added inline comments.
head/sys/dev/hwpmc/hwpmc_mod.c
3195 ↗(On Diff #48765)

In a scenario where pmc is attached to a target process other than owner (similar to pmcstat -t <pid>), pmc_start() returns without setting the machine dependent state of the pmc to start. Should this be modified to return only when a forced context switch occurs?