
x86: Refactor kernel-mode NMI handling
Accepted · Public

Authored by bnovkov on Aug 23 2024, 4:34 PM.

Details

Reviewers
kib
br
markj
avg
Summary

This revision refactors the kernel-mode NMI handler to make it possible to share performance counter interrupts.
The handler now allows multiple drivers to service the same interrupt (e.g. hwpmc(4) and hwt(4)'s Intel Processor Trace backend).
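
To ground the discussion that follows, here is a rough sketch of what such a shared-handler interface could look like. The names, signature, and helper functions below are illustrative assumptions, not the actual patch:

struct trapframe;

/*
 * Illustrative sketch only.  A handler returns nonzero if it consumed
 * the NMI and zero to let the remaining handlers try.
 */
typedef int (*nmi_handler_fn)(struct trapframe *frame);

void	nmi_register_handler(nmi_handler_fn handler);
void	nmi_remove_handler(nmi_handler_fn handler);

/* A consumer such as hwt(4)'s PT backend might then do: */
static int
pt_nmi_handler(struct trapframe *frame)
{
	if (!pt_buffer_interrupt_pending())	/* hypothetical helper */
		return (0);			/* not ours */
	pt_handle_buffer_overflow();		/* hypothetical helper */
	return (1);				/* consumed */
}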

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Skipped
Unit
Tests Skipped

Event Timeline

sys/amd64/amd64/trap.c
221

Should we return immediately if handled = true here?

228

The "we can return immediately" fragment doesn't really make sense anymore.

bnovkov added inline comments.
sys/amd64/amd64/trap.c
221

I avoided this since it could be possible that a single PC interrupt gets consumed by multiple handlers, at least in theory.
I'm not sure whether this situation is possible in practice, though.

I do not quite like the patch. May I suggest having a list of NMI handlers, registered dynamically?

I think registration for a singly-linked list can be done without synchronizing with NMI processing, and deregistration can simply NULL the function pointer in the list element. This way you would avoid the need to sync with (AKA disable) NMIs for list manipulation. And then we remove the need to patch trap.c for each NMI consumer.
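
A minimal sketch of that shape, with all names illustrative (memory-ordering details are glossed over here; they become a topic later in this review):

typedef int (*nmi_handler_fn)(struct trapframe *);

struct nmi_handler {
	nmi_handler_fn		func;	/* NULLed on deregistration */
	struct nmi_handler	*next;	/* elements are never unlinked */
};

static struct nmi_handler *nmi_handlers_head;

/*
 * NMI side: walk the list with no locking.  A NULL func marks a
 * deregistered entry; the element itself stays linked, so removal
 * needs no synchronization with NMI delivery.
 */
static int
nmi_call_registered(struct trapframe *frame)
{
	struct nmi_handler *hp;
	nmi_handler_fn f;
	int handled = 0;

	for (hp = nmi_handlers_head; hp != NULL; hp = hp->next) {
		f = hp->func;
		if (f != NULL && f(frame) != 0)
			handled = 1;
	}
	return (handled);
}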

sys/amd64/amd64/trap.c
219

This and the next line can be joined, I believe.

Is the HWT option already defined for this patch to land?

230

Again, join the lines.

In D46421#1058638, @kib wrote:

I do not quite like the patch. May I suggest having a list of NMI handlers, registered dynamically?

I think registration for a singly-linked list can be done without synchronizing with NMI processing, and deregistration can simply NULL the function pointer in the list element. This way you would avoid the need to sync with (AKA disable) NMIs for list manipulation. And then we remove the need to patch trap.c for each NMI consumer.

I like the approach, but after implementing it and thinking about the matter some more, I'm not sure how to proceed.
I'd like to hear what you think before updating the diff.

You can find the dynamic registration patch here. It implements the approach you described above, with the addition of using a UMA_ZONE_NOFREE zone for allocating handler entries and masking the PCINT LVT before invoking the handlers.
However, I'm not sure that adding this additional complexity is entirely justified.
Only the PCINT LVT is currently configured to use an NMI, and according to the AMD and Intel manuals the only PCINT consumers (so far) are performance counters and Intel PT.
Adding dynamic kernel-mode NMI handler registration seems like overkill when we're dealing with at most two handlers.
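
For reference, the NOFREE zone mentioned above would be created along these lines (a sketch assuming the stock uma(9) API; kib argues against the zone below):

#include <vm/uma.h>

static uma_zone_t nmi_handler_zone;

static void
nmi_handler_zone_init(void)
{
	/* UMA_ZONE_NOFREE: items are never released back to the VM. */
	nmi_handler_zone = uma_zcreate("nmi handlers",
	    sizeof(struct nmi_handler), NULL, NULL, NULL, NULL,
	    UMA_ALIGN_PTR, UMA_ZONE_NOFREE);
}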

I intended this approach to be applied to all NMI handlers, including nmi_handle_intr(). [This in particular means that the toggling of PCINT should be done inside the handlers.] Also, with some work, parts of the asm code for NMIs in exception.S can be moved to C.

Why do you need a zone for the handler list elements? Due to NOFREE? I do not think it is justifiable.

Why do you check for an existing handler registration when registering? If anything, the action should be a panic, IMO.

With these changes, it is quite a small addition and, most importantly, it makes the code much more modular, which is what I wanted: removing the knowledge of hwpmc, and now pt, from trap.c.

bnovkov retitled this revision from amd64: Refactor kernel-mode NMI handling to x86: Refactor kernel-mode NMI handling.

Address @kib's comments:

  • NMI handlers are now dynamically registered
  • Added same changes to i386/trap.c

I've updated the patch and tested this approach by having pmcstat and the PT driver run at the same time. I have not encountered any crashes while testing.

There is, however, a peculiar issue I encountered while working on this.
If I dynamically register the ipi_nmi_handler, I get dropped into ddb when running the test described above.
However, as you can see in the patch, everything works fine if ipi_nmi_handler is invoked before the dynamic handlers.
I'm not sure what's causing this issue.

In D46421#1059782, @kib wrote:

I intended this approach to be applied to all NMI handlers, including nmi_handle_intr(). [This in particular means that the toggling of PCINT should be done inside the handlers.] Also, with some work, parts of the asm code for NMIs in exception.S can be moved to C.

With these changes, it is quite a small addition and, most importantly, it makes the code much more modular, which is what I wanted: removing the knowledge of hwpmc, and now pt, from trap.c.

That makes sense, thanks for clearing things up!

Why do you check for an existing handler registration when registering? If anything, the action should be a panic, IMO.

I was on the fence about that one; I've changed it to a panic.

Why do you need zone for the handlers list elements? Due to NOFREE? I do not think it is justifiable.

Right, the handler list elements are never deallocated, which is why I created a NOFREE zone.

Please upload diffs with full context; otherwise the Phabricator feature for suggesting edits to the diff does not work.

I still suggest not using a zone; NOFREE for a few elements is hardly a justification for a whole zone. Plain malloc would be fine there.

sys/x86/x86/cpu_machdep.c
896 ↗(On Diff #142681)

This could be *nmi_handlers_head. The head pointer does not need to be the full struct, and you do not use it as one.

908 ↗(On Diff #142681)

Indent is wrong. It is worth checking for a NULL func using a plain atomic_load_ptr() before issuing the CAS.

917 ↗(On Diff #142681)

If the CAS failed, I do not think it would succeed on any later try. You need to re-read nmi_handlers_head.next on failure. See how atomic_fcmpset_ptr() is used to reuse the load done as part of the CAS.
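
The distinction kib is pointing at, sketched as fragments (hp is the entry being registered): atomic_cmpset() takes the expected value by value, so a failed CAS leaves you with a stale expectation unless you re-read by hand, while atomic_fcmpset() writes the observed value back through its old pointer on failure:

uintptr_t old;

/* cmpset: the expected value must be re-read on every failure. */
do {
	hp->next = nmi_handlers_head;
} while (!atomic_cmpset_ptr((volatile uintptr_t *)&nmi_handlers_head,
    (uintptr_t)hp->next, (uintptr_t)hp));

/* fcmpset: a failed CAS refreshes `old` with the current head. */
old = (uintptr_t)nmi_handlers_head;
do {
	hp->next = (struct nmi_handler *)old;
} while (!atomic_fcmpset_ptr((volatile uintptr_t *)&nmi_handlers_head,
    &old, (uintptr_t)hp));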

927 ↗(On Diff #142681)

Wrong indent.

Why use a CAS there and not just a plain store, if hp->func is equal to handler?

In D46421#1060536, @kib wrote:

Please upload diffs with full context; otherwise the Phabricator feature for suggesting edits to the diff does not work.

Sorry, I don't know how that slipped through. The updated diff has full context.

I still suggest not using a zone; NOFREE for a few elements is hardly a justification for a whole zone. Plain malloc would be fine there.

No problem, I've switched to plain malloc.

sys/x86/x86/cpu_machdep.c
917 ↗(On Diff #142681)

You are right, atomic_fcmpset_ptr was what I needed. Thanks!

sys/x86/x86/cpu_machdep.c
905 ↗(On Diff #142871)

If you use atomic_load() for accessing hp->func, it should be done consistently, I think, i.e. for each load and store.

917 ↗(On Diff #142871)

Indent should be +4 spaces

You do not need the loop body if you use atomic_fcmpset(). nmi_handlers_head should be loaded into hp->next once, and then you just do fcmpset() in a while() loop.
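
That collapses the registration to the shape below: a sketch in which hp->next doubles as fcmpset's expected value, so a failed CAS refreshes it in place and the while() body stays empty:

/* Sketch: hp is the new entry being linked in. */
hp->next = nmi_handlers_head;		/* load the head once */
while (!atomic_fcmpset_ptr((volatile uintptr_t *)&nmi_handlers_head,
    (uintptr_t *)&hp->next, (uintptr_t)hp))
	;	/* a failed CAS already refreshed hp->next */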

Since the patch is almost finished, I looked at the overall structure, and I do not see a reason to use atomics for accesses to ->func. Instead, note that the CAS is used against ->next, and there we do need atomics, and perhaps store-release on assignment as well as load-acquire on traversal.

sys/x86/x86/cpu_machdep.c
907 ↗(On Diff #143098)
hpf = atomic_load_ptr(&hp->func);
MPASS(hpf != handler);
if (hpf == NULL &&

My only concern with the current deregistration approach is a hypothetical situation where a handler resides in a loadable module and, thus, can be unloaded after deregistration.
I like the clever simplicity of the current code where list elements are never removed and only the head pointer can ever be modified.
But maybe it would be good for nmi_remove_handler to wait for the handler to complete if it's running.
E.g., we could add an atomic counter to struct nmi_handler which would be incremented when the handler is started and decremented when the handler is done.
Then, nmi_remove_handler could spin-wait on that counter if it's not zero (after having cleared func field to prevent future calls).
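
A sketch of that drain-on-removal idea, with illustrative field names; note that the counter must be incremented before func is (re-)read on the NMI side, or a handler could still be entered after nmi_remove_handler returns:

struct nmi_handler {
	nmi_handler_fn		func;
	struct nmi_handler	*next;
	u_int			running;	/* in-flight invocations */
};

/* NMI side, per list element (fragment): */
atomic_add_int(&hp->running, 1);	/* announce ourselves first */
f = hp->func;				/* then (re-)read func */
if (f != NULL)
	(void)f(frame);
atomic_subtract_int(&hp->running, 1);

/* Removal side: */
void
nmi_remove_handler_entry(struct nmi_handler *hp)
{
	hp->func = NULL;		/* no new invocations can start */
	while (atomic_load_int(&hp->running) != 0)
		cpu_spinwait();		/* drain an in-flight invocation */
}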

sys/x86/x86/cpu_machdep.c
892–895 ↗(On Diff #143452)

Style-wise, I would prefer if the nmi_handler structure declaration and the definition of the nmi_handlers_head variable were split.

sys/x86/x86/cpu_machdep.c
907 ↗(On Diff #143452)

The hp->next load is no different from the nmi_handlers_head load; both must have acquire semantics.

931 ↗(On Diff #143452)

Again, loading next should have acquire semantics, so that it synchronizes with the release in the store of the links, and we see the func store if we see the next element.

957 ↗(On Diff #143452)

Same acquire semantics needed here.

sys/x86/x86/cpu_machdep.c
907 ↗(On Diff #143452)

Please correct me if I'm interpreting something wrong, but would that really be necessary?
We only ever use atomics on nmi_handlers_head and not on the ->next links; shouldn't a plain load be sufficient for traversal after we atomically load the list head?

sys/x86/x86/cpu_machdep.c
907 ↗(On Diff #143452)

When you store to head, you link the new element into the list. What guarantees that, if we find the element on the list, we read the func value that was written there before the linkage?
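
The guarantee in question is the usual publish/consume pairing; sketched below with a single plain publication for brevity (the real registration uses fcmpset with release semantics, as discussed above):

/* Writer (registration): */
hp->func = handler;		/* (1) plain store, pre-publication */
hp->next = nmi_handlers_head;
atomic_store_rel_ptr((volatile uintptr_t *)&nmi_handlers_head,
    (uintptr_t)hp);		/* (2) release: orders (1) before it */

/* Reader (NMI path): */
hp = (struct nmi_handler *)atomic_load_acq_ptr(
    (volatile uintptr_t *)&nmi_handlers_head);	/* pairs with (2) */
if (hp != NULL && hp->func != NULL)
	hp->func(frame);	/* guaranteed to see the store in (1) */

Without the acquire on the link loads, the reader could observe hp through the head yet still read a stale func.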

Address @kib's and @avg's comments:

  • Use atomic loads for links
  • nmi_remove_handler will now wait for a handler to finish before returning

In D46421#1064708, @avg wrote:

My only concern with the current deregistration approach is a hypothetical situation where a handler resides in a loadable module and, thus, can be unloaded after deregistration.

Thank you, I hadn't thought about this kind of situation; the code should handle it properly now.

sys/x86/x86/cpu_machdep.c
907 ↗(On Diff #143452)

Ah, you are right; I've switched to atomic_load for the links.

Another general suggestion is that using CK_SLIST from ConcurrencyKit (included into FreeBSD) might help to make the code more compact and expressive.
But I haven't thought through details, so please feel free to ignore this suggestion if CK does not provide something that could be readily used.
The proposed code is completely good too.

sys/x86/x86/cpu_machdep.c
942 ↗(On Diff #143721)

IMO, this can/should be cpu_spinwait().

This revision is now accepted and ready to land. Sep 25 2024, 8:14 PM
In D46421#1066706, @avg wrote:

Another general suggestion is that using CK_SLIST from ConcurrencyKit (included into FreeBSD) might help to make the code more compact and expressive.
But I haven't thought through details, so please feel free to ignore this suggestion if CK does not provide something that could be readily used.
The proposed code is completely good too.

Yeah, please disregard this.
CK lists / queues are many-readers / single-writer.
So, unless you'd want to add a lock serializing registrations and deregistrations, those CK containers are not suitable.

At the same time, it might not be such a bad idea :-)

Since registration already uses malloc, it is not wait-free and it does not need to be wait-free in general.
Only nmi_handle_intr has to be wait-free.
So, adding the lock and reducing the atomic requirements to handle interactions between a single writer (a single thread (de-)registering a handler) and a single reader (or maybe multiple, if we can have concurrent NMIs) may simplify the atomic operations a little bit.
And CK could be used too.
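
If one went that way, the result might look like the sketch below; it assumes the CK_SLIST macros from ck_queue.h (shipped in sys/contrib/ck) and uses illustrative names throughout:

#include <sys/param.h>
#include <sys/lock.h>
#include <sys/mutex.h>
#include <ck_queue.h>

struct nmi_handler {
	nmi_handler_fn			func;
	CK_SLIST_ENTRY(nmi_handler)	link;
};

static CK_SLIST_HEAD(, nmi_handler) nmi_handlers =
    CK_SLIST_HEAD_INITIALIZER(nmi_handlers);
/* Initialized elsewhere with mtx_init(); never taken in NMI context. */
static struct mtx nmi_reg_mtx;

static void
nmi_register(struct nmi_handler *hp)
{
	mtx_lock(&nmi_reg_mtx);		/* serialize the writers */
	CK_SLIST_INSERT_HEAD(&nmi_handlers, hp, link);
	mtx_unlock(&nmi_reg_mtx);
}

/* The NMI path stays lock-free: */
static int
nmi_dispatch(struct trapframe *frame)
{
	struct nmi_handler *hp;
	int handled = 0;

	CK_SLIST_FOREACH(hp, &nmi_handlers, link)
		if (hp->func != NULL && hp->func(frame) != 0)
			handled = 1;
	return (handled);
}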

Oh, thinking about this triggered some old memories, and I've dug up my old musings on this topic.
The code was never really tested and I cannot guarantee that it's correct, but it seemed correct to the younger me, and it still seems correct to the present me.
https://people.freebsd.org/~avg/wfhandler.v8.h

Please feel free to ignore it or to pick any ideas, pieces of code, etc.

sys/x86/x86/cpu_machdep.c
910 ↗(On Diff #143721)

IMO, the load of ->func does not need to have atomic semantics, but it is fine to have it.

914 ↗(On Diff #143721)

The indent should be the same as on the previous line.

926 ↗(On Diff #143721)

The indent is wrong; it should be the same as on the previous line.

941 ↗(On Diff #143721)

I do not see a need for this load to have acquire semantics. Can you explain why it is there?

971 ↗(On Diff #143721)

Similarly, why does this add need to be release?

bnovkov marked an inline comment as done.

Address @kib's and @avg's comments.

This revision now requires review to proceed. Oct 2 2024, 4:52 PM
sys/x86/x86/cpu_machdep.c
941 ↗(On Diff #143721)

Right, this was unnecessary since reordering isn't an issue here; I've removed the acq/rel semantics from the running counter. Thanks for catching this!

kib added inline comments.
sys/x86/x86/cpu_machdep.c
910 ↗(On Diff #144133)

No need for this load to be atomic; you do not use an atomic store when NULLing func.

969 ↗(On Diff #144133)

Same.

This revision is now accepted and ready to land. Oct 2 2024, 9:27 PM