Page MenuHomeFreeBSD

safer wait-free iteration of shared interrupt handlers
ClosedPublic

Authored by avg on Jun 19 2018, 3:25 PM.
Tags
None
Referenced Files
F107382362: D15905.id44523.diff
Mon, Jan 13, 8:48 AM
Unknown Object (File)
Thu, Jan 9, 8:54 AM
Unknown Object (File)
Thu, Jan 9, 3:49 AM
Unknown Object (File)
Sun, Dec 29, 6:01 AM
Unknown Object (File)
Thu, Dec 26, 12:09 PM
Unknown Object (File)
Nov 30 2024, 12:41 PM
Unknown Object (File)
Oct 26 2024, 8:56 AM
Unknown Object (File)
Oct 11 2024, 6:35 AM

Details

Summary

Preamble

This change matters only for shared interrupts and I realize that those
are becoming a thing of the past (and quickly). I also understand that
the problem that I am trying to solve is extremely rare. Nevertheless,
I would like to make the code a bit more complex and a little bit safer.

The change

This is a fix for https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=229106
plus some changes.

The idea of the fix is somewhat similar to the idea of the epoch
based reclamation. Atomic operations and memory fences are used to
ensure that ie_handlers list is always safe to navigate. A new element
is fully set up before getting linked into the list. And a removed
element is not reclaimed until we know that the list is not being
iterated.

There are some simplifications comparing to the general epoch based
reclamation. All writers are serialized using a mutex, so we do not
need to worry about concurrent modifications. Also, all read accesses
from the open context are serialized too. So, there can only be a
single unsynchronized reader at a time. It's either the code running in
the ISR context or an interrupt thread. Thus, instead of epoch numbers
it is possible to simply mark the start and end of ie_handlers
iteration.

Some details.
I changed ie_handlers from TAILQ to SLIST for two reasons:

  • SLIST is easy to make always consistent for lock-less iteration
  • SLIST is sufficient for all required operations given that its modifications are not performance critical

I use a previously unused SLIST_FOREACH_PREVPTR to search for a place
where to insert or remove an element. I do not use any SLIST_INSERT or
SLIST_REMOVE variants, but roll out their equivalents necessary to
ensure atomic properties. That somewhat breaks SLIST encapsulation.
It also reinvents CK_SLIST a little bit. I felt that using CK_SLIST
when most of accesses are done under a lock is an unneeded overhead.

I actually use two different mechanisms to wait for a completion of the
unsynchronized access. The ISR context atomically increments and
decrements 'running' indicator and the removals spin until it becomes
zero. The assumption is that interrupt filters are very fast.

The ithread code uses a similar but a different indicator. That
indicator is set and cleared under a spinlock. The removals also check
the indicator under the spinlock. If it's not zero, then the code
sets a special 'waiting' flag and uses msleep(9) to wait. The ithread
code wakes up the waiter(s) when clearing the running indicator. I did
it this way, because I assumed that the ithread processing can take much
longer than the filters, so it is not practical to spin for all that
time. But maybe a regular mutex would serve here better than the sleep
plus wakeup.

I also changed _intr_drain() that is used by LinuxKPI to do the same
kind of waiting as the interrupt handler removal does.

I removed a check for 'cold'. I believe that msleep() should work fine
in all contexts where intr_event_remove_handler could be called.

I made intr_event_execute_handlers() static because it's not used
outside of kern_intr.c. So, I decided to "appropriate" it.

Also, previously intr_event_update() was not called if a filter-only
intr_hanlder was removed.

A couple of white space fixes.

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Lint Passed
Unit
No Test Coverage
Build Status
Buildable 18349
Build 18068: arc lint + arc unit

Event Timeline

CK_SLIST when most of accesses are done under a lock is an unneeded overhead.

Can you elaborate on that overhead?

Also: this approach seems somewhat complicated compared to just slapping a lock around read accesses to the list, if the protection is only needed for already-slow shared interrupts and the lock is not shared across different interrupt numbers.

In D15905#337019, @cem wrote:

CK_SLIST when most of accesses are done under a lock is an unneeded overhead.

Can you elaborate on that overhead?

I mean that even if the list is accessed under a mutex the CK_SLIST operations would still use all the fences because they wouldn't know about the mutex.
Or if the code already has fences around a larger block that includes CL_LIST operations.
Maybe it's a trivial cost and nothing to worry about.

Also: this approach seems somewhat complicated compared to just slapping a lock around read accesses to the list, if the protection is only needed for already-slow shared interrupts and the lock is not shared across different interrupt numbers.

The shared interrupts are not inherently "already" slow. E.g. you can have two fast interrupt filters sharing the same interrupt.

sys/kern/kern_intr.c
816

I believe this load needs the acquire semantic, so that whatever happens in free become properly ordered with the iteration.

Did you looked at sys/seq.h ?

In D15905#337055, @avg wrote:
In D15905#337019, @cem wrote:

CK_SLIST when most of accesses are done under a lock is an unneeded overhead.

Can you elaborate on that overhead?

I mean that even if the list is accessed under a mutex the CK_SLIST operations would still use all the fences because they wouldn't know about the mutex.
Or if the code already has fences around a larger block that includes CL_LIST operations.
Maybe it's a trivial cost and nothing to worry about.

Yeah, maybe.

Also: this approach seems somewhat complicated compared to just slapping a lock around read accesses to the list, if the protection is only needed for already-slow shared interrupts and the lock is not shared across different interrupt numbers.

The shared interrupts are not inherently "already" slow. E.g. you can have two fast interrupt filters sharing the same interrupt.

Sure, but they will already ping-pong the linked list walk between cores if there is contention there; might as well ping pong the real lock instead.

In D15905#337073, @cem wrote:

Sure, but they will already ping-pong the linked list walk between cores if there is contention there; might as well ping pong the real lock instead.

I am not sure what exactly you mean by this.
What kind of contention?

In D15905#337213, @avg wrote:

I am not sure what exactly you mean by this.
What kind of contention?

Ah, I had assumed that independent handlers had independent ithreads, but I see that this is not the case. Such handlers are already totally serialized by intr_event_execute_handlers.

Maybe we could allocate independent ithreads for each handler on a shared interrupt source. I'm not sure what the ramifications of that are or if there is a good reason we do not already do that. It may just be that shared interrupts mostly date to the UP days.

sys/kern/kern_intr.c
816

Thank you for catching this.

Yes, I did look at seq.h.
But I think that it's not entirely applicable.
Initially I thought that the interrupt event code (in kern_intr.c) works under very simple constraints.
By now I realized that the situation is much more messy.
For example, I see that some architectures (!x86) use that facility to install IPI handlers.
Also, if I understand the code right, handlers for PPI (processor private interrupts such as, for example, LAPIC timer interrupt) can also be installed using the interrupt event interface.
Examples are platforms that use INTRNG (subr_intr.c) and some others, notably riscv.
So, that means that it is possible that intr_event_handle can be called with the same interrupt event argument concurrently on different cores. That complicates things.
I suspect that this possibility was not in the original design for interrupt events, but it is not prohibited.
I suspect that those "per-CPU" interrupts are never shared, but that is not explicitly prohibited.
I suspect that those "per-CPU" interrupts are never torn down, but that is not explicitly prohibited.

Maybe we could straighten up the interrupt event code (or its use).
But I do not want to introduce any implicit or hidden assumptions in the new code.
So, I think that I have to make it work for multiple concurrent readers scenario as well.

In D15905#337244, @cem wrote:

Maybe we could allocate independent ithreads for each handler on a shared interrupt source. I'm not sure what the ramifications of that are or if there is a good reason we do not already do that. It may just be that shared interrupts mostly date to the UP days.

This was implemented in the code that you recently removed (INTR_FILTER).

In D15905#337244, @cem wrote:

Maybe we could allocate independent ithreads for each handler on a shared interrupt source. I'm not sure what the ramifications of that are or if there is a good reason we do not already do that. It may just be that shared interrupts mostly date to the UP days.

Additionally, anything "interrupt thread" is completely irrelevant to interrupt filters that can also share an interrupt.

In D15905#337246, @avg wrote:
In D15905#337244, @cem wrote:

Maybe we could allocate independent ithreads for each handler on a shared interrupt source. I'm not sure what the ramifications of that are or if there is a good reason we do not already do that. It may just be that shared interrupts mostly date to the UP days.

This was implemented in the code that you recently removed (INTR_FILTER).

The code may have been in-tree, but it was not used. Also, I think it did more than just that, but I don't recall. What were the ramifications of using it, and was there a good reason it was never enabled? Or just inertia? Feel free to reverse the default if it makes sense — I just want one copy of the interrupt code.

In D15905#337247, @avg wrote:

Additionally, anything "interrupt thread" is completely irrelevant to interrupt filters that can also share an interrupt.

Yes, that is understood. The comment you are replying to is specifically talking about ithread handlers, and not filters.

Rework the proposed change.

This version still changes TAILQ to SLIST for maximum consistency.
Although, TAILQ would probably do as well given that the lock-free
iteration is always done from the head and in the forward direction.

I went back to the technique of deferring a removal of an interrupt
handler to the interrupt event's interrupt thread, if it exists.
Previously this was done only if the interrupt thread was already
running. Now it's done always in order to avoid races with the interrupt
thread getting to run just after the check. And with
intr_event_handle() too.

Another change is to shorten the time of waiting for intr_event_handle()
to become quiescent (and to avoid writer starvation). The new technique
uses two "phases", the current phase and the new phase. The phase is
switched (flipped) after a handler is removed from the interrupt event.
This way, newly arriving interrupts are accounted against the new phase
while the remover waits for the previous phase to quiesce. Two phases
are sufficient because all modifications are strictly serialized and
each modification drains the current phase before allowing the next
modification.

I also needed to add another flavor of SLIST_FOREACH,
SLIST_FOREACH_PREVPTR_SAFE. The new variant provides access to the next
pointer in the previous element and it is safe with respect to removing
of the current element.

The code still has some changes needed only for testing, they are marked
with TESTING.

Finally, in one place I need to use a memory order like C++
memory_order_consume. That is, I want to do a load of a variable and to
ensure that accesses that depend on that variable (and only they) would
be ordered after the load. The acquire order would certainly fulfill
that guarantee, but I have a feeling that on all supported platforms
that guarantee is provided by default.

By the way, here is a "purified" variant of the interrupt handling code without all the complications of the interrupt threads, software interrupts, interactions with other subsystems, etc.
This is purely a list of handlers that can be iterated in a wait-free manner.
I think that it could be used, for example, for installing NMI handlers in a more regular fashion than what we have now.
https://reviews.freebsd.org/P185
I hope that that code makes the idea behind this change a little bit more clear.

In D15905#337248, @cem wrote:

The code may have been in-tree, but it was not used. Also, I think it did more than just that, but I don't recall. What were the ramifications of using it, and was there a good reason it was never enabled? Or just inertia? Feel free to reverse the default if it makes sense — I just want one copy of the interrupt code.

The idea of the code was to have a separate interrupt thread for each filter + ithread-handler pair. If a handler had only a filter, then nothing changed. If a handler had only an ithread-handler, then nothing changed too, all such handlers were called from the same ithread. Only for filter + ithread-handler pairs there was a separate thread, so that the thread was scheduled only if the filter requested it.
I think that that was a very good improvement for its time. I think that it didn't get enabled by default because of a single vague report that the new code caused an interrupt loss. Maybe the code had a bug indeed, maybe not.
Anyway, I think that the ship has sailed. Shared interrupts seem to be going extinct. In the PC world they seem to be used only for legacy or non-critical on-board devices such as older USB controllers and consumer audio.
Anything where latency (and performance in general) matters gets a dedicated interrupt, typically an MSI. Network, storage, etc.

Stop me or I will commit this.
I will [pretend to] ignore any post-commit comments from the reviewers too.

It would be nice to separate the mechanical TAILQ -> SLIST conversion from the rest of the changes. That is trivial and can go in with no objection or review. It would also make it easier to review the smaller diff but more significant change to the interrupt code.

sys/kern/kern_intr.c
68

Probably rename this before commit

cem requested changes to this revision.EditedJun 25 2018, 8:02 PM

Please update with TAILQ->SLIST patch removed so it's easier for reviewers to see the meat of the change. (And then I'd ask for 24h.)

This revision now requires changes to proceed.Jun 25 2018, 8:02 PM
In D15905#338982, @cem wrote:

Please update with TAILQ->SLIST patch removed so it's easier for reviewers to see the meat of the change. (And then I'd ask for 24h.)

That's a reasonable request.
I have created a review for TAILQ->SLIST, see D16016.

In D15905#339107, @avg wrote:

That's a reasonable request.
I have created a review for TAILQ->SLIST, see D16016.

Thanks!

sys/kern/kern_intr.c
68

I might even leave it in unconditionally. Failpoints are useful to have in-tree and are low cost.

598–599

I can't pretend I understand the semantics of our atomic_store_<rel|acq>...() routines, but if kib says this is the right one, I'll believe it.

599

SLIST_INSERT_PREVPTR_ATOMIC, maybe. Could signal it is "hidden" with some underscores if you're not sure it has the exact semantics you want. I don't think open-coding SLIST-modification in a consumer file makes sense.

702

Seems like phase should just be bool.

801–804

Hm. Maybe it makes sense to put this in queue.h too? "_SLIST_REMOVE_PREVPTR_NOTRASH()?"

It's somewhat valuable to keep the QMD_SLIST_CHECK_PREVPTR(prevp, elm); even if we remove the TRASHIT() on the removed node's sle_next pointer.

1269

Why is this ifndef testing?

sys/kern/kern_intr.c
1269

Oops, this one is a mismerge. It's supposed to be around KASSERT. The reason is that my test setup constantly fires an interrupt regardless of any drivers attached. So, it's possible that there is an interrupt handler installed when intr_event_handle is called, but there is no interrupt handler by the time intr_event_schedule_thread is called.

sys/kern/kern_intr.c
599

I just fear a combinatorial explosion of queue.h macro variations. { INSERT, REMOVE } X{BEFORE, AFTER, PREVPTR} X { ATOMIC, non-atomic } X { NOTRASH, trash } X { etc }.
Maybe I should really use CK_SLIST instead of SLIST here.
I feel somewhat uncomfortable doing that as CK is too new for me and adding extensions to CK could be a harder / longer process as it is contributed code.

702

I look at it as a one-bit integer :)

@avg Does FreeBSD plan to support Alpha? I honestly wouldn't bother with memory_order_consume unless the FreeBSD model mandates it. All modern architectures are sane with respect to data-dependent loads. Plus, there is a lot more code that would actually need to be using it, that doesn't.

On x86* and other TSO architectures, note that the fences will be emitted. With respect to code movement, the loads will be not be elided in eitherway. I doubt there would be measurable overhead.

sbahra_repnop.org added inline comments.
sys/kern/kern_intr.c
695

This requires atomic / volatile load, formally speaking.

703

My understanding here is ie->ie_phase is single-writer, many reader, yes? (in this stage of the state machine)

719

Small optimization, but if you want, you only need to do the acquire barrier after exiting the loop, no need to do it on every iteration.

1212

This load should be using an atomic. In ck, you can use ck_pr_load_int(&ie->ie_phase) to make the compiler do the right thing but FreeBSD atomics also provide this.

In this case, you want acquire or at least a load fence, and not memory_order_consume. The load is not actually data dependent, this can be viewed as a control dependency instead.

1216

This seems a bit weird to me. I might be missing a detail.

Is it guaranteed the list won't be mutated once ie_active[phase] is set to 1? If so, you need to ensure the atomic is ordered with respect to the subsequent load, which means a full fence is required (ck_pr_fence_atomic_load or the equivalent of full fence / seq_cst in FreeBSD). Note, that full fence isn't needed for TSO architectures though, because atomic_add will be implemented using atomic, which will have same semantics as full fence (you want something like add_seq_cst, in reality).

If you don't want full fence, then you'll need CK_LIST-like fence semantics. If list will be actively mutated, you'll also need those fences (and ensure write-side ordering linking of a node appropriately).

This revision now requires changes to proceed.Jul 12 2018, 1:53 AM
sys/kern/kern_intr.c
695

ie_phase is volatile and on all FreeBSD supported platforms load and stores are atomic without any special arrangements.

703

Yes, ie_phase is always single-writer and many readers.

719

That's a good suggestion. Thanks!

1212

Umm... not sure if we are on the same page here.
My comment in the code is about loading ie_phase -> phase and then accessing ie_active[phase]. These two operations are data dependent right?

1216

Is it guaranteed the list won't be mutated once ie_active[phase] is set to 1?

No, it's not guaranteed.

The design is this.
Readers increment / decrement ie_active[phase] around the section where the list is accessed (where phase is a snapshot of ie_phase).
A writer that wants to remove a list element first switches the ie_phase, then modifies the list and finally waits for ie_active[previous phase] to go to zero. Only then the writer can free the removed element.

sys/kern/kern_intr.c
695

By specification, volatile has no atomicity guarantees and movement is only guaranteed WRT to other volatile unless there is data dependency.

If you are saying the FreeBSD kernel memory model guarantees atomicity, then ok, but it seems haphazard (as it cannot be guaranteed absent other guarantees of no split load, store or coalescing). GCC will do the right thing in general though, but atomic load is strong guarantee.

https://www.cs.utah.edu/~regehr/papers/emsoft08-preprint.pdf

1216

You need a full fence to ensure phase counter increment is not re-ordered with the subsequent load on linked list on relaxed memory model architectures. Store-load requires full fence.

sys/kern/kern_intr.c
695

I understand what you say, but FreeBSD code has that assumption in many places and it is documented in atomic(9), so I do not see any reason to make this code stand out.
FreeBSD simply does not support any architectures where a split read or write of a naturally sized int can occur.
Please see the link.

1216

Do you mean atomic_add_acq_int(&ie->ie_active[phase], 1) on line 1228 above?

If yes, then I must admit that I am a little bit confused about how operations like that work.
This is an add, so it must first virtually load the current value and then store the new value. And those load and store must appear as a single atomic operation. In this case the operation also has the acquire semantics that makes sense only for loads. So, I hoped that atomic_add_acq guarantees that no subsequent loads and stores can be reordered before that add. Is that not so? Then, what purpose the acquire order could possibly serve for atomic add?

sys/kern/kern_intr.c
1216

Sorry for confusion. Yeah, I'm referring to the atomic_add_acq_int operation. This means a full fence, in this case, seq_cst will suffice or full fence after a relaxed atomic_add.

sys/kern/kern_intr.c
1216

Okay, got it. But still not clear about the reason...
Would it be correct to say that from the memory ordering point of view the atomic add appears like load + store and the acquire order applies only to the load part of the add?

@cem, you are shown as having requested a change but looking into the comment history I am not sure what that change might be.
Could you please clarify?

In D15905#348328, @avg wrote:

@cem, you are shown as having requested a change but looking into the comment history I am not sure what that change might be.
Could you please clarify?

It was this: https://reviews.freebsd.org/D15905#338982

Which has been completed. I don't know of a way in phabricator to un-request changes other than approving, which isn't quite the same thing.

Per suggestions from Samy Al Bahra:

  • turn a memory order attribute of an atomic operation within a loop into a standalone memory fence after the loop
  • use a sequential consistency fence to prevent re-ordering of atomic_add and the subsequent loads

remove TESTING / KFAIL code

In D15905#348331, @cem wrote:

It was this: https://reviews.freebsd.org/D15905#338982

Which has been completed. I don't know of a way in phabricator to un-request changes other than approving, which isn't quite the same thing.

Ah, I see.
I wonder if there is some trick to overcome this.
Maybe removing yourself from reviewers and then re-adding back.

I am considering all previous requests for changes as outdated and I am going to commit this.
Please state if you still have any objections.

cem removed a reviewer: cem.
cem added a subscriber: cem.

Hm resigning does not actually remove you as a reviewer.

No objection from me. I do not grok FreeBSD atomic rel/acq semantics well enough to say this with any authority, but it looks good to me.

This revision was not accepted when it landed; it landed in state Needs Review.Aug 3 2018, 2:27 PM
This revision was automatically updated to reflect the committed changes.