Could be atomic operations were expensive in the past, yet now the
branch is likely more expensive. As such simply always use atomic add
for counter increment.
Details
Diff Detail
- Repository
- rS FreeBSD src repository - subversion
- Lint
Lint Passed - Unit
No Test Coverage - Build Status
Buildable 46123 Build 43012: arc lint + arc unit
Event Timeline
This looks odd. This looks really odd. True, use of atomics might merely be required for PPI, but I'm wondering why one would bother to avoid atomics for non-PPI. Are there some processors where the atomic add is really expensive? Otherwise it seems more sensible to simply blanket use atomic add.
Isn't non-PPI the common case? It makes sense to me to want to avoid atomics when possible, then. A related problem is that interrupt counters are allocated from a single array of longs, so it's possible for counters for distinct interrupts on different cores to reside in the same cache line. In this case, LL/SC atomics could be quite expensive if different cores are contending for ownership of the cache line.
I'm skeptical that unconditional use of atomics is preferable to what's there today. I don't have numbers, but I know that counter(9)'s use of atomics to implement per-CPU counters on arm/arm64 has a non-negligible cost (in the sense that some networking benchmarks benefit from removing packet counters). A possible compromise, which admittedly will require more work, would be to use counter(9) for interrupt counters. Then, a counter increment will involve an atomic operation on some platforms, but at least the affected cache line will be private to the CPU, and you wouldn't need a branch. This approach would be welcome on x86 as well, I think.
Ah ha. Nothing I'd seen had referenced this, notably neither 686450c8984c nor bff6be3e9b2c. That does make sense.
What seems to going on is the intrcnt/intrnames interface simply seems badly inadequate now. The counters should really be attached to either the intr_event structure, or the various architecture "src" structures. Spreading them out would likely push them onto distinct cache lines and fix rather a lot of issues (see D35608).
Abandoning this, D36610 gets the issue of spreading the interrupt counters apart and likely onto separate cache lines.