I'm open to feedback and suggestions for a different solution; apologies
for the long description.
I've seen a hang on a couple of systems using i915kms, apparently caused
by the following livelock:
1. Thread T1 holds a LinuxKPI spin mutex, and has therefore called
local_bh_disable(), which pins it to the CPU, i.e., the scheduler
can't migrate it to a different CPU.
2. T1 enters the seqlock-protected section in intel_engine_context_in()
or intel_engine_context_out().
3. T1 is interrupted by a timer interrupt, which schedules a callout
thread T2 on the same CPU. T2 executes rps_timer(), which calls
intel_engine_get_busy_time(), which spins on the seqlock owned by T1.
4. T1 never runs: it can't be migrated to another CPU, it has a lower
scheduling priority than T2, and there is no priority propagation
mechanism to stop T2 from starving T1. (The reader's spin loop is
sketched below.)
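To make the spinning concrete, the reader side of
intel_engine_get_busy_time() boils down to a standard seqlock retry
loop, roughly like this (simplified; the stats field names are
approximate):

    unsigned int seq;
    ktime_t total;

    /*
     * Retry until a stable snapshot is observed: read_seqretry()
     * keeps failing while the writer holds the seqlock (the
     * sequence number is odd), so if the preempted writer never
     * runs again, this loop spins forever.
     */
    do {
        seq = read_seqbegin(&engine->stats.lock);
        total = engine->stats.total;
    } while (read_seqretry(&engine->stats.lock, seq));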
On Linux the problem doesn't arise because write_seqlock_irqsave()
disables interrupts on the local CPU. In general LinuxKPI does not
implement the same
irq/irqsave semantics: FreeBSD executes very little code in hard
interrupt context, so LinuxKPI drivers tend to follow the FreeBSD model
and execute interrupt handlers in threaded contexts. That is, most KPIs
do not disable interrupts even where Linux does.
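For reference, the writer side in intel_engine_context_in() and
intel_engine_context_out() is roughly the following (again simplified,
field names approximate); on Linux the _irqsave variant masks
interrupts on the local CPU, so the timer interrupt from step 3 can't
fire there until the write section is over:

    unsigned long flags;

    write_seqlock_irqsave(&engine->stats.lock, flags);
    engine->stats.active++;    /* update the busyness stats */
    write_sequnlock_irqrestore(&engine->stats.lock, flags);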
So what can we do? Our local_bh_disable() doesn't actually disable
tasklets or softirq handlers, so its implementation seems wrong, and
maybe we could simply make it a no-op; then T1 could be scheduled on
another CPU. But on Linux, local_bh_disable() disables preemption, so
consumers might assume that it provides consistent access to per-CPU
data structures. Moreover, a no-op wouldn't help on a single-CPU
system, where migration is impossible anyway, and single-CPU
configurations may well matter now that some folks are working on
GVT-d.
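For context, as I understand the current implementation,
local_bh_disable() amounts to pinning the thread (paraphrased, not
verbatim):

    static inline void
    local_bh_disable(void)
    {
        /*
         * Pin the thread to its current CPU. This does not disable
         * preemption: the thread can still be taken off the CPU, it
         * just can't be migrated away from it.
         */
        sched_pin();
    }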
The solution here is to enter a critical section when
write_seqlock_irqsave() is called. Then the write seqlock holder will
never be preempted. This won't work if something ever tries to acquire
a spin lock in a write_seqlock section, but no existing consumers do.
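Concretely, the change amounts to something like this (a sketch; the
exact shape of the macros in our seqlock.h may differ, and there is no
interrupt state to save, so flags is just zeroed):

    #define write_seqlock_irqsave(sl, flags) do {            \
        (flags) = 0;        /* nothing to save */             \
        critical_enter();   /* no preemption from here on */  \
        write_seqlock(sl);                                    \
    } while (0)

    #define write_sequnlock_irqrestore(sl, flags) do {       \
        write_sequnlock(sl);                                  \
        critical_exit();                                      \
    } while (0)

critical_enter() is also why the spin lock caveat above exists: a
contended LinuxKPI spin mutex acquisition may have to block, and
context switches are forbidden inside a critical section.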
An alternative would be to have the reader yield the CPU if it fails
too many times, as sketched below, but Linux doesn't (need to) do this,
nor does our native seqlock implementation, so it feels too hacky.
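For completeness, that rejected workaround would look something like
this (hypothetical; the retry threshold is made up, and choosing a
yield priority that reliably lets the starved writer run is part of
what makes it hacky):

    unsigned int seq;
    ktime_t total;
    int retries = 0;

    for (;;) {
        seq = read_seqbegin(&engine->stats.lock);
        total = engine->stats.total;
        if (!read_seqretry(&engine->stats.lock, seq))
            break;
        /* Back off and let the (lower-priority) writer run. */
        if (++retries > 100)
            kern_yield(PRI_UNCHANGED);
    }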