This compound patch was inspired by a test case that was sent to me where ULE did very badly. It is simply:
cpuset -l <single cpu> shell of your choice
while forever &
cksum /dev/ssd
SSDs can respond in dozens or hundreds of microseconds. The checksum thread starts out interactive but quickly reaches 30+% cpu on my system and is marked for batch processing. At that point it waits on its timeslice behind the loop every time it is woken up, so it runs once per 100ms until it potentially decays back to an interactive thread, at which point it briefly preempts and runs back up to 30% cpu, where it loses interactivity again. This oscillation plays out over many seconds because the interactivity score is deliberately harder to enter than to leave. ULE was not designed with high-frequency wakeups from batch threads in mind. The code in tdq_runq_add() ensures that the loop will get to complete its slice if it is preempted; we don't preempt batch-priority threads. We do try to limit latency for cpu hogs by scaling down the slice, but even running once per tick here is insufficient for more than about 3% cpu on my system. If we want this to behave well, we have to allow some limited timeshare preemption.
The fix I'm experimenting with allows a kind of limited preemption between timeshare threads whose priorities are far enough apart. This somewhat defeats the fairness mechanism in the timeshare queue. It checks whether the priority delta exceeds a threshold and sets NEEDRESCHED if it does. To defeat the optimization where preempted threads are placed back at the head of the queue, we have to re-insert the favored thread at the head position as well. Since batch priorities are determined by %cpu, this allows the checksum thread to settle at a stable cpu consumption. On a small time scale it is actually falling in and out of the priority range that permits limited preemption, so it runs in short bursts, but this oscillation in state happens many times per second. This change would exacerbate the effect of negative nice values but not completely eliminate fairness. I have mixed feelings about the timeshare_delta as a result.
Along the way I noticed that we were not setting NEEDRESCHED on remote wakeups, and that we were generating somewhat excessive IPIs and preemptions. I also found and fixed two bugs affecting non-default sysctl settings. I will document those inline below.
This makes buildworld take 1% longer on my 16-core/32-thread AMD Threadripper. It reduces preemption by over 30%. The timeshare "preemption" adds 10% more NEEDRESCHED calls, but fixing the missing remote resched doubles the total count. I believe that is genuinely a bug and needs to be fixed despite the minor slowdown.