
Reduce unnecessary preemption, add a preemption knob for timeshare, fix missing NEEDRESCHED
Needs ReviewPublic

Authored by jeff on Jun 24 2018, 12:07 AM.
Details

Summary

This compound patch was inspired by a test case that was sent to me where ULE did very badly. It is simply:

cpuset -l <single cpu> shell of your choice
while forever &
cksum /dev/ssd

SSDs can respond in dozens or hundreds of microseconds. The checksum starts out interactive but quickly reaches 30+% cpu on my system and is marked for batch processing. At that point it now waits on its timeslice behind the loop every time it is woken up. This means it runs once per-100ms until it potentially decays back to an interactive thread, where it briefly preempts and runs back up to 30% cpu, where it loses interactivity again. This oscillation happens over the course of many seconds because the interactivity score is purposefully harder to enter than leave.

ULE was not designed with high frequency wakeups for batch processes. The code in tdq_runq_add() ensures that the loop will get to complete its slice if it is preempted. We don't preempt batch priority tasks. We do try to limit latency for cpu hogs by scaling down the slice, but even running once per-tick here is insufficient for more than about 3% cpu on my system. If we want this to behave well, we have to allow some limited timeshare preemption.

The fix I'm experimenting with is allowing a kind of limited preemption for timeshare threads whose priorities are far enough apart. This somewhat defeats the fairness mechanism in the timeshare queue. What it does is check whether the priority delta exceeds a threshold and set NEEDRESCHED if it does. To defeat the optimization where preempted threads are placed back at the head of the queue, we have to re-insert this favorable thread at the head queue position. Since priorities for batch are determined by %cpu, this allows the checksum thread to reach a stable cpu consumption. On a small time scale it is actually falling in and out of the priority range that allows for limited preemption, so it runs in short bursts, but this oscillation in state happens many times per second. This change would exacerbate the effect of negative nice but not completely eliminate the fairness. I have mixed feelings about the timeshare_delta as a result.
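A minimal sketch of the delta check described above. The knob name `timeshare_delta` matches the review discussion, but the threshold value, the helper name, and the priority numbers are illustrative assumptions, not the actual patch:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical sysctl knob: minimum priority delta between a waking
 * timeshare thread and the running one before we set NEEDRESCHED.
 * A value of 0 disables timeshare preemption (the pre-patch behavior).
 */
static int timeshare_delta = 32;

/*
 * Lower numeric priority is better, as in the FreeBSD scheduler.
 * Returns true when the waking thread is enough better than curthread
 * that curthread should be marked with TDF_NEEDRESCHED.
 */
static bool
timeshare_should_resched(int waking_pri, int cur_pri)
{
	if (timeshare_delta == 0)
		return (false);
	return (cur_pri - waking_pri >= timeshare_delta);
}
```

Note this only requests a reschedule; it does not force an immediate context switch, which is what keeps the preemption "limited".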

Along the way I noticed that we were not setting NEEDRESCHED on remote wakeups and we were generating somewhat excessive IPIs and preemption. I also found two bugs in non-default sysctl settings that I fixed. I will document those inline below.

This makes buildworld take 1% longer on my 16-core/32-thread AMD TR. It reduces preemption by over 30%. The timeshare "preemption" adds 10% more NEEDRESCHED calls, but the missing remote RESCHED doubles the total count. I believe that is actually a bug and needs to be fixed despite the minor slowdown.

Test Plan

I am hoping to get testing in environments where the reduced preemption would be more significant than the increased NEEDRESCHED switches. Things with lots of I/O and kernel time are good candidates.

Diff Detail

Repository
rS FreeBSD src repository - subversion

Event Timeline

sched_ule.c
467–468

This avoids excessive NEEDRESCHED if the thread won't run immediately.

469

If curthread was preempted after it added itself to a sleep queue, this will be false.

504

What will happen here is that we send the IPI but do not actually preempt for more than the IPI handler. So if the IPI arrives while we're in kernel code, we will not switch until ast(). If it arrives while we're in user code, we will switch in the ast() on IPI return. In this way we lower the latency for interactive tasks without increasing kernel contention from preemption.
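The policy above can be sketched as a small decision function. The priority-band constants, the enum, and the function name are assumptions made for illustration; only the shape of the decision (preempt now for realtime, defer to ast() for timeshare) comes from the comment:

```c
#include <assert.h>

/* Hypothetical priority bands, loosely modeled on FreeBSD's layout. */
#define PRI_MAX_REALTIME 79

enum notify_action { NOTIFY_NONE, NOTIFY_AST, NOTIFY_PREEMPT };

/*
 * Sketch of the remote-wakeup policy: realtime/interactive wakeups get
 * an immediate preemption IPI, while timeshare wakeups only set
 * NEEDRESCHED and send an AST, so the remote CPU reschedules at the
 * next return through ast() rather than mid-kernel.
 */
static enum notify_action
tdq_notify_action(int waking_pri, int cur_pri)
{
	if (waking_pri >= cur_pri)
		return (NOTIFY_NONE);	/* not better; nothing to do */
	if (waking_pri <= PRI_MAX_REALTIME)
		return (NOTIFY_PREEMPT);/* interactive: switch now */
	return (NOTIFY_AST);		/* batch: defer to ast() */
}
```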

1640–1641

This didn't work for values of sched_interact that diverged greatly from MAX_INTERACT - MIN_INTERACT. It just so happens it produces reasonable values with the defaults.
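One way to make the mapping hold for any sysctl value is to divide by the configured threshold rather than by a constant derived from the defaults. This is a hedged sketch of that idea, not the code in the patch; the band constants and function name are assumptions:

```c
#include <assert.h>

#define PRI_MIN_INTERACT 88	/* assumed band, illustration only */
#define PRI_MAX_INTERACT 119

static int sched_interact = 30;	/* scores below this are interactive */

/*
 * Map an interactivity score in [0, sched_interact) onto the whole
 * interactive priority band.  Scaling by sched_interact keeps the
 * mapping correct when the sysctl is tuned away from its default,
 * instead of only "happening to produce reasonable values".
 */
static int
interact_to_pri(int score)
{
	int range = PRI_MAX_INTERACT - PRI_MIN_INTERACT + 1;

	return (PRI_MIN_INTERACT + score * range / sched_interact);
}
```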

2246

With a static_boost of 1 and no prio set from the caller you would elevate to priority 1.

2418

This is responsible for most of the reduction in preemption counts. In this way the IPI will still notify the remote processor immediately of the scheduling event but we don't necessarily have to respond right away.

sched_ule.c
500–501

Unfortunately this works when you are considering two threads that are more than delta apart but not three threads.

Thread a is pri 0, thread b is + .5delta and thread c is 1.5delta.

Thread c is running.
Thread a "preempts" thread c and is moved to runqueue position ridx.
Thread c is at ridx behind thread a.
Thread b is scheduled but doesn't trigger this path since it's not better than a; it is added at runqueue position ridx + pri, behind both a and c.

Thread a finishes running or finishes its slice.
Thread c now runs, whereas if a hadn't been present, thread b would have run.

If you reverse the order of a and b you get the same inversion.

This improves the situation and given stochastic wake times will probably average out ok. It doesn't yet feel as satisfying as it could be though.

@jeff I do not completely understand why in this scenario cksum runs behind the loop. Does cksum get a priority worse than the loop?
Otherwise, I would expect that even without preemption TDF_NEEDRESCHED would produce a similar effect.

Hmm, now I see a difference between 4BSD and ULE.
If kick_other_cpu does not preempt the remote CPU, then it does this:

pcpu->pc_curthread->td_flags |= TDF_NEEDRESCHED;
ipi_cpu(cpuid, IPI_AST);

On the other hand, tdq_notify either preempts or does nothing at all.

Also, what's a practical difference between TDF_NEEDRESCHED + IPI_AST and IPI_PREEMPT ?

In D15985#338884, @avg wrote:

@jeff I do not completely understand why in this scenario cksum runs behind the loop. Does cksum get a priority worse than the loop?
Otherwise, I would expect that even without preemption TDF_NEEDRESCHED would produce a similar effect.

https://people.freebsd.org/~jeff/FBSD_14_10ULE_aj.jm_cx10_09.pdf

I think this pdf is brief and informative, if I may say so myself. Timeshare threads in ULE enforce fairness through a kind of calendar queue. In 4BSD this is accomplished by periodically adjusting priorities and moving threads on the run-queue. In ULE I was trying to avoid a periodic scheduling thread, so what I did instead was to insert the thread into the queue at a position relative to its current priority. As time goes on, a queue index marches forward, bringing even low priorities to the head of the queue. This, combined with giving full slices to the batch queue, creates a condition where a thread that really wants to sleep and run for tens of microseconds at a time has to wait a full 100ms for a timeslice.
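The calendar-queue insert described above reduces to choosing a bucket relative to the marching index. A minimal sketch, with the bucket count taken from FreeBSD's runqueue but the function name assumed:

```c
#include <assert.h>

#define RQ_NQS 64	/* number of runqueue buckets, as in FreeBSD */

/*
 * Sketch of ULE's calendar-queue insert for timeshare threads: the
 * bucket is chosen relative to the advancing index tdq_ridx, offset by
 * the thread's CPU-usage-derived priority, modulo the queue size.
 * Worse priorities land further from the head, but the marching index
 * eventually brings every bucket to the front, bounding starvation.
 */
static int
timeshare_insert_idx(int ridx, int pri_offset)
{
	return ((ridx + pri_offset) % RQ_NQS);
}
```

This is why a frequently sleeping batch thread fares so badly: each wakeup re-inserts it a full priority offset behind the index, so it waits out most of the calendar before running again.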

In D15985#338891, @avg wrote:

Hmm, now I see a difference between 4BSD and ULE.
If kick_other_cpu does not preempt the remote CPU, then it does this:

pcpu->pc_curthread->td_flags |= TDF_NEEDRESCHED;
ipi_cpu(cpuid, IPI_AST);

On the other hand, tdq_notify either preempts or does nothing at all.

I had just forgotten about IPI_AST but I like the way I have implemented it here better. It gives the remote scheduler a chance to look at what's going on and make a new decision.

I think what I would like to do is commit this with the timeshare preempt delta disabled until I get more experience with it and see if I can reason out a better algorithm.

Please review with that in mind.

In D15985#338992, @jeff wrote:

I had just forgotten about IPI_AST but I like the way I have implemented it here better. It gives the remote scheduler a chance to look at what's going on and make a new decision.

I just hoped that maybe a smaller change could fix the problem.
Honestly, to me this change looks overly specialized towards the problem it solves (rather than a general improvement of the scheduling logic).
Also, I don't quite like that _sched_shouldpreempt_ which was a pure function now becomes a function with non-obvious side effects.

In D15985#339080, @avg wrote:
In D15985#338992, @jeff wrote:

I had just forgotten about IPI_AST but I like the way I have implemented it here better. It gives the remote scheduler a chance to look at what's going on and make a new decision.

I just hoped that maybe a smaller change could fix the problem.
Honestly, to me this change looks overly specialized towards the problem it solves (rather than a general improvement of the scheduling logic).
Also, I don't quite like that _sched_shouldpreempt_ which was a pure function now becomes a function with non-obvious side effects.

The majority of the patch is actually unrelated to the specific issue. Only the timeshare delta and tdq_runq_elevate are for the problem. The others were just issues I noticed when reviewing the code while looking into this issue. I can break them off and commit them one at a time if it becomes more clear. I'm not even certain I want to commit the timeshare_delta as it is because I am hoping someone may suggest a more elegant fix.

In D15985#339084, @jeff wrote:

The majority of the patch is actually unrelated to the specific issue. Only the timeshare delta and tdq_runq_elevate are for the problem. The others were just issues I noticed when reviewing the code while looking into this issue. I can break them off and commit them one at a time if it becomes more clear. I'm not even certain I want to commit the timeshare_delta as it is because I am hoping someone may suggest a more elegant fix.

IMO, it would be better to do changes this way. Thank you!

In D15985#339132, @avg wrote:

IMO, it would be better to do changes this way. Thank you!

Ok but I'm not likely to make a bunch of little reviews. These are inter-related enough to be presented as a whole.

I ran this on a Netflix 100g box. I observed no measurable difference in CPU time. So I think this patch is "neutral" from the perspective of our (mostly kernel) workload.

An interesting approach might be to put non-interactive threads on the real-time queue if they are preempted or sleep/wakeup before they have consumed their entire time slice. This shouldn't change anything for a totally CPU-bound thread unless it gets preempted. For the cksum example, the thread would temporarily be treated more like a low-priority interactive thread until it manages to use up its time slice and gets put back on the timeshare queue, where it will have to sit and wait for its turn to run again. In a situation like this, there could be more than one thread removed from the time share queue.

One complication is that we would need to keep track of the total thread runtime across multiple sleep/wakeup cycles from the time that it was pulled out of the time share queue. Another is a potential fairness problem. If a thread uses 90% of its slice before going to sleep, how easy will it be to suspend the thread and put it back on the time share queue after it consumes the next 10%, as opposed to letting it consume an entire slice more? If this is difficult, then when the wakeup happens we could randomly start it running 10% of the time or put it on the time share queue 90% of the time. That should average out to being fair overall. This may complicate thread stealing as well.
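The randomized-fairness idea above can be sketched in a few lines. Everything here is illustrative: the function name, and the stand-in `rand01` value, which represents a cheap per-CPU random draw in [0, 1):

```c
#include <assert.h>

/*
 * Sketch of the probabilistic fairness scheme: a thread that slept
 * after consuming part of its slice is allowed to run immediately with
 * probability equal to its *remaining* slice fraction, and is
 * otherwise put back on the timeshare queue.  Over many wakeups this
 * averages out to consuming one fair slice per calendar pass.
 */
static int
wakeup_runs_now(double slice_used_frac, double rand01)
{
	double remaining = 1.0 - slice_used_frac;

	return (rand01 < remaining);
}
```

A thread that has used 90% of its slice would thus run immediately on only about 10% of wakeups, matching the 10%/90% split proposed above.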

The other changes look reasonable after my brief reading.

Another potential advantage of the scheme that I suggested is that I think it should reduce thrashing in certain circumstances. For instance, if there are a bunch of cksum-like threads running in parallel, only a limited number of them will be pulled out of the time share queue. Once sufficient threads have pulled off that queue to fully occupy the CPU, no more will be started until some of the first bunch reach the end of their time slices. The current implementation will rapidly churn through the contents of the time share queue as each thread goes to sleep and triggers the next thread in the queue to be started.

Also, a thread that has an average runtime between wakeups that is much less than the batch time slice needs to be able to interrupt a totally CPU-bound thread.

I ran this on a Netflix 100g box. I observed no measurable difference in CPU time. So I think this patch is "neutral" from the perspective of our (mostly kernel) workload.

I did before and after poudriere runs. Both were about 10 1/2 hours. The run after the patch was about 0.5% faster. I don't know how much noise there is in the benchmark results, so I don't know if this result is significant.