
sched_ule: rate limit work stealing
AbandonedPublic

Authored by gallatin on Apr 30 2020, 11:38 PM.

Details

Summary

When executing a workload where threads wake up and sleep frequently, such as a web server workload with RACK TCP using tcphpts pacing, the scheduler frequently calls cpu_search_highest() to try to find a more heavily loaded core and steal work from it. Netflix 100Gb/s servers call cpu_search_highest() as many as 7 million times per second.

This change introduces a rate-limiting mechanism that restricts this search for work to a configurable interval per core. Limiting the search to once every 10ms moves the cpu_search() code from the 3rd or 4th hottest function on the system down into the noise, drops CPU load by a few percent, and increases throughput by 4-5Gb/s on experimental machines with multiple 100GbE links.

Note that I used getsbinuptime() intentionally: it didn't seem like we needed the precision of sbinuptime(), and when we tried sbinuptime(), tsc_get_timecount_low() surfaced as a hot function in profiling and we saw smaller CPU savings.
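
For illustration only, here is a minimal, self-contained sketch of the kind of per-CPU rate limit described above, assuming a hypothetical kern.sched.steal_interval tunable and a hypothetical per-CPU timestamp field; the names and structure are mine, not taken from the actual patch.

```c
/*
 * Illustrative sketch only, not the actual patch: a per-CPU timestamp
 * gates how often the idle work-steal search may run.  The tunable name
 * (kern.sched.steal_interval) and the struct/field names are hypothetical.
 */
#include <sys/param.h>
#include <sys/systm.h>
#include <sys/sysctl.h>
#include <sys/time.h>

SYSCTL_DECL(_kern_sched);

static int steal_interval_ms = 0;	/* 0 = rate limit disabled (default) */
SYSCTL_INT(_kern_sched, OID_AUTO, steal_interval, CTLFLAG_RWTUN,
    &steal_interval_ms, 0,
    "Minimum ms between idle steal searches per CPU (0 = no limit)");

struct steal_state {		/* stand-in for a field in the per-CPU run queue */
	sbintime_t ss_last_steal;
};

/* Return non-zero when this CPU searched too recently and should skip. */
static int
steal_ratelimited(struct steal_state *ss)
{
	sbintime_t now;

	if (steal_interval_ms == 0)
		return (0);
	/* getsbinuptime() is coarse but cheap; sbinuptime() showed up hot. */
	now = getsbinuptime();
	if (now - ss->ss_last_steal < (sbintime_t)steal_interval_ms * SBT_1MS)
		return (1);
	ss->ss_last_steal = now;
	return (0);
}
```

In this form the check costs one coarse timestamp read and one 64-bit multiplication per skipped search, which is the overhead discussed later in the thread.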

Diff Detail

Repository
rS FreeBSD src repository - subversion

Event Timeline

I have thought about something like this many times. It would be nice to skip this often-useless work, but each time I am stopped by the worry that it may cause CPU under-utilization on systems with a load average of about 1 per SMT thread. The 10ms interval you mentioned sounds quite large to me, and honestly I can imagine some synchronous, irregular load patterns where any chosen interval will cause a pessimization, dramatically increasing execution latency.

Alternatively, I have been thinking about not blocking stealing completely, but limiting it to only some levels of the CPU topology, somewhere between the threads of one core and one socket, so that we check neighboring threads often (it should be cheap) while checking other sockets only occasionally. That should still significantly reduce search time (not as much as a full block, but still), and if we find nothing to steal around us, there is a high chance there is nothing to steal at all. Better still would be to check whether there are idle CPUs around us, since if they are idle it means they already did their own search and were not assigned any work after that; but this would require more significant changes to cpu_search() and could not be completely strict due to likely races.
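
As a rough, hypothetical sketch of that idea (the level constants and the 10ms figure below are illustrative, not from any actual patch): an idle CPU would always be allowed to search nearby topology levels, and only occasionally widen the search to remote sockets.

```c
/*
 * Hypothetical sketch: steal searches among SMT siblings and cores
 * sharing a cache are always allowed, while cross-socket searches are
 * permitted only once per interval.  The level values are illustrative.
 */
#include <sys/types.h>
#include <sys/time.h>

#define	STEAL_LEVEL_SMT		1	/* hardware threads of this core */
#define	STEAL_LEVEL_CACHE	2	/* cores sharing a last-level cache */
#define	STEAL_LEVEL_SOCKET	3	/* everything else */

/* Decide how far up the topology an idle CPU may search right now. */
static int
steal_search_level(sbintime_t now, sbintime_t *last_wide_search)
{
	/* Cheap, nearby neighbors are always fair game. */
	if (now - *last_wide_search < 10 * SBT_1MS)
		return (STEAL_LEVEL_CACHE);
	/* Occasionally allow one expensive cross-socket search. */
	*last_wide_search = now;
	return (STEAL_LEVEL_SOCKET);
}
```

The returned level could then be used as a dynamic cutoff for how far up the cpu_group hierarchy the steal search walks.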

In D24645#542519, @mav wrote:

I have thought about something like this many times. It would be nice to skip this often-useless work, but each time I am stopped by the worry that it may cause CPU under-utilization on systems with a load average of about 1 per SMT thread. The 10ms interval you mentioned sounds quite large to me, and honestly I can imagine some synchronous, irregular load patterns where any chosen interval will cause a pessimization, dramatically increasing execution latency.

I don't propose this as a general solution, which is why I left it disabled by default. However, for a Netflix web workload, we don't see such patterns. I'm not sure I'd want this setting enabled on my desktop though!

Alternatively, I have been thinking about not blocking stealing completely, but limiting it to only some levels of the CPU topology, somewhere between the threads of one core and one socket, so that we check neighboring threads often (it should be cheap) while checking other sockets only occasionally. That should still significantly reduce search time (not as much as a full block, but still), and if we find nothing to steal around us, there is a high chance there is nothing to steal at all. Better still would be to check whether there are idle CPUs around us, since if they are idle it means they already did their own search and were not assigned any work after that; but this would require more significant changes to cpu_search() and could not be completely strict due to likely races.

Don't we already have some of this via kern.sched.trysteal_limit? I guess you are proposing adjusting trysteal_limit and looking frequently at cores that share L2, and less frequently the farther out you go? I like that idea in general, but I'm not confident in my ability to implement or tune it.

Do you object to what I have now? I can keep it as a local patch at Netflix, but I suspect it may be helpful for somebody else, which is why I want to get it upstream.

Don't we already have some of this via kern.sched.trysteal_limit? I guess you are proposing adjusting trysteal_limit and looking frequently at cores that share L2, and less frequently the farther out you go? I like that idea in general, but I'm not confident in my ability to implement or tune it.

Yes, a form of trysteal_limit, but based on time. The nice side of soft thresholds is that they are more forgiving of mistuning than hard ones.

Do you object to what I have now? I can keep it as a local patch at Netflix, but I suspect it may be helpful for somebody else, which is why I want to get it upstream.

I do not strictly object, but I always try to avoid custom tunables and workarounds as much as I can, preferring a generic solution enabled by default. Also, I can't say that I like getsbinuptime() and a 64-bit multiplication in a hot path, but it could be worse.

Also, let's see what others think about it.

First off, lots of good discussion here. I think this is the start of a good approach and I support committing it disabled.

What I would do is use the group topology share level as a modifier so we cap the time by level: you would steal often from HTT neighbors, less often from cache neighbors, and much less often from socket neighbors. This would nicely scale the work. You might never limit stealing from HTT neighbors, because that's your own cache and you're just burning a little CPU, while you might want tens of milliseconds for a socket-level transfer on a modern machine.

I think this would make a lot of sense. We also need to know where the latency/CPU-cost trade-off is. Can you tell us what the impact is if you make it 1ms?

FWIW, when I first wrote stealing, the simplest benchmark that demonstrated its value was buildstone. In fact, before it, ULE pessimized builds.
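
To make the per-level cap suggested above concrete, here is a hypothetical illustration (the intervals and level names are made up for the example, not tuned values): each topology share level carries its own minimum interval between steal attempts.

```c
/*
 * Hypothetical sketch: each topology share level gets its own minimum
 * interval between steal attempts, so SMT neighbors are never limited
 * while cross-socket steals happen at most every few tens of ms.
 * The interval values below are illustrative, not tuned.
 */
#include <sys/types.h>
#include <sys/time.h>

enum steal_level { SL_SMT = 0, SL_CACHE, SL_SOCKET, SL_NLEVELS };

/* Minimum time between steal attempts at each level; 0 = never limited. */
static const sbintime_t steal_min_interval[SL_NLEVELS] = {
	[SL_SMT]    = 0,		/* shares our caches: effectively free */
	[SL_CACHE]  = 1 * SBT_1MS,	/* same LLC: fairly cheap */
	[SL_SOCKET] = 20 * SBT_1MS,	/* cross-socket: tens of milliseconds */
};

/* Return non-zero if a steal attempt at this level is allowed now. */
static int
steal_allowed(enum steal_level level, sbintime_t now,
    sbintime_t last[SL_NLEVELS])
{
	if (now - last[level] < steal_min_interval[level])
		return (0);
	last[level] = now;
	return (1);
}
```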

Is this still an issue? I think some work by mav@ managed to depessimize some of cpu search in 2668bb2add8d11c56524ce9014b510412f8f6aa9 and aefe0a8c32d370f2fdd0d0771eb59f8845beda17, perhaps to the point where the above is moot.

If stealing right now reduces your performance, I think a rate limit on it is only papering over the real problem.

In D24645#889225, @mjg wrote:

Is this still an issue? I think some work by mav@ managed to depessimize some of cpu search in 2668bb2add8d11c56524ce9014b510412f8f6aa9 and aefe0a8c32d370f2fdd0d0771eb59f8845beda17, perhaps to the point where the above is moot.

If stealing right now reduces your performance, I think a rate limit on it is only papering over the real problem.

I'm not sure; I'll test with and without it later in the week.

In D24645#889225, @mjg wrote:

Is this still an issue? I think some work by mav@ managed to depessimize some of cpu search in 2668bb2add8d11c56524ce9014b510412f8f6aa9 and aefe0a8c32d370f2fdd0d0771eb59f8845beda17, perhaps to the point where the above is moot.

If stealing right now reduces your performance, I think a rate limit on it is only papering over the real problem.

I'd forgotten that we disable idle stealing entirely right now, so the patch is moot for us. I'm happy to abandon it.