
sched_ule: Re-implement stealing on top of runq common-code
Needs Review · Public

Authored by olce on Mon, May 27, 10:27 PM.

Details

Reviewers
jhb
markj
mav
jeff
Summary

Please see the overview of the project at D45393.

Stop using internal knowledge of runqueues, and remove duplicated code
(which sometimes existed in unnecessarily different variants).

runq_steal_from(): Suppress first thread special case

This special case, introduced as early as the "ULE 3.0" commit
(ae7a6b38d53f, r171482, from July 2007), has no apparent justification.
All the reasons we can second-guess are dubious at best. In the absence
of objections, let's just remove this twist, which has caused bugs in
the past.
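
For illustration, here is a simplified sketch of the special case and of what removing it amounts to. This is not the actual diff: the real runq_steal_from() walks the runq status bitmap across all priority levels, the new code goes through the runq common-code introduced in this series, and the *_old/*_new names below are invented for the sketch.

    /*
     * Simplified sketch, not the actual diff: only a single queue head is
     * scanned here, and the *_old/*_new names are invented for illustration.
     */
    static struct thread *
    runq_steal_from_old(struct rqhead *rqh, int cpu)
    {
            struct thread *td, *first = NULL;

            TAILQ_FOREACH(td, rqh, td_runq) {
                    if (first == NULL) {
                            /* Special case: remember, but skip, the first thread. */
                            first = td;
                            continue;
                    }
                    if (THREAD_CAN_MIGRATE(td) && THREAD_CAN_SCHED(td, cpu))
                            return (td);
            }
            /* Fall back to the first thread if nothing else qualified. */
            if (first != NULL && THREAD_CAN_MIGRATE(first) &&
                THREAD_CAN_SCHED(first, cpu))
                    return (first);
            return (NULL);
    }

    static struct thread *
    runq_steal_from_new(struct rqhead *rqh, int cpu)
    {
            struct thread *td;

            /* No special case: take the first thread that may migrate. */
            TAILQ_FOREACH(td, rqh, td_runq)
                    if (THREAD_CAN_MIGRATE(td) && THREAD_CAN_SCHED(td, cpu))
                            return (td);
            return (NULL);
    }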

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Skipped
Unit
Tests Skipped
Build Status
Buildable 57941
Build 54829: arc lint + arc unit

Event Timeline

olce requested review of this revision on Mon, May 27, 10:27 PM.

I suspect the first thread was skipped to avoid stealing a thread that was just scheduled to a CPU but had not yet been able to run.

This special case, introduced as early as the "ULE 3.0" commit
(ae7a6b38d53f, r171482, from July 2007), has no apparent justification.
All the reasons we can second-guess are dubious at best. In the absence
of objections, let's just remove this twist, which has caused bugs in
the past.

The point was to move threads that are least likely to benefit from affinity because they are unlikely to run soon enough to take advantage of it. We leave a thread that may execute soon. I would want to see this patchset benchmarked together with a wide array of tests to make sure there's no regression in the totality of it. I don't feel particularly strongly about this case but taken together there is some chance of unintended consequences.

In D45388#1035721, @mav wrote:

I suspect the first thread was skipped to avoid stealing a thread that was just scheduled to a CPU but had not yet been able to run.

I don't think that's a possibility with the current code, right?

The point was to move threads that are least likely to benefit from affinity because they are unlikely to run soon enough to take advantage of it. We leave a thread that may execute soon. I would want to see this patchset benchmarked together with a wide array of tests to make sure there's no regression in the totality of it. I don't feel particularly strongly about this case but taken together there is some chance of unintended consequences.

I agree this may improve affinity in some cases, but at the same time we don't really know when the next thread on the queue is to run. Not stealing in this case also amounts to slightly violating the expected execution ordering and fairness.

As for benchmarking, of course this patch set needs wider testing. I'll need help with that, since I don't have that many different machines to test it on, and moreover it has to undergo testing with a variety of different workloads. I plan to contact olivier@ to see if he can test the patch set (perhaps slightly amended) at Netflix.

It does steal in that case, though; it will just prefer the second thread over the first. We don't permanently skip the first thread.

If we have a pervasive cpu-hz-granular clock, we can do much better with affinity than I could when I wrote the scheduler using a 1 kHz or 128 Hz clock. These periods are too long for any real measure of the impact of affinity. This would be a worthwhile set of improvements. Precise time counting could tell us whether it has been longer than X us since the thread was scheduled, which is likely a better filter for whether affinity is valid.
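
A rough sketch of that idea, purely hypothetical and not part of this patch set (affinity_still_valid(), ts_lastrun, and the 50us window are invented names and values): with a fine-grained clock such as sbinuptime(), the filter could look like the following.

    /*
     * Hypothetical sketch only: decide whether cache affinity is still worth
     * honoring based on a fine-grained timestamp of when the thread last ran.
     * "ts_lastrun" and the 50us window are invented, not existing fields.
     */
    static bool
    affinity_still_valid(sbintime_t ts_lastrun)
    {
            const sbintime_t affinity_window = 50 * SBT_1US;

            return (sbinuptime() - ts_lastrun < affinity_window);
    }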

I agree this may improve affinity in some cases, but at the same time we don't really know when the next thread on the queue is to run. Not stealing in this case also amounts to slightly violating the expected execution ordering and fairness.

It does steal in that case, though; it will just prefer the second thread over the first. We don't permanently skip the first thread.

Yes, I'm aware. My sentence above should have read "(snip) Not stealing *it* in this case (snip)"; sorry for the confusion. In practice, however, barring explicit cpuset assignment, the first thread is always skipped, since steal_thresh is 2.

If we have a pervasive cpu-hz-granular clock, we can do much better with affinity than I could when I wrote the scheduler using a 1 kHz or 128 Hz clock. These periods are too long for any real measure of the impact of affinity. This would be a worthwhile set of improvements. Precise time counting could tell us whether it has been longer than X us since the thread was scheduled, which is likely a better filter for whether affinity is valid.

Yes, but I doubt it's enough (it may cause odd decisions in some cases). A priori, I think the following refinements are likely to be necessary to avoid odd behaviors:

  1. Whether the new thread is from the same process as the previous one, or shares the same vmspace (a minimal sketch follows this list).
  2. Whether one thread has a tendency to thrash the cache a lot, which could perhaps be measured with hardware performance counters (to eventually compute the percentage of cache thrashed per unit of time, and use it in conjunction with the runtime), or by computing performance statistics for the threads scheduled next, soon after they start running again (but I fear this second mechanism may not be practical).
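
As a hypothetical sketch for point 1 only (shares_address_space() is an invented helper, not part of this patch set): threads running in the same address space could be treated as retaining each other's cache affinity.

    /*
     * Hypothetical helper for point 1: the two threads share cache-relevant
     * state if they run in the same address space (same process, or
     * processes sharing a vmspace).  Invented name, not part of this patch.
     */
    static bool
    shares_address_space(const struct thread *prev, const struct thread *next)
    {
            return (prev->td_proc->p_vmspace == next->td_proc->p_vmspace);
    }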

And, selecting a thread for affinity in violation of the "normal ordering" (which is not really defined currently) should probably lead to penalizing it (e.g., by adding to its elapsed runtime after the run). But this is probably going too far for the current discussion.