
Tue, Jun 4

jeff added a comment to D45396: vm_radix: define vm_radix_insert_lookup_lt and use in vm_page_rename.

I guess vm_page_insert() could be improved similarly, but that's more work.

Yes. It's not a lot of work; I just didn't want to get ahead of myself. I can put up a patch later this week when I get some more time.

IMO a better long-term direction there is to remove the memq (insertion into which is the purpose of looking up mpred in the first place) and use the radix tree for iteration instead, but that's a separate topic.

Yes, I've thought about that a little but haven't explored it thoroughly. Honestly, we may want to go that direction for bufs too. Maintaining tailq linkage can be costly because the neighbors may be cache-cold. Privately we have added cache-line prefetches in certain places, which are surprisingly effective, but I don't know whether we have an appetite for that sort of thing in tree.
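
Not the private patch mentioned above, just a toy userland illustration of the idea, using the same sys/queue.h TAILQ macros the memq uses; the struct and function names here are made up for the example:

```
#include <sys/queue.h>
#include <stdlib.h>

struct item {
	TAILQ_ENTRY(item) link;
	int payload;
};
TAILQ_HEAD(itemq, item);

static void
insert_after_with_prefetch(struct itemq *q, struct item *pred, struct item *n)
{
	if (pred == NULL) {
		TAILQ_INSERT_HEAD(q, n, link);
		return;
	}
	/*
	 * Hint the predecessor's linkage into cache before the insert
	 * dereferences it; in real code the prefetch would be issued
	 * earlier (e.g. right after the predecessor lookup) so the miss
	 * latency overlaps with other work.
	 */
	__builtin_prefetch(&pred->link, 1);	/* 1 == prefetch for write */
	TAILQ_INSERT_AFTER(q, pred, n, link);
}

int
main(void)
{
	struct itemq q = TAILQ_HEAD_INITIALIZER(q);
	struct item *a = calloc(1, sizeof(*a)), *b = calloc(1, sizeof(*b));

	TAILQ_INSERT_HEAD(&q, a, link);
	insert_after_with_prefetch(&q, a, b);
	return (0);
}
```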

If we actually want to take steps toward removing the linkage, we may need to provide better iterator primitives or at least conventions for pctrie, as otherwise scans may be more costly.
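
For reference, a rough sketch of what such a scan looks like with today's primitives, assuming the current vm_radix_lookup_ge() interface and the vm_object layout in sys/vm (this is not proposed code). Each step restarts the lookup from the tree root, which is exactly the extra cost that better iterator primitives or conventions would address:

```
/* Kernel-context sketch; lives in sys/vm (vm_object.h, vm_page.h, vm_radix.h). */
static void
visit_object_pages(vm_object_t object)
{
	vm_page_t m;
	vm_pindex_t pindex;

	VM_OBJECT_ASSERT_LOCKED(object);
	for (pindex = 0;
	    (m = vm_radix_lookup_ge(&object->rtree, pindex)) != NULL;
	    pindex = m->pindex + 1) {
		/* ... process m, like a TAILQ_FOREACH over the memq would ... */
	}
}
```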

Tue, Jun 4, 4:37 PM

Mon, Jun 3

jeff added a comment to D45390: runq/sched: Switch to 256 distinct levels.

The runq index only moves at a rate slower than once per tick when the system is very overloaded. This is how ULE implements priority decay. The difference in priority determines how many slices the lower- (better-) priority task gets for each slice the worse-priority task gets.

You're absolutely right. I already knew that, but I was so focused on the offset part that I forgot to consider how the head (the base index) moves (and indeed, unless the system is under high load, the head moves at tick frequency).

Two thoughts on that:

  1. It's easy to counterbalance the effect of the higher number of queues by just incrementing the head by 4 (the old RQ_PPQ) at each tick. I'll actually do that in the commit that switches to 256 queues, and then change it to 1 in a separate commit (unless you prefer to wait for that part), just so it is easier to bisect later. If you're OK with the second commit, this won't change the revision here (since both commits would be grouped into it anyway).
  2. I've always found niceness in FreeBSD to be way too weak, so I wouldn't be against giving it a stronger (and possibly configurable) effect. That said, the side effect of the change here is still far from enough from my POV. When using nice -n 20, I would really want the process to get at most 10% of the CPU time that processes with a nice value of 0 get; I'd even argue for going as far as 1%. My expectation is that nice -n 20 should be almost the same as using an idle priority. Another possible problem with our implementation of nice levels is that they are not logarithmic, which seems inconsistent with POSIX specifying that the nice() interface takes increments. That hints at making increments have the same relative effect regardless of the value they are applied to, which by composition leads to a logarithmic scale (a quick numerical check follows this list). It's what Linux does today, IIRC. So I think there is work to do in this area as well. It may make sense to avoid changing the nice behavior in this round of changes and save it for when we make the other ones. Grouping those could be... nicer on users.
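
The numerical check referenced in point 2, as a standalone snippet. The 1.25-per-increment factor is an assumption that roughly matches the ratio Linux's nice-to-weight table encodes; nothing here reflects what ULE currently does:

```
#include <math.h>
#include <stdio.h>

int
main(void)
{
	const double r = 1.25;	/* assumed per-increment share ratio */

	for (int n = 0; n <= 20; n += 5)
		printf("nice %2d -> share relative to nice 0: %.4f\n",
		    n, 1.0 / pow(r, n));
	/* nice 20 comes out around 1/87, i.e. in the ~1% ballpark. */
	return (0);
}
```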

This won't affect just nice, but it will be most evident with it. Run two CPU hogs that never terminate and consume 100% CPU on the same core with cpuset. Let one be nice +20. Report the % CPU consumed by each before and after this patch set. I believe the nice +20 process should get 1/4 of the CPU it was allotted before.

I'll do the test just to be sure, but I'm already convinced that's exactly what I'm going to observe.
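
For whoever runs the measurement, a rough userland harness for the experiment described above; the core number, measurement window, and ps(1) invocation are arbitrary illustration choices, not part of the patch set:

```
#include <sys/param.h>
#include <sys/cpuset.h>
#include <sys/time.h>
#include <sys/resource.h>
#include <err.h>
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

static void
hog(int nice_val)
{
	cpuset_t mask;

	CPU_ZERO(&mask);
	CPU_SET(0, &mask);		/* pin to core 0 (arbitrary) */
	if (cpuset_setaffinity(CPU_LEVEL_WHICH, CPU_WHICH_PID, -1,
	    sizeof(mask), &mask) != 0)
		err(1, "cpuset_setaffinity");
	if (setpriority(PRIO_PROCESS, 0, nice_val) != 0)
		err(1, "setpriority");
	for (;;)
		;			/* burn CPU forever */
}

int
main(void)
{
	int nice_vals[2] = { 0, 20 };
	pid_t pids[2];

	for (int i = 0; i < 2; i++) {
		if ((pids[i] = fork()) == -1)
			err(1, "fork");
		if (pids[i] == 0)
			hog(nice_vals[i]);
	}
	printf("compare with: ps -o pid,nice,%%cpu,time -p %d,%d\n",
	    (int)pids[0], (int)pids[1]);
	sleep(60);			/* measurement window */
	for (int i = 0; i < 2; i++)
		kill(pids[i], SIGTERM);
	return (0);
}
```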

Mon, Jun 3, 10:09 PM
jeff added a comment to D45388: sched_ule: Re-implement stealing on top of runq common-code.
In D45388#1035721, @mav wrote:

I suspect that the first thread was skipped to avoid stealing a thread that was just scheduled to a CPU but has not yet been able to run.

I don't think that's a possibility with the current code, right?

The point was to move threads that are least likely to benefit from affinity because they are unlikely to run soon enough to take advantage of it; we leave alone a thread that may execute soon. I would want to see this patch set benchmarked with a wide array of tests to make sure there's no regression in the totality of it. I don't feel particularly strongly about this case, but taken together there is some chance of unintended consequences.

I agree this may improve affinity in some cases, but at the same time we don't really know when the next thread on the queue will actually run. Not stealing in this case also amounts to slightly violating the expected execution ordering and fairness.

As for benchmarking, of course this patch set needs wider benchmarking. I'll need help with that, since I don't have that many different machines to test it on, and it also has to undergo testing with a variety of workloads. I plan to contact olivier@ to see if he can test the patch set (perhaps slightly amended) at Netflix.

Mon, Jun 3, 8:59 PM

Fri, May 31

jeff added a comment to D45388: sched_ule: Re-implement stealing on top of runq common-code.

This special case, introduced as early as the "ULE 3.0" commit
(ae7a6b38d53f, r171482, from July 2007), has no apparent justification.
All the reasons we can second-guess are dubious at best. In the absence
of objections, let's just remove this twist, which has caused bugs in the past.

Fri, May 31, 5:50 AM
jeff added a comment to D45390: runq/sched: Switch to 256 distinct levels.

There is one thing the original code author intended that we will have to validate. The relative impact of nice levels depends on their distance in the runq. ridx/idx (I'm not sure why they were renamed in one diff) march forward at a fixed rate in real time. Changing the number of queues and priorities changes that rate. I believe it will have the effect of increasing the impact of nice. We may need to change the way nice values are scaled to priorities to compensate.

The rename is to clear up a possible confusion. After this change, these fields of struct tdq no longer store the absolute index of a queue, but rather the offset of the "start" queue within the range assigned to the timesharing selection policy. Moreover, I found that a single letter of difference sometimes impairs quick reading, so I chose to be more explicit with _deq.

I don't think the relative impact of nice levels is changed. tdq_ts_deq_off (the old tdq_ridx) is not incremented by sched_clock() unless the queue it points to is empty, so the march forward, AFAIU, doesn't happen at a fixed rate but rather depends on the time spent executing all queued threads (at most one quantum times the number of queued threads, and usually less). Moreover, since the nice value is used in the computation of the *priority* to assign to a thread, that priority doesn't depend on the number of queues (as it should). Obviously, the final index of the chosen queue now changes (that's the point of this diff), but the only difference I can see is that the scheduler will now always run threads with lower numerical priority sooner than those with higher ones, even if the priority difference is less than 4. The relative execution of these "clusters" of threads is unchanged. Do you agree?
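
To make that mapping change concrete, a simplified illustration (it ignores the circular offset applied to the timesharing range): before the change, RQ_PPQ = 4 priority levels share one of 64 queues, so priorities that differ by less than 4 can land in the same queue and are then picked in FIFO order; after it, each of the 256 priority levels gets its own queue:

```
#define RQ_PPQ_OLD	4

static inline int
runq_queue_old(int pri)
{
	return (pri / RQ_PPQ_OLD);	/* e.g. priorities 100..103 all map to queue 25 */
}

static inline int
runq_queue_new(int pri)
{
	return (pri);			/* e.g. 100 -> queue 100, 103 -> queue 103 */
}
```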

Fri, May 31, 5:43 AM

Thu, May 30

jeff added a comment to D45390: runq/sched: Switch to 256 distinct levels.
In D45390#1035772, @mav wrote:

Differences of less than 4 (RQ_PPQ) are insignificant and are simply removed. No functional change (intended).

This is surely the easiest (least invasive) answer, but does that make it the right answer? Simply doing nothing here could throw away the original code author's ideas; there could be some rationale that you are dropping. It would be nice to look at what it was, unless you've done that already.

Thu, May 30, 1:41 AM