In D45396#1037489, @rlibby wrote:
In D45396#1037485, @markj wrote:
I guess vm_page_insert() could be improved similarly, but that's more work.
Yes. It's not a lot of work, I just didn't want to get ahead of myself. I can put up a patch later this week when I get some more time.
IMO a better long-term direction there is to remove the memq (insertion into which is the purpose of looking up mpred in the first place) and use the radix tree for iteration instead, but that's a separate topic.
Yes, I've thought about that a little but haven't explored it thoroughly. Honestly, we may want to go that direction for bufs too. Maintaining tailq linkage can be costly, as the neighbors may be cache cold. Privately we have added cache-line prefetches in certain places, which are surprisingly effective, but I don't know if we have an appetite for that sort of thing in tree.
If we actually want to take steps toward removing the linkage, we may need to provide better iterator primitives or at least conventions for pctrie, as otherwise scans may be more costly.
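Rough sketch only of what trie-based iteration might look like, assuming an interface along the lines of the existing vm_radix_lookup_ge(); this is not a proposed patch, just the shape of the idea:

```
static void
object_scan_by_radix(vm_object_t object)
{
	vm_page_t m;
	vm_pindex_t pindex;

	VM_OBJECT_ASSERT_LOCKED(object);
	/* Walk the resident pages via the radix tree instead of the memq. */
	for (pindex = 0;
	    (m = vm_radix_lookup_ge(&object->rtree, pindex)) != NULL;
	    pindex = m->pindex + 1) {
		/* ... visit page 'm' ... */
	}
}
```

An iterator primitive could hide the repeated lookups and remember the last visited leaf, which is what would keep such scans from being more costly than the tailq walk.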
Jun 4 2024
Jun 3 2024
In D45390#1037322, @olce wrote:
In D45390#1036063, @jeff wrote:
The runq index only moves at a rate slower than once per-tick when the system is very overloaded. This is how ULE implements priority decay. The difference in priority determines how many slices the lower (better) priority task will get for each slice the worse priority task will get.
You're absolutely right. I already knew that, but was so focused on the offset part that I forgot to consider how the head (the base index) moves (and indeed, unless the system is under high load, the head moves at tick frequency).
Two thoughts on that:
- It's easy to counter-balance the effect of the higher number of queues by simply incrementing the head by 4 (the old RQ_PPQ) at each tick. I'll do that in the commit that switches to 256 queues, and then change it to 1 in a separate commit (unless you prefer to wait for that part), just so it is easier to bisect later. If you're OK with the second commit, this won't change the revision here (since both commits would be grouped into it anyway).
- I've always found niceness in FreeBSD to be way too weak, so I wouldn't be against giving it a stronger (and possibly configurable) effect. That said, the side effect of the change here is still far from enough from my point of view. When using nice -n 20, I would really want the process to get at most 10% of the CPU time that processes with a nice value of 0 get; I'd even argue for going as far as 1%. My expectation is that nice -n 20 should behave almost like an idle priority. Another possible problem with our implementation of nice levels is that they are not logarithmic, which seems inconsistent with POSIX specifying that the nice() interface takes increments. That hints at making increments have the same relative effect regardless of the value they are applied to, which by composition leads to a logarithmic scale; it's what Linux does today, IIRC (see the sketch at the end of this exchange). So I think there is work to do in this area as well. It may make sense to avoid changing the nice behavior in this round of changes and save it for when we make the other ones. Grouping those could be... nicer on users.
This won't just affect nice but it will be most evident with it. Run two cpu hogs that never terminate and consume 100% cpu on the same core with cpuset. Let one be nice +20. Report the % cpu consumed by each before and after this patch set. I believe the nice +20 process should get 1/4 the cpu it was allotted before.
I'll do the test just to be sure but I'm already convinced this is exactly what I'm going to observe.
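To make the logarithmic-scale idea above concrete, here is a tiny standalone sketch (not ULE code; the 1.25 factor per nice step is an assumption, roughly the value Linux is usually described as using): each step multiplies the weight by a constant, so with two CPU hogs pinned to the same core, the nice +20 one ends up with on the order of 1% of the CPU.

```
#include <math.h>
#include <stdio.h>

/*
 * Sketch only: logarithmic nice weights.  Every nice step scales the
 * weight by a fixed factor, so equal increments always have the same
 * relative effect.  The 1.25 factor is an assumption, not ULE's
 * current behaviour.
 */
static double
nice_weight(int nice)
{
	return (1024.0 / pow(1.25, nice));
}

int
main(void)
{
	double w0 = nice_weight(0), w20 = nice_weight(20);

	/* Two never-terminating CPU hogs pinned to the same core. */
	printf("nice  0: %.1f%%\n", 100.0 * w0 / (w0 + w20));
	printf("nice 20: %.1f%%\n", 100.0 * w20 / (w0 + w20));
	return (0);
}
```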
In D45388#1037293, @olce wrote:
In D45388#1035721, @mav wrote:
I suspect that first thread was skipped to avoid stealing a thread that was just scheduled to a CPU, but was unable to run yet.
I don't think that's a possibility with the current code, right?
In D45388#1036064, @jeff wrote:The point was to move threads that are least likely to benefit from affinity because they are unlikely to run soon enough to take advantage of it. We leave a thread that may execute soon. I would want to see this patchset benchmarked together with a wide array of tests to make sure there's no regression in the totality of it. I don't feel particularly strongly about this case but taken together there is some chance of unintended consequences.
I agree this may improve affinity in some cases, but at the same time we don't really know when the next thread on the queue is to run. Not stealing in this case also amounts to slightly violating the expected execution ordering and fairness.
As for benchmarking, of course this patchset needs wider benchmarking. I'll need help for that since I don't have that many different machines to test it on, and moreover it has to undergo testing with a variety of different workloads. I plan to contact olivier@ to see if he can test that patch set (perhaps slightly amended) at Netflix.
May 31 2024
This special case, introduced as early as the "ULE 3.0" commit
(ae7a6b38d53f, r171482, from July 2007), has no apparent justification.
All the reasons we can second-guess are dubious at best. In the absence of
objections, let's just remove this twist, which has caused bugs in the past.
In D45390#1035982, @olce wrote:
In D45390#1035791, @jeff wrote:
There is one thing the original code author intended that we will have to validate. The relative impact of nice levels depends on their distance in the runq. ridx/idx (I'm not sure why they were renamed in one diff) march forward at a fixed rate in real time. Changing the number of queues and priorities changes that rate. I believe it will have the effect of increasing the impact of nice. We may need to change the way nice values are scaled to priorities to compensate.
The rename is to clear up a possible confusion. After this change, these fields of struct tdq no longer store the absolute index of a queue, but rather the offset of the "start" queue within the range assigned to the timesharing selection policy. Moreover, I found that a single letter of difference sometimes impairs quick reading, so I chose to be more explicit with _deq.
I don't think the relative impact of nice levels is changed. tdq_ts_deq_off (the old tdq_ridx) is not incremented by sched_clock() unless the queue it points to is empty, so the march forward, AFAIU, doesn't happen at a fixed rate but rather depends on the time spent executing all queued threads (at most the quantum times the number of threads, but usually less). Moreover, since the nice value is used in the computation of the *priority* to assign to a thread, that priority doesn't depend on the number of queues (as it should). Obviously, the final index of the chosen queue now changes (that's the point of this diff), but the only difference I can see is that the scheduler will now always run threads with a lower numerical priority sooner than those with a higher one, even if the priority difference is less than 4. Relative execution between these "clusters" of threads is unchanged. Do you agree?
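For reference, a deliberately simplified illustration of the mapping being discussed (not the actual sched_ule.c code; names and the queue count are placeholders): the priority picks an offset from a moving base index on a circular array of queues, and dequeueing starts from a trailing offset that only advances once the queue it points to has drained.

```
#define	NQUEUES	64		/* placeholder, not RQ_NQS */

struct ts_runq {
	int	len[NQUEUES];	/* stand-in for per-queue thread counts */
	int	idx;		/* insertion base (the old tdq_idx) */
	int	deq_off;	/* dequeue offset (the old tdq_ridx) */
};

/* A worse (nicer) priority lands farther behind the draining point. */
static int
ts_insert_queue(struct ts_runq *rq, int prio_offset)
{
	return ((rq->idx + prio_offset) % NQUEUES);
}

static void
ts_clock_tick(struct ts_runq *rq)
{
	/* The dequeue offset only advances once its queue has drained. */
	if (rq->deq_off != rq->idx && rq->len[rq->deq_off] == 0)
		rq->deq_off = (rq->deq_off + 1) % NQUEUES;
	/* The base index advances with ticks but never laps the dequeue offset. */
	if ((rq->idx + 1) % NQUEUES != rq->deq_off)
		rq->idx = (rq->idx + 1) % NQUEUES;
}
```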
May 30 2024
In D45390#1035772, @mav wrote:
Differences of less than 4 (RQ_PPQ) are insignificant and are simply removed. No functional change (intended).
This is surely the easiest (least invasive) answer, but does that make it the right one? Simply doing nothing here would preserve whatever the original code author had in mind. There could be some rationale that you are dropping. It would be nice to look at what it was, unless you've done so already.
Apr 19 2021
Dec 29 2020
I feel that I made the API overly complicated because I was trying to unify an API for administration and for programming. This example is more complex than necessary if you are simply trying to change the set for the process. A non-anonymous or numeric set is only required if you wish to refer to it later. Think of it more like a process group where you want to be able to apply that constraint to multiple things at once. Normal programs simply exist within their current process group and only in specific circumstances do you create a new one. The numbered sets are this 'group'. They exist in the middle of the hierarchy with the root set above them giving an absolute limit on available CPUs that may be from jail or the actual system. Below them exists anonymous sets for programs that have constrained themselves to a subset of the numbered set.
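A minimal userspace sketch of the two cases, using the cpuset(2) family (error handling trimmed; assume CPUs 0-1 exist): a program that only wants to constrain itself modifies its anonymous set, while a numbered set is only created when it will be referred to later.

```
#include <sys/param.h>
#include <sys/cpuset.h>
#include <err.h>

int
main(void)
{
	cpuset_t mask;
	cpusetid_t setid;

	CPU_ZERO(&mask);
	CPU_SET(0, &mask);
	CPU_SET(1, &mask);

	/*
	 * Common case: constrain only ourselves.  This modifies the
	 * process's anonymous set; no numbered set is involved.
	 */
	if (cpuset_setaffinity(CPU_LEVEL_WHICH, CPU_WHICH_PID, -1,
	    sizeof(mask), &mask) != 0)
		err(1, "cpuset_setaffinity");

	/*
	 * "Process group"-like case: create a numbered set so the
	 * constraint can be referred to and reused later (e.g. via
	 * cpuset(1) or cpuset_setid()).
	 */
	if (cpuset(&setid) != 0)
		err(1, "cpuset");

	return (0);
}
```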
Dec 13 2020
Sep 2 2020
hilarious, thank you.
Aug 12 2020
Aug 10 2020
We should set reference bits or similar so that the LRU is updated lazily.
Jun 21 2020
Jun 20 2020
Jun 9 2020
May 5 2020
I approve of the approach.
May 1 2020
First off, lots of good discussion here. I think this is the start of a good approach and I support committing it disabled.
Apr 27 2020
Apr 21 2020
Have you thought about whether there are any side-effects in swap behavior from using the same object? Might we run into clustering behavior?
Mar 16 2020
I think there may be a bug in this: busy_sleep may not properly wait for an exclusive lock if it is sleeping in order to zero a page while the page is sbusy and the caller requested sbusy after zeroing. No caller currently does this, so it is not going to cause problems. There are simply too many flags, and that complicates things, but I do not see an easy way to drop many of them.
Mar 11 2020
One other thing: I have submitted patches to drm to address these changes. I will have to bump FreeBSD_version and fix one extra case in drm-legacy, because it is not written in a way that lets it use vm_page_busy_acquire().
Mar 10 2020
I agree with Mark. CK_LIST is just a copy of queue.h with ck barriers added. We want stronger assertions tied to the smr interface. We have wide arm64 platforms available to us for testing now. I think the risk of incorrect barriers is minimal if we are conservative. The only real question is how often to require an acquire load.
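For context, the pattern under discussion looks roughly like this (a hand-wavy sketch, not any committed code; the structure and field names are made up, and the include set assumes the kernel's CK wrappers): a lockless reader walks the list inside an SMR read section and relies on the list primitives to publish nodes with the right barriers.

```
#include <sys/param.h>
#include <sys/systm.h>
#include <sys/ck.h>
#include <sys/smr.h>

struct item {
	int			key;
	CK_LIST_ENTRY(item)	link;
};

CK_LIST_HEAD(item_list, item);

/* Lockless lookup: the node reference is only stable inside the section. */
static bool
item_contains(smr_t smr, struct item_list *head, int key)
{
	struct item *it;
	bool found;

	found = false;
	smr_enter(smr);
	CK_LIST_FOREACH(it, head, link) {
		if (it->key == key) {
			found = true;
			break;
		}
	}
	smr_exit(smr);
	return (found);
}
```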
Mar 9 2020
Mar 6 2020
I am ok with this but eventually it likely should not be shared busy.
Would you not prefer to use an invalid value rather than a bit to hold the free state? I.e., instead of VHOLD_NO_SMR you would have something like VHOLD_DEAD 0xffffffff. Some of your asserts would have to be adjusted. The first ref should swap dead with 1.
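Something along these lines, as a standalone sketch of the sentinel idea (not the actual vnode code; the names, the C11 atomics, and the exact value are placeholders):

```
#include <stdatomic.h>
#include <stdbool.h>

#define	HOLD_DEAD	0xffffffffu	/* hypothetical sentinel, stands in for VHOLD_DEAD */

/* Lockless hold: refuses to take a reference on a "dead" count. */
static bool
hold_ref_smr(_Atomic unsigned int *holdcnt)
{
	unsigned int old;

	old = atomic_load_explicit(holdcnt, memory_order_relaxed);
	for (;;) {
		if (old == HOLD_DEAD)
			return (false);		/* object is being freed */
		if (atomic_compare_exchange_weak(holdcnt, &old, old + 1))
			return (true);
	}
}

/* The first locked reference replaces the sentinel with a count of 1. */
static void
hold_first_ref(_Atomic unsigned int *holdcnt)
{
	atomic_store_explicit(holdcnt, 1, memory_order_release);
}
```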
I don't mind adding accessors after the fact. It will help us continue to make them more natural and less obtrusive if we can make this work with them.
So in general the naming scheme is:
- vn_* or v* in older code for vnode-specific routines.
- vop_* for routines specific to ops.
- vfs_* for global or mount-level routines.
Mar 5 2020
Overall this is nice and a better first example of smr than radix.
Mar 1 2020
Also, please put a comment in the pwd structure describing how the synchronization works. I believe I understand it, but am not 100% sure.
This is a good opportunity to push for idiomatic access.
Feb 28 2020
Feb 27 2020
Feb 26 2020
Review feedback.
Upload the correct diff
Sometimes, if you keep re-arranging pieces, they shuffle into a more compact representation.
Feb 25 2020
Fully handle sleep fail/nowait cases. Simplify a few lockless functions by
introducing another helper. Handle the object NULL assignment in a single
place with a compiler barrier.
I forgot to submit these comments