Scheduler and Mutex micro optimizations
Needs ReviewPublic
Actions

Authored by lex.studio.base_gmail.com on Feb 19 2026, 9:08 PM.

Details

Reviewers

kib
mav
olce

Summary

Scheduler and Mutex micro optimizations

This optimizations where made and tested with GENERIC configuration.

Test 1: 16 Threads

Metric	Optimized	Standard	Difference
Total Events	850074	811952	+38122 events
Avg Latency (ms)	2.26	2.36	-0.10
95th Percentile (ms)	16.21	17.32	-1.11
Events (avg/stddev)	53129.6250/620.02	50747.0000/584.61	Better throughput

Test 2: 32 Threads

Metric	Optimized	Standard	Difference
Total Events	506721	500707	+6014 events
Avg Latency (ms)	7.58	7.67	-0.09
95th Percentile (ms)	62.19	63.32	-1.13
Events (avg/stddev)	15835.0312/768.70	15647.0938/714.52	Better throughput

Test 3: 64 Threads

Metric	Optimized	Standard	Difference
Total Events	595346	583812	+11534 events
Avg Latency (ms)	12.88	13.16	-0.28
95th Percentile (ms)	102.97	106.75	-3.78
Events (avg/stddev)	9317.9062/408.55	9122.0625/342.80	Better distribution

*the tests were made using sysbench threads --threads=N --time=120 run. The system was Virtual Box with Ryzen 7 4800H (16) and VirtualBox had access to all of the threads.
*FreeBSD 16.0

For more informations (or if i need to post another patch or something) email me at lex.studio.base@gmail.com

Diff Detail

Repository

rG FreeBSD src repository

Lint

Lint Skipped

Unit

Tests Skipped

Event Timeline

lex.studio.base_gmail.com created this revision.Feb 19 2026, 9:08 PM

lex.studio.base_gmail.com created this object with edit policy "Administrators".

Herald added subscribers: olce, imp. · View Herald TranscriptFeb 19 2026, 9:08 PM

lex.studio.base_gmail.com requested review of this revision.Feb 19 2026, 9:08 PM

lex.studio.base_gmail.com awarded a token.Feb 19 2026, 9:13 PM

lex.studio.base_gmail.com rescinded a token.

lex.studio.base_gmail.com added reviewers: kib, mav, olce.Feb 19 2026, 9:19 PM

Could you please generate diff with full context, as described here and without separately attached patch? See https://wiki.freebsd.org/Phabricator.

Edited the diff , @mav

pi changed the edit policy from "Administrators" to "committers (Project)".Feb 21 2026, 11:45 AM

Being out-of-time (and off for the next week), I only reviewed changes to ULE, and the first one in this revision. But I'll also take a look at the rest when coming back.

General comments:

The inlining proposals I've reviewed in general look good, but I'm really unsure about their actual effect in the results you obtain.
Prefetching can be good, but it has to be done sufficiently early to be useful. There are several of occurrences here where it is useless for this reason. Even for prefetching changes I tagged with 'Fine.' (because they conceptually are), I'd like to know what is their practical effect in the results you're seeing.
__predict_false() and __predict_true() should not be used unless it's absolutely certain that almost all evaluations will be false or true and using them can be proved to have a positive practical effect. Branch predictors today (and that has been the case for more than a decade) are good enough to the point that any hint that does not meet these conditions is likely to be an overall pessimization.
There are gratuitous changes, such as comments removal (please never do that unless you have a good reason, and in some occurrences I reviewed, I don't see any good reason), or using the preprocessor where compilers have long been smart enough to do constant propagation and eliminate impossible branches, or changing the style again without good reason. Please refrain from doing that.
Benchmarking in a virtual machine can give you hints, but really is to be taken with a grain of salt, and final benchmarking should happen on a real machine.

It would be great if you can benchmark each change separately, or at least changes to each file separately, and for sched_ule.c, with additional runs where changes that I flagged as useless or likely to have no practical effect are removed and the others present or not.

sys/kern/kern_mutex.c
612–623	I think this reasoning and code is wrong. First, the prefetch does not happen ahead of the actual use, so should not have any positive effect (actually, it could be a slight pessimization). This is also inconsistent with not prefetching when `owner` is changed, which anyway is the same case (and there's no need for prefetching). Second, you try to save one `lv_mtx_owner()`, which is only a single bitwise-and, at the expense of a branch prediction, which is much more costly even assuming the annotation `__predict_false()` is right most of the time (and it may be not for heavily contended mutexes). All in all, I only see pessimizations here.
sys/kern/sched_ule.c
615	This one looks reasonable.
640	Reasonable too.
697	Feels reasonable too. clang sometimes makes inlining decisions that can be weird. Inlining this function may (didn't actually check) have the "drawback" that `tdq_choose()` will stop being inlined in `tdq_setlowpri()`, but I think there's indeed more potential synergies from inlining the `tdq_setowpri()` code in callers.
792–811	Fine.
884–897	Fine.
1117–1124	Please do not remove comments unless you have a good reason to (I see none here).
1317–1323	Fine.
1775–1787	We are not in the business of only optimizing for CPU-heavy threads. Most threads on a machine are quiescent and run infrequently, and you'll be penalizing them. Anyway, I doubt this has any practical effect. See also general comment about `__predict_true()` and `__predict_false()`.
1840–1844	Same here.
1906–1907	This one seems to make sense.
1929–1930	Gratuitous change. Compilers have been smart enough to do that for a very long time.
1934–1947	This change is OK, but not for the justification you're giving, which is most probably wrong. Indeed, inverting `` and `/` should not change anything in terms of assembly instructions used, as these are unsigned integers, and the compiler will use a shift instead of a multiply by 4 in both cases. IMO, the actual reason to do this change is that, given the ` 4` does not overflow, accuracy of the result will be better.
2038–2063	Yes, this function was designed exactly for that (nothing changes if `lu_span` is 0). It's a rare case, except if a thread is constantly doing a syscall that needs waiting in the kernel (I/O mostly), in which case really that should not matter in relative real thread time. So I don't expect this to have a visible practical effect.
2529–2534	Useless prefetch (too close to use).
2641	Most probably wrong prediction. See also general comment.
2884–2890	`sched_pctcpu_update()` must be call every tick, given the current decaying mechanism is sensitive to the timing of the calls. So this change is not acceptable.
3121–3127	Gratuitous. This slightly obfuscates the initial intent with no practical difference in produced assembly.